
Column Selection

For optimal results, you should ensure that any table columns you select for Kumo ingestion meet the following criteria:

  • Clean: be sure to remove fake/synthetic data, predictions from other ML models, data for which the column definition has changed over time (especially if a particular attribute ID may point to different things at different times), and data that is known to be otherwise unreliable or frequently inaccurate.
  • Relevant and Mutually Exclusive: compute cost grows with graph size (i.e., the combined size of all tables in the graph). To keep training costs down, remove columns that duplicate or closely overlap information in other columns, along with irrelevant or otherwise extraneous data.
  • Complete: the column should cover the full history across the timeframe in question (e.g., the whole record of purchases/interactions, rather than only a user's first/last purchase or a subscriber's most recent interaction). If this results in an oversized data set, you can provide Kumo with a compressed version that tracks how aggregate metrics change over time (e.g., per day/week/month).
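
As an illustration of the compressed form mentioned under "Complete", the sketch below collapses per-event purchase rows into one row per user per day before ingestion. It is plain Python rather than a Kumo API, and the field names (`user_id`, `timestamp`, `amount`) and the helper `compress_daily` are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from datetime import datetime


def compress_daily(events):
    """Collapse per-event rows into one row per (user_id, day).

    Hypothetical helper: each event is a dict with 'user_id',
    an ISO 8601 'timestamp', and a numeric 'amount'.
    """
    totals = defaultdict(lambda: {"purchase_count": 0, "total_amount": 0.0})
    for event in events:
        # Truncate the timestamp to its calendar day.
        day = datetime.fromisoformat(event["timestamp"]).date().isoformat()
        bucket = totals[(event["user_id"], day)]
        bucket["purchase_count"] += 1
        bucket["total_amount"] += event["amount"]
    # Emit rows ordered by user, then day, so the output is deterministic.
    return [
        {"user_id": user_id, "day": day, **metrics}
        for (user_id, day), metrics in sorted(totals.items())
    ]


# Made-up sample data for illustration only.
events = [
    {"user_id": "u1", "timestamp": "2024-01-01T09:00:00", "amount": 10.0},
    {"user_id": "u1", "timestamp": "2024-01-01T17:30:00", "amount": 5.0},
    {"user_id": "u1", "timestamp": "2024-01-02T08:15:00", "amount": 7.5},
    {"user_id": "u2", "timestamp": "2024-01-01T12:00:00", "amount": 20.0},
]
daily = compress_daily(events)
```

Here the four events shrink to three daily rows while the full timeframe is still covered. Switching to weekly or monthly buckets would only change how the timestamp is truncated.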

🚧

Selecting wrong or unnecessary columns can both degrade model performance (because of noisy pQuery inputs or, worse, data leakage) and increase pQuery training costs.
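
One rough way to spot unnecessary columns before ingestion is to flag those that carry no information (a single constant value) or that exactly duplicate another column. The sketch below is plain Python, not part of Kumo; the helper `find_redundant_columns` and the sample column names are hypothetical. It catches only exact redundancy, so near-duplicates and data leakage still need manual review.

```python
def find_redundant_columns(rows):
    """Return (constant, duplicates) for a list of dict rows with identical keys.

    constant:   columns whose value never varies across rows.
    duplicates: mapping of column -> earlier column with identical values.
    Assumes all cell values are hashable.
    """
    columns = list(rows[0].keys())
    values = {col: tuple(row[col] for row in rows) for col in columns}

    constant = [col for col in columns if len(set(values[col])) <= 1]

    duplicates = {}
    first_seen = {}  # value tuple -> first column that produced it
    for col in columns:
        if col in constant:
            continue
        if values[col] in first_seen:
            duplicates[col] = first_seen[values[col]]
        else:
            first_seen[values[col]] = col
    return constant, duplicates


# Made-up sample table for illustration only.
rows = [
    {"user_id": 1, "country": "US", "country_code": "US", "source": "web"},
    {"user_id": 2, "country": "DE", "country_code": "DE", "source": "web"},
    {"user_id": 3, "country": "US", "country_code": "US", "source": "web"},
]
constant, duplicates = find_redundant_columns(rows)
```

In this sample, `source` is flagged as constant and `country_code` as a duplicate of `country`; both would be candidates for removal.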