Column Selection
For optimal results, you should ensure that any table columns you select for Kumo ingestion meet the following criteria:
- Clean: be sure to remove fake/synthetic data, predictions from other ML models, data whose column definition has changed repeatedly over time (especially if a given attribute ID may refer to different things at different times), and data that is otherwise known to be unreliable or frequently inaccurate.
- Relevant and Mutually Exclusive: compute cost grows with graph size (i.e., the sum across the tables in a graph). To keep training costs down, remove columns that duplicate or closely overlap with other columns, columns that carry irrelevant information, and other extraneous data.
- Complete: the column should cover the full history across the timeframe in question (e.g., the whole record of purchases/interactions, rather than only a user's first/last purchase or a subscriber's most recent interaction). If this results in an oversized data set, you can provide Kumo with a compressed version that summarizes changes in aggregate metrics over time (e.g., per day/week/month).
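As an illustration of the compressed version mentioned under "Complete", the following is a minimal sketch of collapsing a raw purchase log into one row per user per day before ingestion. It is not part of Kumo's API; the function name `compress_daily`, the `(timestamp, user_id, amount)` row layout, and the output column names are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime


def compress_daily(events):
    """Collapse (iso_timestamp, user_id, amount) rows into per-user daily aggregates.

    Hypothetical helper: the row layout and output keys are illustrative only.
    """
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for ts, user_id, amount in events:
        day = datetime.fromisoformat(ts).date()  # truncate timestamp to its calendar day
        bucket = totals[(user_id, day)]
        bucket["count"] += 1
        bucket["amount"] += amount
    # Emit rows sorted by user, then day, so the output is deterministic.
    return [
        {
            "user_id": user_id,
            "day": day.isoformat(),
            "purchase_count": agg["count"],
            "total_amount": agg["amount"],
        }
        for (user_id, day), agg in sorted(totals.items())
    ]


# Made-up sample data: four raw events become three daily rows.
events = [
    ("2024-01-01T09:00:00", "u1", 10.0),
    ("2024-01-01T17:30:00", "u1", 5.0),
    ("2024-01-02T08:15:00", "u1", 7.5),
    ("2024-01-01T12:00:00", "u2", 20.0),
]
daily = compress_daily(events)
```

The same pattern works for weekly or monthly buckets by changing how the timestamp is truncated. The aggregation keeps the full timeframe covered while reducing the row count.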
Using the wrong or unnecessary columns can lead both to degraded model performance (due to noisy pQuery inputs or, worse, data leakage) and to increased pQuery training costs.
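One way to screen for unnecessary columns before ingestion is a quick redundancy check. The sketch below is a rough heuristic rather than a Kumo feature: the function name `find_redundant_columns` and the dict-of-lists table layout are assumptions. It flags columns whose values exactly duplicate an earlier column, and, as an assumed proxy for "irrelevant", columns that hold a single constant value. It will not catch near-duplicates or data leakage, which still need manual review.

```python
def find_redundant_columns(table):
    """Flag exact-duplicate and constant columns.

    `table` is a dict mapping column name -> list of hashable values
    (hypothetical layout for illustration; all lists have equal length).
    Returns (duplicates, constant), where `duplicates` maps each redundant
    column to the earlier column it copies.
    """
    first_seen = {}   # tuple of column values -> first column name with those values
    duplicates = {}
    constant = []
    for name, values in table.items():
        if len(set(values)) <= 1:
            # A single repeated value carries no signal (assumed heuristic).
            constant.append(name)
            continue
        key = tuple(values)
        if key in first_seen:
            duplicates[name] = first_seen[key]
        else:
            first_seen[key] = name
    return duplicates, constant


# Made-up sample table.
table = {
    "price_usd": [1, 2, 3],
    "price_copy": [1, 2, 3],
    "country": ["US", "US", "US"],
    "quantity": [1, 1, 2],
}
duplicates, constant = find_redundant_columns(table)
```

Columns flagged here are candidates for removal, not automatic drops; a constant column in a sample may still vary in the full data set.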