Which datasets should I use for my predictive query?
Best practices for selecting the right datasets to use in a predictive query.
While Kumo eliminates the need to perform traditional feature engineering, it is still important to select the right data for your model. In short: bad data = bad model.
Kumo uses Graph Neural Networks (GNNs) under the hood. While these neural network types usually yield better performance than regression and decision trees on relational data, GNNs need to be given the right data to produce good results.
The following are some tips and tricks for identifying the proper datasets for your new predictive query.
Data Quality Checklist
- There is a table that describes the label of your predictive task, which can be used to create positive and negative training examples.
- There is enough historical data to learn seasonal patterns: typically 1-2 years, but sometimes more.
- The data will fit in the Kumo recommended data size limits.
- Any PII or sensitive data can be hashed, obfuscated, or removed at your discretion.
- There is no target leakage, such as columns that are mutated/updated over time.
- The dataset contains all of the signals that you expect to be predictive from a business perspective.
- The tables can all be linked together with primary/foreign keys.
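The last checklist item can be spot-checked before loading any data. The following is a minimal sketch in plain Python, not part of Kumo: the helper `find_orphaned_keys` and the `users` and `transactions` tables are hypothetical and exist only for illustration. It reports foreign-key values in a child table that have no matching primary key in the parent table.

```python
def find_orphaned_keys(child_rows, fk_column, parent_rows, pk_column):
    """Return sorted foreign-key values in child_rows with no match in parent_rows."""
    parent_keys = {row[pk_column] for row in parent_rows}
    child_keys = {row[fk_column] for row in child_rows}
    return sorted(child_keys - parent_keys)


# Hypothetical example tables, represented as lists of dicts.
users = [{"user_id": 1}, {"user_id": 2}]
transactions = [
    {"transaction_id": 10, "user_id": 1},
    {"transaction_id": 11, "user_id": 2},
    {"transaction_id": 12, "user_id": 3},  # refers to a user that does not exist
]

orphans = find_orphaned_keys(transactions, "user_id", users, "user_id")
print(orphans)
```

An empty result means every row in the child table can be linked to a row in the parent table. A non-empty result points to rows worth investigating before you build the graph.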
Use The Raw Data
Whenever possible, avoid performing any aggregation or feature engineering on the data before passing it to Kumo. Instead, share the raw dimension and fact tables directly as they exist in your database. For example, rather than sharing a pre-aggregated feature such as "number of logins in the past 30 days", share a table containing the list of timestamped login attempts. The reason is that Kumo's time-aware GNN learns from patterns, sequences, and relationships in the data. When you aggregate the data, you drop information that the model needs. In our experience, using aggregated features can cause Kumo to perform 5-10% worse than an equivalent gradient-boosted decision tree (GBDT) model, while giving Kumo the raw data will make it perform better.
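To see what aggregation discards, consider the login example above. This is a small illustrative sketch in plain Python with made-up data: the function `logins_in_past_days` and both users are hypothetical. The two users have the same 30-day login count, yet their raw timestamps show very different behavior.

```python
from datetime import datetime, timedelta


def logins_in_past_days(login_times, as_of, days=30):
    """Count logins in the window (as_of - days, as_of]."""
    window_start = as_of - timedelta(days=days)
    return sum(1 for t in login_times if window_start < t <= as_of)


as_of = datetime(2024, 3, 31)

# Hypothetical user who logs in steadily, once every 5 days.
steady_user = [as_of - timedelta(days=d) for d in (25, 20, 15, 10, 5, 0)]
# Hypothetical user who logged in six days in a row, then went quiet.
lapsing_user = [as_of - timedelta(days=d) for d in (29, 28, 27, 26, 25, 24)]

# The pre-aggregated feature is identical for both users.
steady_count = logins_in_past_days(steady_user, as_of)
lapsing_count = logins_in_past_days(lapsing_user, as_of)

# The raw timestamps still carry the difference, e.g. days since last login.
days_since_last_steady = (as_of - max(steady_user)).days
days_since_last_lapsing = (as_of - max(lapsing_user)).days

print(steady_count, lapsing_count)
print(days_since_last_steady, days_since_last_lapsing)
```

Recency is only one of the signals lost here. The spacing and ordering of events are also gone once the table is reduced to a single count, which is why sharing the timestamped table is preferable.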
Think About The Business
Kumo's models can capture a lot of the "common sense" signals that a human would typically use to make decisions. For example, if you are trying to predict the probability that a customer will churn, try to think of the actions/events/behaviors that might lead to churning, and make sure that these events are somehow captured in the dataset.
Leverage Relationships
The relationships between tables (e.g., foreign keys) are among the most important signals for the Kumo model. Try to import as much relational data as possible. For example, if you are importing a table representing a user's past purchase history, try to link each transaction to an item, and link that item to its seller, its brand, and its location in your product taxonomy. The Kumo GNN will automatically figure out which relationships are important and which ones are not, so don't worry about adding too much data.
Avoid Downsampling
Avoid downsampling your tables prior to loading them into Kumo. Kumo's model planner gives you fine-grained control over the training table and graph neighborhood sampling algorithms, so there is no need to perform sampling outside of Kumo. In cases where you must downsample before loading data into Kumo (due to either data size or regulatory constraints), please chat with your point of contact at Kumo. While it is fairly straightforward to ingest pre-sampled data in Kumo, sampling can introduce problems if not done properly across all tables in the graph.
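As an illustration of why sampling must be coordinated across tables, the sketch below samples a parent table and then keeps only the child rows that reference the sampled keys, so no foreign key is left dangling. This is an assumption about one reasonable approach, not Kumo's recommended procedure. The function `sample_by_parent_key` and the tables are hypothetical, and your Kumo point of contact should still review any real sampling plan.

```python
import random


def sample_by_parent_key(parent_rows, child_rows, key, fraction, seed=0):
    """Sample parent rows by key, then keep only child rows that reference them."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    parent_keys = sorted({row[key] for row in parent_rows})
    k = max(1, round(len(parent_keys) * fraction))
    kept = set(rng.sample(parent_keys, k))
    sampled_parents = [row for row in parent_rows if row[key] in kept]
    sampled_children = [row for row in child_rows if row[key] in kept]
    return sampled_parents, sampled_children


# Hypothetical tables: 100 users with 5 transactions each.
users = [{"user_id": i} for i in range(100)]
transactions = [{"transaction_id": n, "user_id": n % 100} for n in range(500)]

sampled_users, sampled_transactions = sample_by_parent_key(
    users, transactions, "user_id", fraction=0.1
)

kept_ids = {row["user_id"] for row in sampled_users}
print(len(sampled_users), len(sampled_transactions))
```

By contrast, sampling `users` and `transactions` independently would leave many transactions pointing at users that are no longer in the data, and would thin out each remaining user's history.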
Use Rich Signals
Unlike classical machine learning models, deep learning models like GNNs excel at leveraging rich, content-based signals such as text and images. Try to include rich, text-based signals in your data (e.g., landing pages and descriptions). For an extra performance boost, you can also load image signals into Kumo in the form of embeddings.
Once you have selected which tables to use, you should ensure that they conform to Kumo's table schema guidelines. These guidelines are designed to help avoid common ML mistakes that could hurt model performance, such as data leakage.