
How can I make my training jobs run faster?

When developing a new model on Kumo, you will often want to experiment with many different approaches. For example:

  • Adding new columns/tables to see if they are important
  • Changing the predictive query formulation or graph structure
  • Tuning the model training process via the Model Planner

Training a model in BEST mode can take upwards of eight hours of GPU time, which can mean a lot of waiting. To increase iteration speed during development, we recommend a few simple tactics to trim down the data size. Once you have settled on the right predictive query, you can crank the settings up to the maximum and let AutoML do its thing.

Start with fewer tables and columns

While it can be fun to throw all of the data at the model at once, it does come at a cost. So, we recommend "starting small" and improving incrementally.

Try creating your first model using the bare minimum number of tables (for example, just the entity and target table). Once you train your predictive query on this minimal dataset, you can check that everything is working as you intended. For example:

  • Does the label distribution over time match your expectations? (One way to spot-check this is sketched after this list.)
  • Is the training data too heavily skewed?
  • Is there any obvious target leakage?
  • Are the train/val/test splits aligned with your expectations?
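
If you want to spot-check the first two points before training, a quick query against your data warehouse can help. The sketch below is illustrative only and assumes a transactions target table with user_id and transaction_time columns; adjust the names to your schema.

```sql
-- Illustrative sanity check (table and column names are assumptions):
-- monthly event counts make it easy to eyeball the label distribution
-- over time and spot heavy skew or suspicious gaps before training.
SELECT
    DATE_TRUNC('month', transaction_time) AS month,
    COUNT(*)                              AS num_events,
    COUNT(DISTINCT user_id)               AS num_active_users
FROM transactions
GROUP BY 1
ORDER BY 1;
```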

Once you have confirmed that the training process is working as you expect, you can incrementally add tables one by one, and confirm that you see improvements in model performance at each step along the way.

Reduce the time range of training data

By default, Kumo will use all data in your graph for training. So, if you have ten years of event logs, Kumo will use all ten years of data. This will increase computation cost at training time, and is often not necessary during early model development. For example, if there are enough training examples in the past two years of data, then it will be faster to develop the model on two years of data, rather than ten.
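
If you are unsure how far back you need to go, a quick count by year can tell you whether a shorter window already contains enough training examples. This sketch assumes a transactions table with a transaction_time column; substitute your own names.

```sql
-- Illustrative check (names assumed): event counts per year.
-- If the most recent years already contain plenty of examples,
-- developing on that shorter window is usually much faster.
SELECT
    DATE_TRUNC('year', transaction_time) AS year,
    COUNT(*)                             AS num_events
FROM transactions
GROUP BY 1
ORDER BY 1;
```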

To reduce the range of data used for training, you can adjust the corresponding time range option in the Model Planner.

Downsample your input data

If you have particularly large input data, you may want to downsample it further, even before connecting it to Kumo. For example, if you are using Snowflake as a data source, you can create a Snowflake view for each of your input tables, filtering it down to a specific time range (e.g., one year of data) or randomly sampling a subset of users.
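
For example, a time-filtered view might look like the following sketch; the table and column names are illustrative, and you would point Kumo at the view instead of the raw table during development.

```sql
-- Hypothetical example: keep only the most recent year of events.
CREATE OR REPLACE VIEW transactions_recent AS
SELECT *
FROM transactions
WHERE transaction_time >= DATEADD(year, -1, CURRENT_DATE);
```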

Note that when downsampling your data, you must use a principled sampling approach. For example, you cannot randomly select 10% of rows from all of your tables, as this will lead to a lot of "incomplete" entities in your data, such as users that are missing transaction histories. Instead, downsample by the entity of your predictive query, such as "all rows in all tables where user_id % 10 == 1".
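
One way to implement this in Snowflake is to filter every table by the same entity bucket, so that sampled users keep their complete histories. The names below are illustrative, and the modulo trick assumes a numeric user_id (hash string IDs first).

```sql
-- Hypothetical entity-based 10% sample: the same user bucket is used everywhere,
-- so every sampled user retains their full transaction history.
CREATE OR REPLACE VIEW users_sampled AS
SELECT *
FROM users
WHERE MOD(user_id, 10) = 1;

CREATE OR REPLACE VIEW transactions_sampled AS
SELECT t.*
FROM transactions AS t
JOIN users_sampled AS u
  ON t.user_id = u.user_id;
```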