How can I start with a smaller graph and/or a downsampled data set?
In some cases, you may not want to use the entire dataset in your source tables, because iterating over all of it can be resource-intensive and time-consuming. For example, you may want to start building your predictions quickly to verify that they work as expected.
The following strategies can help you in this regard:
Reducing the Time Range of Training Data
By default, Kumo will use all data in your graph for training. So, if you have 10 years of event logs, Kumo will use all 10 years of data. This increases computation cost at training time and is often unnecessary during early model development. For example, if there are enough training examples in the past 2 years of data, it will be faster to develop the model on 2 years of data rather than 10.
In order to reduce the range of training data, you can use the following options:
- train_start_offset: controls the number of days of training data to generate.
- TimeRangeSplit: lets you control the exact time range of the data in the training, validation, and holdout sets.
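To make the idea of restricting training data to explicit time ranges more concrete, here is a small conceptual sketch in plain Python. It is not Kumo's API or configuration syntax: the function `split_by_time_range`, the record layout, and the cutoff dates are all hypothetical and exist only to illustrate the concept. Events that fall outside every range are dropped, which is how older history ends up excluded from training.

```python
from datetime import date


def split_by_time_range(events, ranges):
    """Assign each event to the first range whose [start, end) window contains it.

    Hypothetical helper for illustration only; events outside every
    range are dropped.
    """
    splits = {name: [] for name in ranges}
    for event in events:
        for name, (start, end) in ranges.items():
            if start <= event["ts"] < end:
                splits[name].append(event)
                break
    return splits


# Hypothetical event log spanning several years.
events = [
    {"id": 1, "ts": date(2015, 6, 1)},   # older than every range, dropped
    {"id": 2, "ts": date(2022, 3, 15)},
    {"id": 3, "ts": date(2023, 1, 10)},
    {"id": 4, "ts": date(2023, 8, 20)},
    {"id": 5, "ts": date(2023, 11, 5)},
]

# Hypothetical date ranges: start is inclusive, end is exclusive.
ranges = {
    "train": (date(2022, 1, 1), date(2023, 7, 1)),
    "validation": (date(2023, 7, 1), date(2023, 10, 1)),
    "holdout": (date(2023, 10, 1), date(2024, 1, 1)),
}

splits = split_by_time_range(events, ranges)
train_ids = [e["id"] for e in splits["train"]]
validation_ids = [e["id"] for e in splits["validation"]]
holdout_ids = [e["id"] for e in splits["holdout"]]
```

In this sketch, the 2015 event is never seen by the model, while the more recent events are divided chronologically so that validation and holdout data always come after the training data.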
Downsampling Your Input Data
If you have particularly large input data, you may want to downsample it further, even before connecting it to Kumo. For example, if you are using Snowflake as a data source, you can create a Snowflake view for each of your input tables that filters it down to a specific time range (e.g., 1 year of data) or keeps only the data for a subset of users.
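As a rough sketch of the time-range approach, the Python helper below builds the SQL text for such a view. The helper `build_downsampled_view_sql` and the table, view, and column names (`TRANSACTIONS`, `TRANSACTIONS_1Y`, `EVENT_TIME`) are hypothetical; substitute your own. It only generates the statement, which you would then run in Snowflake yourself.

```python
def build_downsampled_view_sql(source_table, view_name, time_column, years=1):
    """Return a CREATE VIEW statement that keeps only the last `years` years of rows.

    Identifiers are interpolated directly into the SQL text, so pass only
    trusted, hard-coded names (never user input).
    """
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT *\n"
        f"FROM {source_table}\n"
        f"WHERE {time_column} >= DATEADD(year, -{years}, CURRENT_DATE())"
    )


# Hypothetical table, view, and column names.
sql = build_downsampled_view_sql("TRANSACTIONS", "TRANSACTIONS_1Y", "EVENT_TIME")
print(sql)
```

You would create one such view per input table and then connect the views, rather than the full tables, as your data source.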
Note that when downsampling your data, you must use a principled sampling approach. For example, you cannot randomly select 10% of the rows from each of your tables, as this will lead to many "incomplete" entries in your data, such as users with missing transaction histories. It is recommended to downsample by the entity of your predictive query, such as "all rows in all tables where user_id % 10 == 1".