Static vs. Temporal Predictive Queries

Predictive Queries are mainly used to generate a training table that gets attached to an underlying graph. Static queries, which do not make predictions over a window of time, do not require a time column in the target table(s); in these cases, Kumo by default generates training/validation/holdout splits by randomly distributing the rows in an 80/10/10 ratio. Although a time column is not required, it is allowed, and you may also distribute the rows according to a specific time range.
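To make the default static behavior concrete, here is a minimal sketch of a random 80/10/10 split. It is an illustration only, not Kumo's implementation; the function name `random_split`, its parameters, and the fixed seed are all assumptions made for this example.

```python
import random

def random_split(rows, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle rows and cut them into training/validation/holdout lists."""
    shuffled = list(rows)
    # A seeded generator keeps the (hypothetical) split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    holdout = shuffled[n_train + n_val:]  # whatever remains after the first two cuts
    return train, val, holdout

train, val, holdout = random_split(range(100))
print(len(train), len(val), len(holdout))  # 80 10 10
```

Because the rows are shuffled before cutting, no time column is needed, which is why this approach only suits static queries.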

In contrast, temporal queries that predict an aggregation of values over time (e.g., "purchases each customer will make over the next 7 days") are more complicated: data splits need to be non-overlapping, properly ordered in time, and well-balanced in size, or else data leakage may occur and invalidate the prediction results. Fortunately, Kumo handles all of this for you automatically, dividing the data into training/validation/holdout splits based on the time column in your target table.

📘
For more information about how to specify the Split you would like to use, you can refer to the documentation here

Take a look at the following Predictive Query for predicting events over a 30-day window, specifically which customers will make no purchases over the next 30 days:

PREDICT COUNT(TRANSACTIONS.*, 0, 30, days) = 0
FOR EACH CUSTOMERS.CUSTOMER_ID
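As a rough illustration of what the target of this query means for one customer at one point in time, the sketch below computes the "no purchases in the next 30 days" label from a list of purchase timestamps. The helper `no_purchase_label` and the sample data are hypothetical, and the choice to exclude the window start and include the window end is an assumption of this sketch rather than a statement about how Kumo evaluates the window.

```python
from datetime import datetime, timedelta

def no_purchase_label(purchase_times, anchor, window_days=30):
    """Return True when no purchase falls in (anchor, anchor + window_days]."""
    window_end = anchor + timedelta(days=window_days)
    # Assumed boundary convention: start exclusive, end inclusive.
    count = sum(1 for t in purchase_times if anchor < t <= window_end)
    return count == 0

# Hypothetical purchase history for a single customer.
purchases = [datetime(2020, 1, 5), datetime(2020, 3, 20)]

# The Jan 5 purchase falls inside the window that starts on Jan 1.
print(no_purchase_label(purchases, datetime(2020, 1, 1)))  # False
# The window from Feb 1 to Mar 2 contains no purchase.
print(no_purchase_label(purchases, datetime(2020, 2, 1)))  # True
```

The same customer yields different labels at different anchor times, which is why a temporal query can produce many training examples per entity.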

To generate each training example, Kumo travels back in time and "replays" the behavior of each user at specific times in the past, sampled at the appropriate rate. It then automatically determines the proper sampling and training split methodology based on your dataset and Predictive Query.

Kumo analyzes your Predictive Query and data to choose the sample rates and splits used for generating training examples. For temporal queries, Kumo ensures that the holdout split is strictly later in time than the training split, and that the training splits are well-balanced in size. This gives any Predictive Query a sound starting point and avoids the errors that come with setting up training splits manually.
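The ordering guarantee described above can be sketched with two cutoff dates. This is a simplified illustration, not Kumo's actual procedure: the function `temporal_split`, the cutoff dates, and the monthly examples are all assumptions, and the sketch does not try to balance split sizes.

```python
from datetime import date

def temporal_split(examples, val_start, holdout_start):
    """Partition (anchor_date, label) pairs by two cutoff dates."""
    train = [e for e in examples if e[0] < val_start]
    val = [e for e in examples if val_start <= e[0] < holdout_start]
    holdout = [e for e in examples if e[0] >= holdout_start]
    return train, val, holdout

# Hypothetical examples: one per month of 2020, with placeholder labels.
examples = [(date(2020, m, 1), m % 2 == 0) for m in range(1, 13)]
train, val, holdout = temporal_split(examples, date(2020, 9, 1), date(2020, 11, 1))

# Every holdout example is strictly later than every training example.
assert max(e[0] for e in train) < min(e[0] for e in holdout)
print(len(train), len(val), len(holdout))  # 8 2 2
```

Note that this sketch only compares anchor dates. A label window that starts just before a cutoff still reaches past it, so a fuller version would also need to account for the window length when placing the cutoffs.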

Consider the following Predictive Query for predicting the total amount each customer will spend in the next 30 days:

PREDICT SUM(TRANSACTIONS.PRICE, 0, 30, days)
FOR EACH CUSTOMERS.CUSTOMER_ID

For this Predictive Query, Kumo will generate a set of training/validation/holdout splits that cover the time range of the transactions in the target table. For example, if your target table has a time range of September 20, 2018 to September 22, 2020, training examples will be generated by computing the 30-day spend of users at various points in time during this range. Kumo will then automatically determine the proper sampling and training split methodology accordingly.
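The idea of computing 30-day spend at various points in time can be sketched as follows. This is an illustration under stated assumptions, not Kumo's sampling logic: the function `spend_examples`, the fixed 30-day step between anchor dates, the window boundary convention, and the sample transactions are all hypothetical.

```python
from datetime import date, timedelta

def spend_examples(transactions, start, end, window_days=30, step_days=30):
    """Build (customer_id, anchor, spend) rows for each anchor date.

    transactions: list of (customer_id, purchase_date, price) tuples.
    """
    window = timedelta(days=window_days)
    customers = sorted({cid for cid, _, _ in transactions})
    rows = []
    anchor = start
    # Only keep anchors whose full label window fits inside the data range.
    while anchor + window <= end:
        for cid in customers:
            spend = sum(
                price
                for c, t, price in transactions
                if c == cid and anchor < t <= anchor + window
            )
            rows.append((cid, anchor, spend))
        anchor += timedelta(days=step_days)
    return rows

# Hypothetical transactions for two customers.
transactions = [
    ("c1", date(2018, 9, 25), 20.0),
    ("c1", date(2018, 10, 30), 5.0),
    ("c2", date(2018, 10, 1), 12.5),
]
rows = spend_examples(transactions, date(2018, 9, 20), date(2020, 9, 22))
print(len(rows))
print(rows[:4])
```

Each row pairs a customer with an anchor date and the spend observed in the window after it, so customers with no purchases in a window still contribute an example with a spend of zero.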
