split

Description

Kumo generates a training table of entities and labels from your predictive query for use in GNN training. By default, a split method divides the generated training table into three disjoint sets (train, validation, and test) for training the predictive query.

When the predictive query is temporal node prediction, the default split method is ApproxDateOffsetSplit([0.8, 0.1, 0.1]).
When the predictive query is temporal link prediction, the default split method is DateOffsetSplit([-target_aggregation_end_date, 0]).
When the predictive query is static and the entity or the target table has a time column, the default split method is TemporalSplit([0.8, 0.1, 0.1]).
When the predictive query is static and neither the entity nor the target table has a time column, the default split method is RandomSplit([0.8, 0.1, 0.1]).
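
The default-selection rules above can be summarized as a small decision function. This is only an illustrative sketch: `default_split` and its parameter names are hypothetical and not part of Kumo's API, and the function simply returns the split expression as a string.

```python
def default_split(is_temporal, is_link_prediction=False,
                  has_time_column=False, target_aggregation_end_date=None):
    """Return the default split expression described in the rules above.

    Hypothetical helper for illustration only; Kumo picks the default
    internally and does not expose a function with this name.
    """
    if is_temporal:
        if is_link_prediction:
            # Temporal link prediction: offsets relative to the aggregation end.
            return f"DateOffsetSplit([-{target_aggregation_end_date}, 0])"
        # Temporal node prediction.
        return "ApproxDateOffsetSplit([0.8, 0.1, 0.1])"
    if has_time_column:
        # Static query where the entity or target table has a time column.
        return "TemporalSplit([0.8, 0.1, 0.1])"
    # Static query without any time column.
    return "RandomSplit([0.8, 0.1, 0.1])"
```

For instance, `default_split(is_temporal=False, has_time_column=True)` yields the `TemporalSplit` default, matching the third rule.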

You can specify a custom split method and its parameters to generate the training, validation, and test sets by setting the split value in the Model Planner input field.

Supported Task Types

Supported Query Types

ApproxDateOffsetSplit: Supports temporal queries only

DateOffsetSplit: Supports temporal queries only

RandomSplit: Supports all query types; more suitable for static queries

TemporalSplit: Supports all query types; more suitable for static queries

TimeRangeSplit: Supports all query types

Methods

Method	Purpose	Details
RandomSplit([train_ratio, val_ratio, test_ratio])	Defines a random split of the training table according to `train_ratio`, `val_ratio` and `test_ratio`.	Kumo first shuffles all the training data, then selects the first `int(len(train_table) * train_ratio)` rows as train data. From the rows that remain, it selects the first `int(len(train_table) * val_ratio)` rows as val data. From the rows that remain after removing both train and val data, it selects the first `int(len(train_table) * test_ratio)` rows as test data.
DateOffsetSplit([val_offset, test_offset], unit)	Defines a date offset split of the training table according to the (relative) offsets from the max date in the training table.	Training data with an anchor date greater than or equal to `max_date - target_aggregation_end_date + test_offset` and a prediction horizon end date less than or equal to `max_date` becomes test data. Training data with an anchor date greater than or equal to `max_date - target_aggregation_end_date + val_offset` and a prediction horizon end date less than or equal to `max_date - target_aggregation_end_date + test_offset` becomes val data, and the rest becomes train data. `unit` defines the time unit of `val_offset` and `test_offset`. By default, `unit` is the same as the one used in the target aggregation. `unit` can be set to `months` or `hours` if needed.
TemporalSplit([train_ratio, val_ratio, test_ratio])	Defines a temporal split of the training table according to `train_ratio`, `val_ratio` and `test_ratio`.	Kumo first sorts the training table by its time column, then selects the first `int(len(train_table) * train_ratio)` rows as train data. From the rows that remain, it selects the first `int(len(train_table) * val_ratio)` rows as val data. From the rows that remain after removing both train and val data, it selects the first `int(len(train_table) * test_ratio)` rows as test data.
TimeRangeSplit([(train_date_start, train_date_end), (val_date_start, val_date_end), (test_date_start, test_date_end)])	Defines a time range split of the training table according to a list of given start and end times: `train_date_start`, `train_date_end`, `val_date_start`, `val_date_end`, `test_date_start` and `test_date_end`. Format: `YYYY-MM-DD` or `YYYY-MM-DDTHHMMSS`	Kumo uses the three time ranges as the exact start and end dates of the train, validation, and test sets. The data-generating procedure (performed separately for each of the three sets) is the same as for `TemporalSplit` and `DateOffsetSplit`, except that `max_timestamp` in the data is ignored in favor of the user-defined end of each interval. Additionally, for target aggregations with a non-zero start offset, Kumo ignores the first offset-worth of data to avoid data leakage.
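
The row-count arithmetic shared by `RandomSplit` and `TemporalSplit` can be sketched in a few lines of Python. This is a minimal illustration of the procedure described in the table, not Kumo's implementation: `ratio_split`, its parameters, and the list-of-rows representation are all assumptions made for the example.

```python
import random

def ratio_split(rows, ratios, time_key=None, seed=0):
    """Split rows into (train, val, test) by ratio.

    Hypothetical helper. With time_key=None the rows are shuffled
    (RandomSplit-style); otherwise they are sorted by time_key
    (TemporalSplit-style).
    """
    train_ratio, val_ratio, test_ratio = ratios
    rows = list(rows)
    if time_key is None:
        random.Random(seed).shuffle(rows)
    else:
        rows.sort(key=time_key)

    n = len(rows)
    # Each count is computed from the full table length, as in the table above.
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    n_test = int(n * test_ratio)

    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Ten rows of (row_id, timestamp), split 80/10/10 by time.
table = [(i, 100 - i) for i in range(10)]
train, val, test = ratio_split(table, [0.8, 0.1, 0.1], time_key=lambda r: r[1])
```

Because every count is truncated with `int(...)`, a few trailing rows can end up in none of the three sets when the table length does not divide evenly.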

Example

split: ApproxDateOffsetSplit([0.8, 0.1, 0.1])
split: DateOffsetSplit([-360, -180], unit='days')
split: RandomSplit([0.7, 0.2, 0.1])
split: TemporalSplit([0.7, 0.2, 0.1])
split: TimeRangeSplit([("1994-01-01", "1997-01-01"), ("1997-01-01", "1998-01-01"), ("1998-01-01", "1999-01-01")])
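
To make the `DateOffsetSplit([-360, -180], unit='days')` example more concrete, the sketch below computes the val and test boundaries from the formulas in the Methods table and assigns a row to a set. It is an illustration under stated assumptions: the function names, the 30-day target aggregation end, and the `max_date` value are hypothetical, only day-based units are handled, and the "rest is train" rule follows the table's wording literally.

```python
from datetime import date, timedelta

def date_offset_boundaries(max_date, agg_end_days, val_offset, test_offset):
    """Return (val_start, test_start) for day-based offsets (hypothetical helper)."""
    base = max_date - timedelta(days=agg_end_days)
    return base + timedelta(days=val_offset), base + timedelta(days=test_offset)

def assign_split(anchor, horizon_end, max_date, val_start, test_start):
    """Assign one training-table row to a set, per the table's description."""
    if anchor >= test_start and horizon_end <= max_date:
        return "test"
    if anchor >= val_start and horizon_end <= test_start:
        return "val"
    # Everything else is treated as train, following the table literally.
    return "train"

# Assumed inputs: latest date in the training table and a 30-day aggregation end.
max_date = date(2024, 1, 1)
val_start, test_start = date_offset_boundaries(max_date, 30, -360, -180)

label = assign_split(date(2023, 7, 1), date(2023, 7, 31),
                     max_date, val_start, test_start)
```

With these assumed inputs the val and test boundaries are 180 days apart, mirroring the difference between the two offsets.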

Timeframe Generation From Splits

When specifying a split for a temporal query where you want specific date and time windows, you can use TimeRangeSplit or DateOffsetSplit to define the training, validation, and holdout data splits. Once the desired time windows are defined in the split config, Kumo generates timestamps by taking the timeframe_step (either explicitly defined in the model plan or inferred from the query) and counting back from the end of each time window. For some splits, especially TimeRangeSplit, this can leave some data uncovered if the requested time window is not an exact multiple of the timeframe step.

For example, using the following split with a timeframe step of 30 days

TimeRangeSplit([
  ('2023-04-01', '2023-06-01'),
  ('2023-06-01', '2023-07-01'),
  ('2023-07-01', '2023-08-01')
])

will generate the following timeframes (where the timeframe start is open and the timeframe end is closed):

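Holdout: (2023-07-02 to 2023-08-01], that is, (2023-08-01 - 30 days) to (2023-08-01 - 0 days)
Validation: (2023-06-01 to 2023-07-01], that is, (2023-07-01 - 30 days) to (2023-07-01 - 0 days)
Training:
(2023-05-02 to 2023-06-01], that is, (2023-06-01 - 30 days) to (2023-06-01 - 0 days)
(2023-04-02 to 2023-05-02], that is, (2023-05-02 - 30 days) to (2023-05-02 - 0 days)

The leftover spans at the start of the training window (2023-04-01 to 2023-04-02) and of the holdout window (2023-07-01 to 2023-07-02) are shorter than the 30-day timeframe step, so no timeframe covers them.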
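
The counting-back procedure can be sketched as follows. This is a minimal illustration rather than Kumo's implementation: `timeframes` is a hypothetical helper, the step is assumed to be a whole number of days, and dropping any leftover span shorter than one step is an assumption inferred from the example above.

```python
from datetime import date, timedelta

def timeframes(window_start, window_end, step_days):
    """Return (start, end] timeframes, counting back from window_end.

    Hypothetical helper; a leftover span shorter than one step at the
    beginning of the window is dropped (assumed behavior).
    """
    step = timedelta(days=step_days)
    frames = []
    frame_end = window_end
    while frame_end - step >= window_start:
        frames.append((frame_end - step, frame_end))
        frame_end -= step
    return frames

# The three windows from the TimeRangeSplit example, with a 30-day step.
training = timeframes(date(2023, 4, 1), date(2023, 6, 1), 30)
validation = timeframes(date(2023, 6, 1), date(2023, 7, 1), 30)
holdout = timeframes(date(2023, 7, 1), date(2023, 8, 1), 30)
```

Here `training` contains two timeframes while `validation` and `holdout` contain one each, which is why choosing window lengths that are multiples of the timeframe step avoids uncovered data.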