split

split: (ApproxDateOffsetSplit | DateOffsetSplit | RandomSplit | TemporalSplit | TimeRangeSplit) (Optional)

Description

Kumo generates a training table of entities and labels from your predictive query and uses it for GNN training. By default, Kumo splits the generated training table into three disjoint sets (train, validation, and test), choosing the split method based on the type of predictive query:

  • When the predictive query is temporal node prediction, the default split method is ApproxDateOffsetSplit([0.8, 0.1, 0.1]).
  • When the predictive query is temporal link prediction, the default split method is DateOffsetSplit([-timeframe, 0]).
  • When the predictive query is static and the entity or the target table has a time column, the default split method is TemporalSplit([0.8, 0.1, 0.1]).
  • When the predictive query is static and neither the entity nor the target table has a time column, the default split method is RandomSplit([0.8, 0.1, 0.1]).
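
The default rules above can be summarized as a small decision function. This is only an illustrative sketch: the function name and its boolean parameters are hypothetical and are not part of Kumo's API. It returns the default split as a string in the same form used in the model plan.

```python
def default_split(is_temporal: bool,
                  is_link_prediction: bool = False,
                  has_time_column: bool = False) -> str:
    """Return the documented default split for a query (illustrative only).

    All parameter names are assumptions made for this sketch:
      is_temporal        -- the predictive query is temporal
      is_link_prediction -- the temporal query is a link prediction query
      has_time_column    -- the entity or target table has a time column
    """
    if is_temporal:
        if is_link_prediction:
            return "DateOffsetSplit([-timeframe, 0])"
        return "ApproxDateOffsetSplit([0.8, 0.1, 0.1])"
    # Static queries: the default depends on whether a time column exists.
    if has_time_column:
        return "TemporalSplit([0.8, 0.1, 0.1])"
    return "RandomSplit([0.8, 0.1, 0.1])"
```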

You can specify a custom split method and its parameters to generate the training, validation, and test sets by setting the split value in the Model Planner input field.

Supported Task Types

  • All

Supported Query Types

  • ApproxDateOffsetSplit: Supports temporal queries only.
  • DateOffsetSplit: Supports temporal queries only.
  • RandomSplit: Supports all query types; better suited to static queries.
  • TemporalSplit: Supports all query types; better suited to static queries.
  • TimeRangeSplit: Supports all query types.

Methods

Method: RandomSplit([train_ratio, val_ratio, test_ratio])
Purpose: Defines a random split of the training table according to train_ratio, val_ratio, and test_ratio.
Details: Kumo first shuffles all the training data, then selects the first int(len(train_table) * train_ratio) rows as train data. From the remaining rows, it selects the first int(len(train_table) * val_ratio) rows as val data. From the rows left after removing both train and val data, it selects the first int(len(train_table) * test_ratio) rows as test data.

Method: DateOffsetSplit([val_offset, test_offset], unit)
Purpose: Defines a date offset split of the training table according to the (relative) offsets from the max date in the training table.
Details: Training data with an anchor date greater than or equal to max_date - test_offset - timeframe and a prediction horizon end date less than or equal to max_date becomes test data. Training data with an anchor date greater than or equal to max_date - val_offset - timeframe and a prediction horizon end date less than or equal to max_date - test_offset - timeframe becomes val data. The rest becomes train data. unit defines the time unit of val_offset and test_offset. It defaults to the unit used in the target aggregation and can be set to months or hours if needed.

Method: TemporalSplit([train_ratio, val_ratio, test_ratio])
Purpose: Defines a temporal split of the training table according to train_ratio, val_ratio, and test_ratio.
Details: Kumo first sorts the training table by its time column, then selects the first int(len(train_table) * train_ratio) rows as train data. From the remaining rows, it selects the first int(len(train_table) * val_ratio) rows as val data. From the rows left after removing both train and val data, it selects the first int(len(train_table) * test_ratio) rows as test data.

Method: TimeRangeSplit([(train_date_start, train_date_end), (val_date_start, val_date_end), (test_date_start, test_date_end)])
Purpose: Defines a time range split of the training table according to a list of given start and end times: train_date_start, train_date_end, val_date_start, val_date_end, test_date_start, and test_date_end.
Details: Kumo uses the three time ranges as the exact start and end dates of the train, validation, and test sets. The data-generating procedure (performed separately for each of the three sets) is the same as for TemporalSplit and DateOffsetSplit, except that max_timestamp in the data is ignored in favor of the user-defined end of each interval. Additionally, for target aggregations with a non-zero start offset, Kumo ignores the first offset-worth of data to avoid data leakage.
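
The ratio-based procedure described for RandomSplit and TemporalSplit can be sketched in a few lines of Python. This is a minimal illustration of the documented steps, not Kumo's implementation: the function name ratio_split, the time_key argument, and the seed are assumptions introduced here. When time_key is omitted the rows are shuffled (as in RandomSplit); when it is given the rows are sorted by time (as in TemporalSplit).

```python
import random

def ratio_split(rows, ratios, time_key=None, seed=0):
    """Split rows into (train, val, test) by ratio (illustrative sketch).

    rows     -- a list of training-table rows
    ratios   -- [train_ratio, val_ratio, test_ratio]
    time_key -- hypothetical: function returning a row's timestamp;
                None means shuffle instead of sorting
    seed     -- hypothetical: makes the shuffle reproducible
    """
    train_ratio, val_ratio, test_ratio = ratios
    n = len(rows)
    if time_key is None:
        ordered = list(rows)
        random.Random(seed).shuffle(ordered)   # RandomSplit: shuffle first
    else:
        ordered = sorted(rows, key=time_key)   # TemporalSplit: sort by time
    # Each set size is computed from the full table length, as documented.
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    n_test = int(n * test_ratio)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:n_train + n_val + n_test]
    return train, val, test
```

For example, ratio_split(list(range(100)), [0.8, 0.1, 0.1], time_key=lambda r: r) places the 80 earliest rows in train, the next 10 in val, and the last 10 in test.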

Example

split: ApproxDateOffsetSplit([0.8, 0.1, 0.1])
split: DateOffsetSplit([-360, -180], unit='days')
split: RandomSplit([0.7, 0.2, 0.1])
split: TemporalSplit([0.7, 0.2, 0.1])
split: TimeRangeSplit([("1994-01-01", "1997-01-01"), ("1997-01-01", "1998-01-01"), ("1998-01-01", "1999-01-01")])

Timeframe Generation From Splits

Each of these methods defines the date-time windows for the training, validation, and holdout data splits. Once these windows are defined, Kumo generates timestamps by taking the timeframe_step (either explicitly defined in the model plan or inferred from the query) and counting back from the end of each time window. For some splits, especially TimeRangeSplit, this can leave data at the start of a window unused if the requested window is not a whole multiple of the timeframe step.

For example, using the following split with a timeframe step of 30 days

TimeRangeSplit([
  ('2023-04-01', '2023-06-01'),
  ('2023-06-01', '2023-07-01'),
  ('2023-07-01', '2023-08-01')
])

will generate the following timeframes (where timeframe start is open and timeframe end is closed)

Holdout: (2023-07-02 to 2023-08-01] (2023-08-01 - 30 days) to (2023-08-01 - 0 days)
Validation: (2023-06-01 to 2023-07-01] (2023-07-01 - 30 days) to (2023-07-01 - 0 days)
Training:
(2023-05-02 to 2023-06-01] (2023-06-01 - 30 days) to (2023-06-01 - 0 days)
(2023-04-02 to 2023-05-02] (2023-05-02 - 30 days) to (2023-05-02 - 0 days)
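
The timeframes listed above can be reproduced with a short Python sketch that counts back from the end of each window in steps of 30 days. The function name timeframes is hypothetical, and the rule that a timeframe is kept only when it fits entirely inside the window is an assumption inferred from this example rather than a documented guarantee.

```python
from datetime import date, timedelta

def timeframes(window_start, window_end, step_days):
    """Count back from window_end in fixed steps (illustrative sketch).

    Returns (start, end) pairs, latest first, where start is open and
    end is closed. Assumption: a timeframe is emitted only if it fits
    entirely within [window_start, window_end].
    """
    step = timedelta(days=step_days)
    frames = []
    end = window_end
    while end - step >= window_start:
        frames.append((end - step, end))
        end -= step
    return frames

# Training window from the example above, with a 30-day timeframe step.
for start, end in timeframes(date(2023, 4, 1), date(2023, 6, 1), 30):
    print(f"({start} to {end}]")
```

Under these assumptions the training window is 61 days long but only two 30-day timeframes fit, so the earliest day of the window is not covered by any timeframe.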