split
Description
Kumo generates a training table of entities and labels from your predictive query for use in GNN training. By default, Kumo splits this training table into three disjoint sets (train, validation, and test) for training the predictive query. The default split method depends on the query:
- When the predictive query is temporal node prediction, the default split method is ApproxDateOffsetSplit([0.8, 0.1, 0.1]).
- When the predictive query is temporal link prediction, the default split method is DateOffsetSplit([-target_aggregation_end_date, 0]).
- When the predictive query is static and the entity or the target table has a time column, the default split method is TemporalSplit([0.8, 0.1, 0.1]).
- When the predictive query is static and neither the entity nor the target table has a time column, the default split method is RandomSplit([0.8, 0.1, 0.1]).
You can specify a custom split method and its parameters to generate the training, validation, and test sets by setting the split value in the Model Planner input field.
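The default-selection rules above can be summarized as a small decision function. This is only an illustrative sketch: the function name and its boolean parameters are hypothetical and are not part of Kumo's API; the returned strings are the default split expressions listed above.

```python
def default_split(is_temporal: bool, is_link_prediction: bool,
                  has_time_column: bool) -> str:
    """Return the documented default split expression as a string.

    Hypothetical helper for illustration only; Kumo picks the default
    internally when no split is set in the Model Planner.
    """
    if is_temporal:
        if is_link_prediction:
            # Temporal link prediction
            return "DateOffsetSplit([-target_aggregation_end_date, 0])"
        # Temporal node prediction
        return "ApproxDateOffsetSplit([0.8, 0.1, 0.1])"
    if has_time_column:
        # Static query where the entity or target table has a time column
        return "TemporalSplit([0.8, 0.1, 0.1])"
    # Static query with no time column on either table
    return "RandomSplit([0.8, 0.1, 0.1])"


print(default_split(is_temporal=False, is_link_prediction=False,
                    has_time_column=True))
```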
Supported Task Types
- All
Supported Query Types
- ApproxDateOffsetSplit: Supports temporal queries only.
- DateOffsetSplit: Supports temporal queries only.
- RandomSplit: Supports all query types; more suitable for static queries.
- TemporalSplit: Supports all query types; more suitable for static queries.
- TimeRangeSplit: Supports all query types.
Methods
Method | Purpose | Details |
---|---|---|
RandomSplit([train_ratio, val_ratio, test_ratio]) | Defines a random split of the training table according to train_ratio, val_ratio, and test_ratio. | Kumo first shuffles all the training data, then takes the first int(len(train_table) * train_ratio) rows as train data. From the remaining rows, it takes the first int(len(train_table) * val_ratio) rows as validation data, and from what remains after that, the first int(len(train_table) * test_ratio) rows as test data. |
DateOffsetSplit([val_offset, test_offset], unit) | Defines a date offset split of the training table according to (relative) offsets from the max date in the training table. | Rows whose anchor date is greater than or equal to max_date - target_aggregation_end_date + test_offset and whose prediction horizon end date is less than or equal to max_date become test data. Rows whose anchor date is greater than or equal to max_date - target_aggregation_end_date + val_offset and whose prediction horizon end date is less than or equal to max_date - target_aggregation_end_date + test_offset become validation data. All remaining rows become train data. unit defines the time unit of val_offset and test_offset. It defaults to the unit used in the target aggregation and can be set to months or hours if needed. |
TemporalSplit([train_ratio, val_ratio, test_ratio]) | Defines a temporal split of the training table according to train_ratio, val_ratio, and test_ratio. | Kumo first sorts the training table by its time column, then takes the first int(len(train_table) * train_ratio) rows as train data. From the remaining rows, it takes the first int(len(train_table) * val_ratio) rows as validation data, and from what remains after that, the first int(len(train_table) * test_ratio) rows as test data. |
TimeRangeSplit([(train_date_start, train_date_end), (val_date_start, val_date_end), (test_date_start, test_date_end)]) | Defines a time range split of the training table according to a list of given start and end times: train_date_start, train_date_end, val_date_start, val_date_end, test_date_start, and test_date_end. | Kumo uses the three time ranges as the exact start and end dates of the train, validation, and test sets. The data-generating procedure (performed separately for each of the three sets) is the same as for TemporalSplit and DateOffsetSplit, except that max_timestamp in the data is ignored in favor of the user-defined end of each interval. Additionally, for target aggregations with a non-zero start offset, Kumo ignores the first offset-worth of data to avoid data leakage. |
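The row-selection arithmetic described for RandomSplit and TemporalSplit can be sketched in plain Python. This is a minimal illustration of the documented steps, not Kumo's implementation: the function name, the `time_key` and `seed` parameters, and the use of a Python list as the training table are all assumptions made for the example.

```python
import random


def ratio_split(rows, ratios, time_key=None, seed=0):
    """Split rows into (train, val, test) by ratio.

    With time_key=None the rows are shuffled first (RandomSplit-like);
    otherwise they are sorted by time_key (TemporalSplit-like).
    Hypothetical helper for illustration only.
    """
    train_ratio, val_ratio, test_ratio = ratios
    n = len(rows)

    if time_key is None:
        ordered = list(rows)
        random.Random(seed).shuffle(ordered)  # shuffle a copy of the table
    else:
        ordered = sorted(rows, key=time_key)  # oldest rows first

    # Sizes are always computed from the full table length.
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    n_test = int(n * test_ratio)

    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# Ten rows whose value doubles as a timestamp, deliberately out of order.
table = [9, 3, 0, 7, 1, 8, 2, 6, 4, 5]
t_train, t_val, t_test = ratio_split(table, [0.8, 0.1, 0.1], time_key=lambda r: r)
r_train, r_val, r_test = ratio_split(table, [0.8, 0.1, 0.1])
print(t_train, t_val, t_test)
```

Because each size is truncated with int(), a few rows can be left unassigned when the table length times a ratio is not a whole number.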
Example
split: ApproxDateOffsetSplit([0.8, 0.1, 0.1])
split: DateOffsetSplit([-360, -180], unit='days')
split: RandomSplit([0.7, 0.2, 0.1])
split: TemporalSplit([0.7, 0.2, 0.1])
split: TimeRangeSplit([("1994-01-01", "1997-01-01"), ("1997-01-01", "1998-01-01"), ("1998-01-01", "1999-01-01")])
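To make the DateOffsetSplit example concrete, the cutoff dates implied by the formulas in the Methods table can be computed by hand. The sketch below is an illustration under stated assumptions: the function name and dictionary keys are hypothetical, only day units are handled, and the max date (2024-01-01) and 30-day target aggregation end are made-up inputs rather than values from Kumo.

```python
from datetime import date, timedelta


def date_offset_cutoffs(max_date, target_aggregation_end_days,
                        val_offset_days, test_offset_days):
    """Compute the anchor-date and horizon-end cutoffs for a day-based
    DateOffsetSplit, following the formulas in the Methods table.
    Hypothetical helper for illustration only.
    """
    base = max_date - timedelta(days=target_aggregation_end_days)
    test_anchor_start = base + timedelta(days=test_offset_days)
    val_anchor_start = base + timedelta(days=val_offset_days)
    return {
        # test: anchor date >= test_anchor_start, horizon end <= max_date
        "test_anchor_start": test_anchor_start,
        "test_horizon_end": max_date,
        # val: anchor date >= val_anchor_start, horizon end <= test_anchor_start
        "val_anchor_start": val_anchor_start,
        "val_horizon_end": test_anchor_start,
    }


# Assumed inputs: max date 2024-01-01, 30-day target aggregation,
# and the offsets from DateOffsetSplit([-360, -180], unit='days').
cutoffs = date_offset_cutoffs(date(2024, 1, 1), 30, -360, -180)
for name, value in cutoffs.items():
    print(name, value.isoformat())
```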
Timeframe Generation From Splits
When specifying a split for a temporal query where you want specific date and time windows, you can use TimeRangeSplit or DateOffsetSplit to define the training, validation, and holdout data splits. Once the desired time windows are defined in the split config, Kumo generates timestamps by taking the timeframe_step (either explicitly defined in the model plan or inferred from the query) and counting back from the end of each time window. For some splits, especially TimeRangeSplit, this can leave some data unused if the requested time window is not an exact multiple of the timeframe step.
For example, using the following split with a timeframe step of 30 days:
TimeRangeSplit([
('2023-04-01', '2023-06-01'),
('2023-06-01', '2023-07-01'),
('2023-07-01', '2023-08-01')
])
will generate the following timeframes (where the timeframe start is open and the timeframe end is closed):
Holdout: (2023-07-02 to 2023-08-01] (2023-08-01 - 30 days) to (2023-08-01 - 0 days)
Validation: (2023-06-01 to 2023-07-01] (2023-07-01 - 30 days) to (2023-07-01 - 0 days)
Training:
(2023-05-02 to 2023-06-01] (2023-06-01 - 30 days) to (2023-06-01 - 0 days)
(2023-04-02 to 2023-05-02] (2023-05-02 - 30 days) to (2023-05-02 - 0 days)
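The counting-back procedure can be reproduced with a few lines of Python. This is a sketch under one assumption that is consistent with the example above but not stated explicitly in it: a timeframe is kept only if it fits entirely inside the requested window. The function name is hypothetical.

```python
from datetime import date, timedelta


def timeframes(window_start, window_end, step_days):
    """Return (start, end] timeframes of length step_days, generated by
    counting back from window_end. Frames that would begin before
    window_start are dropped (assumed behavior)."""
    step = timedelta(days=step_days)
    frames = []
    end = window_end
    while end - step >= window_start:
        frames.append((end - step, end))
        end -= step
    return frames


training = timeframes(date(2023, 4, 1), date(2023, 6, 1), 30)
validation = timeframes(date(2023, 6, 1), date(2023, 7, 1), 30)
holdout = timeframes(date(2023, 7, 1), date(2023, 8, 1), 30)

for name, frames in [("Holdout", holdout), ("Validation", validation),
                     ("Training", training)]:
    for start, end in frames:
        print(f"{name}: ({start} to {end}]")
```

In this sketch the 61-day training window yields two 30-day timeframes, so the data between 2023-04-01 and 2023-04-02 is not covered, which is the kind of gap described above.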