Description
Kumo generates a training table of entities and labels from your predictive query and uses it for GNN training. By default, Kumo splits this table into three disjoint sets (train, validation, and test), using a split method that depends on the query type:
- When the predictive query is a temporal node prediction, the default split method is ApproxDateOffsetSplit([0.8, 0.1, 0.1]).
- When the predictive query is a temporal link prediction, the default split method is DateOffsetSplit([-target_aggregation_end_date, 0]).
- When the predictive query is static and the entity table or the target table has a time column, the default split method is TemporalSplit([0.8, 0.1, 0.1]).
- When the predictive query is static and neither the entity table nor the target table has a time column, the default split method is RandomSplit([0.8, 0.1, 0.1]).
You can specify a custom split method and its parameters to generate the training, validation, and test sets by setting the split value in the Model Planner input field.
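The default selection rules listed above can be summarized as a small decision function. This is only an illustrative sketch: the function name `default_split` and its boolean parameters are hypothetical and are not part of Kumo's API. It returns the split expression as a string, exactly as it would be written in the split field.

```python
def default_split(is_temporal: bool,
                  is_link_prediction: bool = False,
                  has_time_column: bool = False) -> str:
    """Return the default split expression for a query (illustrative only).

    The parameter names are assumptions made for this sketch; they simply
    encode the conditions described in the list of defaults.
    """
    if is_temporal:
        if is_link_prediction:
            # Temporal link prediction: offsets are relative to the
            # target aggregation end date of the query.
            return "DateOffsetSplit([-target_aggregation_end_date, 0])"
        # Temporal node prediction.
        return "ApproxDateOffsetSplit([0.8, 0.1, 0.1])"
    if has_time_column:
        # Static query where the entity or target table has a time column.
        return "TemporalSplit([0.8, 0.1, 0.1])"
    # Static query with no time column on either table.
    return "RandomSplit([0.8, 0.1, 0.1])"


print(default_split(is_temporal=False, has_time_column=True))
```

Any of these defaults can be overridden by setting the split value explicitly.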
Supported Task Types
- All
Supported Query Types
- ApproxDateOffsetSplit: Supports temporal queries only
- DateOffsetSplit: Supports temporal queries only
- RandomSplit: Supports all query types; more suitable for static queries
- TemporalSplit: Supports all query types; more suitable for static queries
- TimeRangeSplit: Supports all query types
Methods
Method | Purpose | Details |
---|---|---|
RandomSplit | Defines a random split of the training table according to | Kumo first shuffles all the training data, then selects the first |
DateOffsetSplit (accepts a unit argument) | Defines a date offset split of the training table according to the (relative) offsets from the max date in the training table. | Training data with anchor date greater than or equal to |
TemporalSplit | Defines a temporal split of the training table according to | Kumo first sorts the training table |
TimeRangeSplit | Defines a time range split of the training table according to a list of given start and end times, Format: | Kumo uses the three sets of time range splits for specifying the exact start/end dates of the train, valid, and test sets. The data-generating procedure (performed separately for each of the three sets) is the same as for |
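To make the difference between the two ratio-based methods concrete, here is a minimal sketch of a shuffle-then-slice split and a sort-then-slice split. It is not Kumo's implementation: the helper names, the fixed seed, and the rule that rounding leftovers go to the test set are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple


def ratio_split(rows: Sequence, ratios: Sequence[float]) -> Tuple[List, List, List]:
    """Slice already-ordered rows into train/valid/test by ratio.

    Assumption: any rows left over from rounding down go to the test set.
    """
    n = len(rows)
    train_end = int(n * ratios[0])
    valid_end = train_end + int(n * ratios[1])
    return list(rows[:train_end]), list(rows[train_end:valid_end]), list(rows[valid_end:])


def random_split(rows: Sequence, ratios: Sequence[float], seed: int = 0):
    """Shuffle the rows first, then slice them by ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return ratio_split(shuffled, ratios)


def temporal_split(rows: Sequence, ratios: Sequence[float], time_key: Callable):
    """Sort the rows by time first, then slice them by ratio,
    so the earliest rows become the training set."""
    return ratio_split(sorted(rows, key=time_key), ratios)


# Hypothetical rows of (entity_id, timestamp).
rows = [("a", 3), ("b", 1), ("c", 2), ("d", 5), ("e", 4)]
train, valid, test = temporal_split(rows, [0.6, 0.2, 0.2], time_key=lambda r: r[1])
print(train, valid, test)
```

With the temporal variant the test set always holds the most recent rows, which is why it is the better fit when a time column is available.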
Example
split: ApproxDateOffsetSplit([0.8, 0.1, 0.1])
split: DateOffsetSplit([-360, -180], unit='days')
split: RandomSplit([0.7, 0.2, 0.1])
split: TemporalSplit([0.7, 0.2, 0.1])
split: TimeRangeSplit([("1994-01-01", "1997-01-01"), ("1997-01-01", "1998-01-01"), ("1998-01-01", "1999-01-01")])
Timeframe Generation From Splits
When you specify a split for a temporal query and want specific date and time windows, you can use TimeRangeSplit or DateOffsetSplit to define the training, validation, and holdout data splits. Once the time windows are defined in the split config, Kumo generates timestamps by taking the timeframe_step (either explicitly defined in the model plan or inferred from the query) and counting back from the end of each time window. For some splits, especially TimeRangeSplit, some data can be left out if the requested time window is not an exact multiple of the timeframe step.
For example, using the following split with a timeframe step of 30 days
TimeRangeSplit([
('2023-04-01', '2023-06-01'),
('2023-06-01', '2023-07-01'),
('2023-07-01', '2023-08-01')
])
will generate the following timeframes (where the timeframe start is open and the timeframe end is closed):
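Holdout: (2023-07-02 to 2023-08-01], that is, (2023-08-01 - 30 days) to (2023-08-01 - 0 days)
Validation: (2023-06-01 to 2023-07-01], that is, (2023-07-01 - 30 days) to (2023-07-01 - 0 days)
Training:
(2023-05-02 to 2023-06-01], that is, (2023-06-01 - 30 days) to (2023-06-01 - 0 days)
(2023-04-02 to 2023-05-02], that is, (2023-05-02 - 30 days) to (2023-05-02 - 0 days)
The training window spans 61 days, so after two 30-day timeframes the remaining interval from 2023-04-01 to 2023-04-02 is shorter than the timeframe step and is not covered by any timeframe.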
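The counting-back procedure can be reproduced with a few lines of Python. This is a sketch rather than Kumo's actual code: the function name `timeframes` is hypothetical, and the rule that a timeframe is kept only if it fits entirely inside the requested window is an assumption inferred from the example above.

```python
from datetime import date, timedelta
from typing import List, Tuple


def timeframes(start: date, end: date, step_days: int) -> List[Tuple[date, date]]:
    """Count back from `end` in steps of `step_days`.

    Each returned pair is (open start, closed end). A timeframe is kept
    only if it lies fully within [start, end] (assumption).
    """
    step = timedelta(days=step_days)
    frames = []
    frame_end = end
    while frame_end - step >= start:
        frames.append((frame_end - step, frame_end))
        frame_end -= step
    return frames


# The training window from the example, with a 30-day timeframe step.
for frame_start, frame_end in timeframes(date(2023, 4, 1), date(2023, 6, 1), 30):
    print(f"({frame_start} to {frame_end}]")
# (2023-05-02 to 2023-06-01]
# (2023-04-02 to 2023-05-02]
```

Running the same function on the validation and holdout windows yields one timeframe each, matching the list above. Choosing window lengths that are exact multiples of the timeframe step avoids leaving any part of the window uncovered.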