Description
The encoder_overrides field lets you configure how Kumo processes your input data and override the encoders that Kumo infers.
Supported Task Types
- All
Nested fields must be indented by exactly four spaces. Each entry consists of two case-sensitive fields, the column (written as <table_name>.<column_name>) and the encoder to apply to it:
encoder_overrides:
    <table_name>.<column_name>: <encoder>
Example
encoder_overrides:
    MOVIES.overview: GloVe(model_name="glove.6B", embedding_dim=50)
    MOVIES.tag_line: Null()
    MOVIES.budget: Numerical(scaler="minmax", na_strategy="zero")
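If the overrides are generated programmatically, the four-space indentation and the <table_name>.<column_name> key format are easy to get wrong. The following Python sketch renders such a block from a dictionary. The helper name render_encoder_overrides is hypothetical and not part of any Kumo API; it only produces text in the shape shown above.

```python
def render_encoder_overrides(overrides):
    """Render an encoder_overrides block from {"TABLE.column": "Encoder(...)"}.

    Hypothetical helper for illustration only; it is not a Kumo function.
    """
    lines = ["encoder_overrides:"]
    for column, encoder in overrides.items():
        # Each key must look like <table_name>.<column_name>.
        table, dot, name = column.partition(".")
        if not (table and dot and name):
            raise ValueError(f"expected <table_name>.<column_name>, got {column!r}")
        # Nested fields are indented by exactly four spaces.
        lines.append(f"    {column}: {encoder}")
    return "\n".join(lines)


example = render_encoder_overrides({
    "MOVIES.overview": 'GloVe(model_name="glove.6B", embedding_dim=50)',
    "MOVIES.tag_line": "Null()",
    "MOVIES.budget": 'Numerical(scaler="minmax", na_strategy="zero")',
})
print(example)
```

Table and column names are passed through unchanged because both fields are case sensitive.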
Supported Encoders
Column Type | Encoder | Argument | Default | Supported Value | Description |
---|---|---|---|---|---|
Categorical, ID | Index | na_strategy | "separate" | "zero", "separate", "most_frequent" | When set to "zero", embeddings for missing values are represented as zero vectors. When set to "separate", missing values are treated as a distinct category. When set to "most_frequent", missing values are assigned to the most prevalent category. |
Categorical, ID | Index | min_occ | 1 | positive integer | The minimum number of occurrences required for a category. If a category's count is lower than min_occ, Kumo treats the category as N/A. |
Categorical, ID | Hash | na_strategy | "separate" | "zero", "separate", "most_frequent" | When set to "zero", embeddings for missing values are represented as zero vectors. When set to "separate", missing values are treated as a distinct category. When set to "most_frequent", missing values are assigned to the most prevalent category. |
Categorical, ID | Hash | num_components | Depends on cardinality of the column | positive integer | The capacity of the hash table. |
Categorical, ID | Hash | min_occ | Depends on cardinality of the column | positive integer | The minimum number of occurrences required for a category. If a category's count is lower than min_occ, Kumo treats the category as N/A. |
Categorical, ID | Hash | na_strategy | "zero" | "zero", "separate", "most_frequent" | When set to "zero", embeddings for missing values are represented as zero vectors. When set to "separate", missing values are treated as a distinct category. When set to "most_frequent", missing values are assigned to the most prevalent category. |
Multicategorical | MultiCategorical | min_occ | 1 | positive integer | The minimum number of occurrences required for a category. If a category's count is lower than min_occ, Kumo treats the category as N/A. |
Multicategorical | MultiCategorical | sep | Inferred by Kumo | string | The separator to use. |
Numerical | Numerical | scaler | None | None, "standard", "minmax", "robust" | When set to None, no transformation is applied to the column values. When set to "standard", the column values are transformed to have zero mean and unit variance. When set to "minmax", the values are scaled to fall within the range [0, 1]. When set to "robust", the column's median is subtracted from each value and the result is divided by the interquartile range. |
Numerical | Numerical | na_strategy | "mean" | "mean", "zero" | If "mean", N/A values are replaced with the mean value of the column. If "zero", N/A values are replaced with zero. |
Numerical | MaxLogNumerical | na_strategy | "mean" | "mean", "zero" | If "mean", N/A values are replaced with the mean value of the column. If "zero", N/A values are replaced with zero. |
Numerical | MinLogNumerical | na_strategy | "mean" | "mean", "zero" | If "mean", N/A values are replaced with the mean value of the column. If "zero", N/A values are replaced with zero. |
Embedding | NumericalList | na_strategy | "zero" | "zero" | If "zero", N/A values are replaced with zero. |
Timestamp | Datetime | include_minute | true | true, false | Whether to include the minute. |
Timestamp | Datetime | include_hour | true | true, false | Whether to include the hour. |
Timestamp | Datetime | include_day_of_week | true | true, false | Whether to include the day of the week. |
Timestamp | Datetime | include_day_of_month | true | true, false | Whether to include the day of the month. |
Timestamp | Datetime | include_day_of_year | true | true, false | Whether to include the day of the year. |
Timestamp | Datetime | include_year | true | true, false | Whether to include the year. |
Timestamp | Datetime | num_year_periods | Depends on the diff between the min and max year in the column | positive integer | The number of periods to use when encoding years. For example, with num_year_periods=4, the year is encoded as year % i for each i in {2, 4, 8, 16}. If set to None, it is inferred from dataset statistics. |
Text | GloVe | model_name | "glove.840B" | "glove.6B", "glove.42B", "glove.840B", "glove_twitter.27B" | The pretrained model name. |
Text | GloVe | embedding_dim | 300 | 25, 50, 100, 200, 300 | The embedding dimension of the pretrained model. Note that not every model supports every embedding dimension. See the GloVe Argument Combinations table below. |
Any type | Null | n/a | n/a | n/a | If Null is specified for a column, Kumo ignores that column completely. |
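To make some of the argument descriptions above concrete, here is a minimal Python sketch of what they describe: the na_strategy and scaler options for numerical columns, the min_occ threshold, and the year % i features from the num_year_periods example. It is a plain standard-library illustration of the formulas in the table, not Kumo's implementation. All function names are hypothetical, and details such as the exact quartile method and the use of the population standard deviation are assumptions.

```python
from collections import Counter
from statistics import mean, median, pstdev, quantiles


def fill_na(values, na_strategy="mean"):
    """Replace missing entries (None) with the column mean or with zero."""
    if na_strategy == "mean":
        fill = mean(v for v in values if v is not None)
    elif na_strategy == "zero":
        fill = 0.0
    else:
        raise ValueError(f"unsupported na_strategy: {na_strategy!r}")
    return [fill if v is None else v for v in values]


def scale(values, scaler=None):
    """Apply one of the documented scaler options to a list of numbers.

    Assumes the column is not constant; otherwise the divisions below fail.
    """
    if scaler is None:
        return list(values)  # no transformation
    if scaler == "standard":
        mu, sigma = mean(values), pstdev(values)  # population std (assumption)
        return [(v - mu) / sigma for v in values]
    if scaler == "minmax":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    if scaler == "robust":
        # Quartile method is an assumption; other definitions give other IQRs.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        med = median(values)
        return [(v - med) / (q3 - q1) for v in values]
    raise ValueError(f"unsupported scaler: {scaler!r}")


def apply_min_occ(values, min_occ=1):
    """Map categories seen fewer than min_occ times to N/A (None)."""
    counts = Counter(v for v in values if v is not None)
    return [v if v is not None and counts[v] >= min_occ else None for v in values]


def year_period_features(year, num_year_periods):
    """Return year % i for i in {2, 4, ..., 2 ** num_year_periods}."""
    return [year % (2 ** k) for k in range(1, num_year_periods + 1)]


print(scale(fill_na([0.0, None, 10.0]), scaler="minmax"))
print(apply_min_occ(["a", "a", "b"], min_occ=2))
print(year_period_features(2023, 4))
```

How the resulting values are turned into model inputs afterwards is not covered by the table and is left out here.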
GloVe Argument Combinations
model_name | embedding_dim |
---|---|
"glove.6B" | 50, 100, 200, 300 |
"glove.42B" | 300 |
"glove.840B" | 300 |
"glove_twitter.27B" | 25, 50, 100, 200 |
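The combinations in the GloVe Argument Combinations table can be checked before a configuration is submitted. The Python sketch below encodes that table as a dictionary and validates a model_name and embedding_dim pair. The names GLOVE_DIMS and is_valid_glove are hypothetical and not part of Kumo; the default arguments mirror the defaults listed in the Supported Encoders table.

```python
# Supported embedding dimensions per pretrained model, copied from the
# GloVe Argument Combinations table.
GLOVE_DIMS = {
    "glove.6B": {50, 100, 200, 300},
    "glove.42B": {300},
    "glove.840B": {300},
    "glove_twitter.27B": {25, 50, 100, 200},
}


def is_valid_glove(model_name="glove.840B", embedding_dim=300):
    """Return True if the model supports the requested embedding dimension."""
    # Model names are matched exactly, including case.
    return embedding_dim in GLOVE_DIMS.get(model_name, set())


print(is_valid_glove("glove.6B", 50))
print(is_valid_glove("glove.42B", 50))
```

The first call matches a supported pair, while the second does not because "glove.42B" is listed only with dimension 300.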