encoder_overrides

Description

The encoder_overrides field lets you control how Kumo processes your input data by overriding the encoders that Kumo infers for individual columns.

Supported Task Types

Indent nested fields with exactly four spaces. Specify each column with two case-sensitive fields, the table name and the column name, separated by a period:

encoder_overrides:
    <table_name>.<column_name>: <encoder>
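To make the entry format concrete, here is a minimal Python sketch that splits one override entry into its table name, column name, encoder name, and arguments. It is only an illustration of the `<table_name>.<column_name>: <encoder>` shape and is not how Kumo parses the field. The helper name `parse_override` is hypothetical, and the sketch assumes string and integer argument values only (lowercase `true`/`false` values are not handled).

```python
import ast


def parse_override(key: str, encoder: str):
    """Split '<table_name>.<column_name>' and parse '<Encoder>(arg=value, ...)'.

    Hypothetical helper for illustration; not part of Kumo.
    """
    table, sep, column = key.partition(".")
    if not sep or not table or not column:
        raise ValueError(f"expected <table_name>.<column_name>, got {key!r}")

    # The encoder text happens to be valid Python call syntax, so the
    # standard-library ast module can read it without executing anything.
    call = ast.parse(encoder, mode="eval").body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"expected Encoder(...), got {encoder!r}")
    if call.args:
        raise ValueError("this sketch only accepts keyword arguments")

    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return table, column, call.func.id, kwargs


print(parse_override("MOVIES.budget", 'Numerical(scaler="minmax", na_strategy="zero")'))
```

Because both fields are case-sensitive, the sketch keeps the table and column names exactly as written.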

Example

This example overrides three columns in the MOVIES table to change their default encoding.

encoder_overrides:
    MOVIES.overview: GloVe(model_name="glove.42B", embedding_dim=300)
    MOVIES.tag_line: Null()
    MOVIES.budget: Numerical(scaler="minmax", na_strategy="zero")

With the example above, Kumo handles the overridden columns as follows:

For the text column MOVIES.overview, Kumo uses the model "glove.42B" instead of the default "glove.6B". See the "GloVe Argument Combinations" section below for the available options, and the GloVe project for details.
Kumo ignores the column MOVIES.tag_line because it is set to Null(), so the column has no impact on modelling.
For the numerical column MOVIES.budget, Kumo scales values into the range [0, 1] because the scaler is set to "minmax" instead of the default "standard", and replaces missing values with zero because na_strategy is set to "zero".
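The difference between the "minmax" and "standard" scalers can be sketched in a few lines of Python. This is a conceptual illustration only, not Kumo's implementation: the function names are hypothetical, the use of the population standard deviation is an assumption, and missing-value handling is left out.

```python
from statistics import fmean, pstdev


def minmax_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid dividing by zero (a choice made for this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Shift to zero mean and scale to unit variance (population std assumed)."""
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


budgets = [10.0, 20.0, 40.0]  # made-up values for illustration
print(minmax_scale(budgets))
print(standard_scale(budgets))
```

With "minmax", every output lies in [0, 1]. With "standard", outputs are centered on zero and can be negative.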

Supported Encoders

Column Type	Encoder	Argument	Default	Supported Value	Description
Categorical, ID	`Index`	`na_strategy`	`"separate"`	`"zero"` `"separate"` `"most_frequent"`	When set to `"zero"`, embeddings for missing values are represented as zero vectors. When set to `"separate"`, missing values are treated as a distinct category. When set to `"most_frequent"`, missing values are assigned to the most prevalent category.
Categorical, ID	`Index`	`min_occ`	`1`	positive integer	The minimum number of occurrences required for a category. If a category occurs fewer than `min_occ` times, Kumo treats it as N/A.
Categorical, ID	`Hash`	`na_strategy`	`"separate"`	`"zero"` `"separate"` `"most_frequent"`	When set to `"zero"`, embeddings for missing values are represented as zero vectors. When set to `"separate"`, missing values are treated as a distinct category. When set to `"most_frequent"`, missing values are assigned to the most prevalent category.
Categorical, ID	`Hash`	`num_components`	Depends on cardinality of the column	positive integer	The capacity of the hash table.
Categorical, ID	`Hash`	`min_occ`	Depends on cardinality of the column	positive integer	The minimum number of occurrences required for a category. If a category occurs fewer than `min_occ` times, Kumo treats it as N/A.
Categorical, ID	`Hash`	`na_strategy`	`"zero"`	`"zero"` `"separate"` `"most_frequent"`	When set to `"zero"`, embeddings for missing values are represented as zero vectors. When set to `"separate"`, missing values are treated as a distinct category. When set to `"most_frequent"`, missing values are assigned to the most prevalent category.
Multicategorical	`MultiCategorical`	`min_occ`	`1`	positive integer	The minimum number of occurrences required for a category. If a category occurs fewer than `min_occ` times, Kumo treats it as N/A.
Multicategorical	`MultiCategorical`	`sep`	Inferred by Kumo	string	The separator that splits a value into its individual categories.
Numerical	`Numerical`	`scaler`	`None`	`None` `"standard"` `"minmax"` `"robust"`	When set to `None`, no transformation is applied to the column values. When set to `"standard"`, the column values are transformed to have zero mean and unit variance. When set to `"minmax"`, the values are scaled to fall within the range `[0, 1]`. When set to `"robust"`, the column's median is subtracted from each value and the result is divided by the interquartile range.
Numerical	`Numerical`	`na_strategy`	`"mean"`	`"mean"` `"zero"`	If `"mean"`, N/A values are replaced with the mean value of the column. If `"zero"`, N/A values are replaced with zero.
Numerical	`MaxLogNumerical`	`na_strategy`	`"mean"`	`"mean"` `"zero"`	If `"mean"`, N/A values are replaced with the mean value of the column. If `"zero"`, N/A values are replaced with zero.
Numerical	`MinLogNumerical`	`na_strategy`	`"mean"`	`"mean"` `"zero"`	If `"mean"`, N/A values are replaced with the mean value of the column. If `"zero"`, N/A values are replaced with zero.
Embedding	`NumericalList`	`na_strategy`	`"zero"`	`"zero"`	If `"zero"`, N/A values are replaced with zero.
Timestamp	`Datetime`	`include_minute`	`true`	`true` `false`	Whether to include minute.
Timestamp	`Datetime`	`include_hour`	`true`	`true` `false`	Whether to include hour.
Timestamp	`Datetime`	`include_day_of_week`	`true`	`true` `false`	Whether to include day of week.
Timestamp	`Datetime`	`include_day_of_month`	`true`	`true` `false`	Whether to include day of month.
Timestamp	`Datetime`	`include_day_of_year`	`true`	`true` `false`	Whether to include day of year.
Timestamp	`Datetime`	`include_year`	`true`	`true` `false`	Whether to include year.
Timestamp	`Datetime`	`num_year_periods`	Depends on the difference between the min and max year in the column	positive integer	The number of periods to consider for encoding years. For example, with `num_year_periods=4`, the year is encoded as `year % i` for each `i` in `{ 2, 4, 8, 16 }`. If set to `None`, it is inferred from dataset statistics.
Text	`GloVe`	`model_name`	`"glove.6B"`	`"glove.6B"` `"glove.42B"` `"glove.840B"` `"glove_twitter.27B"`	The pretrained model name.
Text	`GloVe`	`embedding_dim`	`50`	`25` `50` `100` `200` `300`	The embedding dimension of the pretrained model. Not every model supports every embedding dimension. See the GloVe Argument Combinations table below.
Any type	`Null`	n/a	n/a	n/a	If `Null` is specified for a column, Kumo ignores the column completely.
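The `num_year_periods` description in the table above can be made concrete with a short Python sketch. It assumes the periods are successive powers of two, which matches the `{ 2, 4, 8, 16 }` example for `num_year_periods=4`. The function name `encode_year` is hypothetical, and Kumo's actual feature construction may differ.

```python
def encode_year(year: int, num_year_periods: int) -> list[int]:
    """Return year % i for i in 2, 4, ..., 2 ** num_year_periods.

    Illustrative only; the powers-of-two pattern is inferred from the
    documented example for num_year_periods=4.
    """
    return [year % (2 ** k) for k in range(1, num_year_periods + 1)]


print(encode_year(2023, 4))  # [1, 3, 7, 7]
print(encode_year(2024, 4))  # [0, 0, 0, 8]
```

Under this reading, larger values of `num_year_periods` distinguish years across a longer span, which fits the default depending on the gap between the minimum and maximum year in the column.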

GloVe Argument Combinations

`model_name`	`embedding_dim`
`"glove.6B"`	`50`, `100`, `200`, `300`
`"glove.42B"`	`300`
`"glove.840B"`	`300`
`"glove_twitter.27B"`	`25`, `50`, `100`, `200`
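The combinations in the table above can be checked before writing an override. The following Python sketch copies the table into a dictionary and tests a `model_name` and `embedding_dim` pair against it. The names `GLOVE_DIMS` and `is_valid_glove` are hypothetical and not part of Kumo.

```python
# Supported embedding dimensions per pretrained model, copied from the table above.
GLOVE_DIMS = {
    "glove.6B": {50, 100, 200, 300},
    "glove.42B": {300},
    "glove.840B": {300},
    "glove_twitter.27B": {25, 50, 100, 200},
}


def is_valid_glove(model_name: str = "glove.6B", embedding_dim: int = 50) -> bool:
    """Return True if the pair appears in the GloVe Argument Combinations table."""
    return embedding_dim in GLOVE_DIMS.get(model_name, set())


print(is_valid_glove("glove.42B", 300))  # True
print(is_valid_glove("glove.42B", 50))   # False
```

One practical consequence: the default `embedding_dim` of `50` is not listed for `"glove.42B"` or `"glove.840B"`, so an override that selects either model should also set `embedding_dim=300`, as the example on this page does.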