## Description

The `encoder_overrides` field allows you to configure the way Kumo processes your input data and override the encoders that are inferred by Kumo.

### Supported Task Types

- All

Nested fields require *exactly* four spaces of indentation. Each column is identified by two *case-sensitive* fields, the table name and the column name, joined by a period:

```
encoder_overrides:
    <table_name>.<column_name>: <encoder>
```

### Example

```
encoder_overrides:
    MOVIES.overview: GloVe(model_name="glove.6B", embedding_dim=50)
    MOVIES.tag_line: Null()
    MOVIES.budget: Numerical(scaler="minmax", na_strategy="zero")
```
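
The formatting rules above (four-space indentation, `<table_name>.<column_name>` keys) are easy to get wrong by hand. As a minimal sketch, the block can be generated from a dictionary. The helper `render_encoder_overrides` is hypothetical and is not part of Kumo; it only produces text in the layout shown in the example.

```python
from typing import Dict


def render_encoder_overrides(overrides: Dict[str, str]) -> str:
    """Render an `encoder_overrides` block with four-space indentation.

    `overrides` maps "<table_name>.<column_name>" to an encoder expression.
    Hypothetical helper for illustration; not part of Kumo.
    """
    lines = ["encoder_overrides:"]
    for key, encoder in overrides.items():
        table, dot, column = key.partition(".")
        if not (table and dot and column):
            raise ValueError(f"expected '<table_name>.<column_name>', got {key!r}")
        # Names are case sensitive, so the key is emitted exactly as given.
        lines.append(f"    {key}: {encoder}")
    return "\n".join(lines)


example = render_encoder_overrides({
    "MOVIES.overview": 'GloVe(model_name="glove.6B", embedding_dim=50)',
    "MOVIES.tag_line": "Null()",
    "MOVIES.budget": 'Numerical(scaler="minmax", na_strategy="zero")',
})
print(example)
```

The printed text reproduces the example, with each nested line indented by exactly four spaces.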

### Supported Encoders

| Column Type | Encoder | Argument | Default | Supported Values | Description |
|---|---|---|---|---|---|
| Categorical, ID | `Index` | `na_strategy` | `"separate"` | `"zero"`, `"separate"`, `"most_frequent"` | When set to `"zero"`, embeddings for missing values are represented as zero vectors. When set to `"separate"`, missing values are treated as a distinct category. When set to `"most_frequent"`, missing values are assigned to the most prevalent category. |
| Categorical, ID | `Index` | `min_occ` | `1` | positive integer | The minimum number of occurrences required for a category. If a category occurs fewer than `min_occ` times, Kumo treats it as N/A. |
| Categorical, ID | `Hash` | `na_strategy` | `"separate"` | `"zero"`, `"separate"`, `"most_frequent"` | When set to `"zero"`, embeddings for missing values are represented as zero vectors. When set to `"separate"`, missing values are treated as a distinct category. When set to `"most_frequent"`, missing values are assigned to the most prevalent category. |
| Categorical, ID | `Hash` | `num_components` | Depends on the cardinality of the column | positive integer | The capacity of the hash table. |
| Categorical, ID | `Hash` | `min_occ` | Depends on the cardinality of the column | positive integer | The minimum number of occurrences required for a category. If a category occurs fewer than `min_occ` times, Kumo treats it as N/A. |
| Categorical, ID | `Hash` | `na_strategy` | `"zero"` | `"zero"`, `"separate"`, `"most_frequent"` | When set to `"zero"`, embeddings for missing values are represented as zero vectors. When set to `"separate"`, missing values are treated as a distinct category. When set to `"most_frequent"`, missing values are assigned to the most prevalent category. |
| Multicategorical | `MultiCategorical` | `min_occ` | `1` | positive integer | The minimum number of occurrences required for a category. If a category occurs fewer than `min_occ` times, Kumo treats it as N/A. |
| Multicategorical | `MultiCategorical` | `sep` | Inferred by Kumo | string | The separator to use. |
| Numerical | `Numerical` | `scaler` | `None` | `None`, `"standard"`, `"minmax"`, `"robust"` | When set to `None`, no transformation is applied to the column values. When set to `"standard"`, the column values are transformed to have zero mean and unit variance. When set to `"minmax"`, the values are scaled to fall within the range [0, 1]. When set to `"robust"`, the column's median is subtracted from each value and the result is divided by the interquartile range. |
| Numerical | `Numerical` | `na_strategy` | `"mean"` | `"mean"`, `"zero"` | If `"mean"`, N/A values are replaced with the mean value of the column. If `"zero"`, N/A values are replaced with zero. |
| Numerical | `MaxLogNumerical` | `na_strategy` | `"mean"` | `"mean"`, `"zero"` | If `"mean"`, N/A values are replaced with the mean value of the column. If `"zero"`, N/A values are replaced with zero. |
| Numerical | `MinLogNumerical` | `na_strategy` | `"mean"` | `"mean"`, `"zero"` | If `"mean"`, N/A values are replaced with the mean value of the column. If `"zero"`, N/A values are replaced with zero. |
| Embedding | `NumericalList` | `na_strategy` | `"zero"` | `"zero"` | If `"zero"`, N/A values are replaced with zero. |
| Timestamp | `Datetime` | `include_minute` | `true` | `true`, `false` | Whether to include the minute. |
| Timestamp | `Datetime` | `include_hour` | `true` | `true`, `false` | Whether to include the hour. |
| Timestamp | `Datetime` | `include_day_of_week` | `true` | `true`, `false` | Whether to include the day of the week. |
| Timestamp | `Datetime` | `include_day_of_month` | `true` | `true`, `false` | Whether to include the day of the month. |
| Timestamp | `Datetime` | `include_day_of_year` | `true` | `true`, `false` | Whether to include the day of the year. |
| Timestamp | `Datetime` | `include_year` | `true` | `true`, `false` | Whether to include the year. |
| Timestamp | `Datetime` | `num_year_periods` | Depends on the difference between the minimum and maximum year in the column | positive integer | The number of periods to use when encoding years. For example, with `num_year_periods=4`, the year is encoded as `year % i` for each `i` in `{2, 4, 8, 16}`. If set to `None`, it is inferred from dataset statistics. |
| Text | `GloVe` | `model_name` | `"glove.840B"` | `"glove.6B"`, `"glove.42B"`, `"glove.840B"`, `"glove_twitter.27B"` | The pretrained model name. |
| Text | `GloVe` | `embedding_dim` | `300` | `25`, `50`, `100`, `200`, `300` | The embedding dimension of the pretrained model. Not every model supports every embedding dimension. See the GloVe Argument Combinations table below. |
| Any type | `Null` | n/a | n/a | n/a | If `Null` is specified for a column, Kumo ignores the column completely. |
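
To make the arithmetic behind the `scaler` and `num_year_periods` descriptions concrete, here is a small sketch in plain Python. It is not Kumo's implementation. The function names are hypothetical, and two details are assumptions the table does not state: the use of the population standard deviation and inclusive quartiles, and the generalization of the `{2, 4, 8, 16}` example to powers of two up to `2**num_year_periods`.

```python
from statistics import mean, median, pstdev, quantiles
from typing import List, Optional


def scale(values: List[float], scaler: Optional[str] = None) -> List[float]:
    """Apply one of the documented `scaler` options (illustrative only).

    Assumes the column is not constant, so no divisor is zero.
    """
    if scaler is None:
        return list(values)  # no transformation
    if scaler == "standard":
        # Zero mean, unit variance (population standard deviation assumed).
        mu, sigma = mean(values), pstdev(values)
        return [(v - mu) / sigma for v in values]
    if scaler == "minmax":
        # Rescale into the range [0, 1].
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    if scaler == "robust":
        # Subtract the median, then divide by the interquartile range.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        med = median(values)
        return [(v - med) / (q3 - q1) for v in values]
    raise ValueError(f"unsupported scaler: {scaler!r}")


def year_period_features(year: int, num_year_periods: int) -> List[int]:
    """Return `year % i` for i in 2, 4, ..., 2**num_year_periods (assumed pattern)."""
    return [year % (2 ** k) for k in range(1, num_year_periods + 1)]


print(scale([1, 2, 3, 4, 5], "minmax"))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(scale([1, 2, 3, 4, 5], "robust"))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
print(year_period_features(2023, 4))     # [1, 3, 7, 7]
```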

### GloVe Argument Combinations

| `model_name` | `embedding_dim` |
|---|---|
| `"glove.6B"` | `50`, `100`, `200`, `300` |
| `"glove.42B"` | `300` |
| `"glove.840B"` | `300` |
| `"glove_twitter.27B"` | `25`, `50`, `100`, `200` |
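
The combinations in this table can be checked before a configuration is submitted. The following is a minimal sketch, assuming you validate arguments on your side; `GLOVE_DIMS` and `check_glove_args` are hypothetical names, not part of Kumo, and the defaults mirror those listed under Supported Encoders.

```python
# Valid (model_name, embedding_dim) pairs, copied from the table above.
GLOVE_DIMS = {
    "glove.6B": {50, 100, 200, 300},
    "glove.42B": {300},
    "glove.840B": {300},
    "glove_twitter.27B": {25, 50, 100, 200},
}


def check_glove_args(model_name: str = "glove.840B", embedding_dim: int = 300) -> None:
    """Raise ValueError if the pair is not listed in the combinations table."""
    if model_name not in GLOVE_DIMS:
        raise ValueError(f"unknown model_name: {model_name!r}")
    if embedding_dim not in GLOVE_DIMS[model_name]:
        allowed = sorted(GLOVE_DIMS[model_name])
        raise ValueError(
            f"{model_name!r} does not support embedding_dim={embedding_dim}; "
            f"allowed values: {allowed}"
        )


check_glove_args("glove.6B", 50)  # valid pair, returns None
try:
    check_glove_args("glove.42B", 50)  # not listed in the table
except ValueError as err:
    print(err)
```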
