Description
The encoder_overrides field lets you configure how Kumo processes your input data and override the encoders that Kumo infers.
Supported Task Types
- All
Nested fields require exactly four spaces of indentation. Each column must be specified with two case-sensitive fields:
encoder_overrides:
    <table_name>.<column_name>: <encoder>
Example
This example overrides three columns in the MOVIES table to change the default behavior.
encoder_overrides:
    MOVIES.overview: GloVe(model_name="glove.42B", embedding_dim=300)
    MOVIES.tag_line: Null()
    MOVIES.budget: Numerical(scaler="minmax", na_strategy="zero")
With the example above, Kumo handles the overridden columns as follows:
- For the text column MOVIES.overview, it uses the model "glove.42B" instead of the default "glove.6B". See the "GloVe Argument Combinations" section below on this page for available options, and see the GloVe project for details.
- Kumo ignores the column MOVIES.tag_line, and the column has no impact on modelling because it is set to Null().
- For the numerical column MOVIES.budget, Kumo scales numerical values to the range [0, 1] because it is set to "minmax" instead of the default "standard".
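To make the difference between the "minmax" and "standard" scalers concrete, the following is a minimal sketch of the usual textbook definitions of both transformations. It is not Kumo's implementation: the function names, the handling of constant columns, and the budget values are assumptions made up for illustration.

```python
from statistics import mean, pstdev


def minmax_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Shift values to zero mean and scale them to unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)  # Assumption: population standard deviation.
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical budget values, not taken from any real dataset.
budgets = [10.0, 20.0, 40.0]
print(minmax_scale(budgets))    # every value lies within [0, 1]
print(standard_scale(budgets))  # values are centered around 0
```

With min-max scaling the output is bounded, whereas standard scaling produces unbounded values centered on zero, so the choice mainly matters for columns with a skewed range such as budgets.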
Supported Encoders
Column Type | Encoder | Argument | Default | Supported Value | Description |
---|---|---|---|---|---|
Categorical, ID | | | | | When set to |
Categorical, ID | | | | positive integer | The minimal count to allow within each category. If a category count is lower than |
Categorical, ID | | | | | When set to |
Categorical, ID | | | Depends on cardinality of the column | positive integer | The capacity of the hash table. |
Categorical, ID | | | Depends on cardinality of the column | positive integer | The minimal count to allow within each category. If a category count is lower than |
Categorical, ID | | | | | When set to |
Multicategorical | | | | positive integer | The minimal count to allow within each category. If a category count is lower than |
Multicategorical | | | Inferred by Kumo | string | The separator to use. |
Numerical | | | | | When set to |
Numerical | | | | | If |
Numerical | | | | | If |
Numerical | | | | | If |
Embedding | | | | | If |
Timestamp | | | | | Whether to include minute. |
Timestamp | | | | | Whether to include hour. |
Timestamp | | | | | Whether to include day of week. |
Timestamp | | | | | Whether to include day of month. |
Timestamp | | | | | Whether to include day of year. |
Timestamp | | | | | Whether to include year. |
Timestamp | | | Depends on the difference between the min and max year in the column | positive integer | The number of periods to consider for encoding years, e.g., in case |
Text | GloVe | model_name | "glove.6B" | "glove.6B", "glove.42B", "glove.840B", "glove_twitter.27B" | The pretrained model name. |
Text | GloVe | embedding_dim | | 25, 50, 100, 200, 300 | The embedding dimension of the pretrained model. Note that not every model supports every embedding dimension. See the GloVe Argument Combinations table below. |
Any type | Null | n/a | n/a | n/a | If |
GloVe Argument Combinations
model_name | embedding_dim |
---|---|
"glove.6B" | 50, 100, 200, 300 |
"glove.42B" | 300 |
"glove.840B" | 300 |
"glove_twitter.27B" | 25, 50, 100, 200 |
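Because only some pairs of model_name and embedding_dim are valid, it can help to check a pair before writing it into encoder_overrides. The sketch below simply encodes the table above as a lookup. The names GLOVE_COMBINATIONS and is_valid_glove_combination are hypothetical helpers and are not part of Kumo.

```python
# Valid embedding dimensions per pretrained model, copied from the table above.
GLOVE_COMBINATIONS = {
    "glove.6B": {50, 100, 200, 300},
    "glove.42B": {300},
    "glove.840B": {300},
    "glove_twitter.27B": {25, 50, 100, 200},
}


def is_valid_glove_combination(model_name, embedding_dim):
    """Return True if the (model_name, embedding_dim) pair appears in the table."""
    # Model names are matched case-sensitively; unknown names are never valid.
    return embedding_dim in GLOVE_COMBINATIONS.get(model_name, set())


print(is_valid_glove_combination("glove.42B", 300))          # pair listed in the table
print(is_valid_glove_combination("glove_twitter.27B", 300))  # pair not listed in the table
```

For example, GloVe(model_name="glove_twitter.27B", embedding_dim=300) would not pass this check, because the table lists only 25, 50, 100, and 200 for that model.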