encoder_overrides

encoder_overrides (Optional)

Description

The encoder_overrides field allows you to configure the way Kumo processes your input data and override the encoders that are inferred by Kumo.

Supported Task Types

  • All

Nested entries must be indented by exactly four spaces. Identify each column by two case-sensitive fields, the table name and the column name, joined by a period:

encoder_overrides:
    <table_name>.<column_name>: <encoder>

Example

This example overrides the default encoders for three columns in the MOVIES table.

encoder_overrides:
    MOVIES.overview: GloVe(model_name="glove.42B", embedding_dim=300)
    MOVIES.tag_line: Null()
    MOVIES.budget: Numerical(scaler="minmax", na_strategy="zero")

With the example above, Kumo handles the overridden columns as follows:

  • For the text column MOVIES.overview, Kumo uses the model "glove.42B" instead of the default "glove.6B". See the "GloVe Argument Combinations" section below for the available options, and the GloVe project for details.
  • Kumo ignores the column MOVIES.tag_line because it is set to Null(), so the column has no impact on modeling.
  • For the numerical column MOVIES.budget, Kumo scales values into the range [0, 1] because scaler is set to "minmax" instead of the default "standard".
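
For reference, min-max scaling into [0, 1] is conventionally written as below. This is the standard textbook definition, not a formula taken from Kumo's documentation, so treat the exact computation Kumo performs as an assumption; the symbols x_min and x_max (the smallest and largest observed values of the column) are introduced here for illustration.

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad 0 \le x' \le 1
```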

Supported Encoders

Column Type | Encoder | Argument | Default | Supported Values | Description
--- | --- | --- | --- | --- | ---
Categorical, ID | Index | na_strategy | "separate" | "zero", "separate", "most_frequent" | When set to "zero", embeddings for missing values are represented as zero vectors. When set to "separate", missing values are treated as a distinct category. When set to "most_frequent", missing values are assigned to the most prevalent category.
Categorical, ID | Index | min_occ | 1 | positive integer | The minimum count to allow within each category. If a category count is lower than min_occ, Kumo treats the category as N/A.
Categorical, ID | Hash | na_strategy | "separate" | "zero", "separate", "most_frequent" | When set to "zero", embeddings for missing values are represented as zero vectors. When set to "separate", missing values are treated as a distinct category. When set to "most_frequent", missing values are assigned to the most prevalent category.
Categorical, ID | Hash | num_components | Depends on cardinality of the column | positive integer | The capacity of the hash table.
Categorical, ID | Hash | min_occ | Depends on cardinality of the column | positive integer | The minimum count to allow within each category. If a category count is lower than min_occ, Kumo treats the category as N/A.
Categorical, ID | Hash | na_strategy | "zero" | "zero", "separate", "most_frequent" | When set to "zero", embeddings for missing values are represented as zero vectors. When set to "separate", missing values are treated as a distinct category. When set to "most_frequent", missing values are assigned to the most prevalent category.
Multicategorical | MultiCategorical | min_occ | 1 | positive integer | The minimum count to allow within each category. If a category count is lower than min_occ, Kumo treats the category as N/A.
Multicategorical | MultiCategorical | sep | Inferred by Kumo | string | The separator to use.
Numerical | Numerical | scaler | None | None, "standard", "minmax", "robust" | When set to None, no transformation is applied to the column values. When set to "standard", the column values are transformed to have zero mean and unit variance. When set to "minmax", the values are scaled to fall within the range [0, 1]. When set to "robust", the feature's median is subtracted from the values and the result is divided by the interquartile range.
Numerical | Numerical | na_strategy | "mean" | "mean", "zero" | If "mean", N/A values are replaced with the mean value of the column. If "zero", N/A values are replaced with zero.
Numerical | MaxLogNumerical | na_strategy | "mean" | "mean", "zero" | If "mean", N/A values are replaced with the mean value of the column. If "zero", N/A values are replaced with zero.
Numerical | MinLogNumerical | na_strategy | "mean" | "mean", "zero" | If "mean", N/A values are replaced with the mean value of the column. If "zero", N/A values are replaced with zero.
Embedding | NumericalList | na_strategy | "zero" | "zero" | If "zero", N/A values are replaced with zero.
Timestamp | Datetime | include_minute | true | true, false | Whether to include the minute.
Timestamp | Datetime | include_hour | true | true, false | Whether to include the hour.
Timestamp | Datetime | include_day_of_week | true | true, false | Whether to include the day of the week.
Timestamp | Datetime | include_day_of_month | true | true, false | Whether to include the day of the month.
Timestamp | Datetime | include_day_of_year | true | true, false | Whether to include the day of the year.
Timestamp | Datetime | include_year | true | true, false | Whether to include the year.
Timestamp | Datetime | num_year_periods | Depends on the difference between the min and max year in the column | positive integer | The number of periods to consider when encoding years. For example, with num_year_periods=4, the year is encoded as year % i for each i in { 2, 4, 8, 16 }. If set to None, it is inferred from dataset statistics.
Text | GloVe | model_name | "glove.6B" | "glove.6B", "glove.42B", "glove.840B", "glove_twitter.27B" | The pretrained model name.
Text | GloVe | embedding_dim | 50 | 25, 50, 100, 200, 300 | The embedding dimension of the pretrained model. Not every model supports every embedding dimension. See the GloVe Argument Combinations table below.
Any type | Null | n/a | n/a | n/a | If Null is specified for a column, Kumo ignores the column completely.
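
As a rough sketch of how the arguments listed above might be combined, the fragment below overrides one column of each of several types. The table and column names (USERS.country, ORDERS.amount, and so on) and the numeric values are hypothetical and chosen only for illustration. The encoder names and arguments come from the Supported Encoders listing, and the call syntax is assumed to follow the MOVIES example earlier on this page.

```yaml
encoder_overrides:
    # Hypothetical tables and columns; only the encoder names and arguments are from this page.
    USERS.country: Index(na_strategy="most_frequent", min_occ=5)
    USERS.device_id: Hash(num_components=1000, min_occ=2)
    ITEMS.tags: MultiCategorical(sep=",")
    ORDERS.amount: Numerical(scaler="robust", na_strategy="zero")
    ORDERS.created_at: Datetime(num_year_periods=4)
```

Any argument left out of an override is assumed to fall back to the default listed for that encoder.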

GloVe Argument Combinations

model_name | embedding_dim
--- | ---
"glove.6B" | 50, 100, 200, 300
"glove.42B" | 300
"glove.840B" | 300
"glove_twitter.27B" | 25, 50, 100, 200
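
The combinations above could be applied as in the following sketch. The table and column names are hypothetical, and each model_name and embedding_dim pair is one that appears in the GloVe Argument Combinations listing.

```yaml
encoder_overrides:
    # Hypothetical text columns; every model_name/embedding_dim pair below is a listed combination.
    TWEETS.body: GloVe(model_name="glove_twitter.27B", embedding_dim=200)
    REVIEWS.summary: GloVe(model_name="glove.6B", embedding_dim=100)
    REVIEWS.full_text: GloVe(model_name="glove.840B", embedding_dim=300)
```

A pair that is not listed, such as "glove.42B" with embedding_dim=50, should be assumed unsupported.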