
encoder_overrides

encoder_overrides (Optional)

Description

The encoder_overrides field lets you control how Kumo processes individual input columns by overriding the encoders that Kumo infers for them.

Supported Task Types

  • All

Nested fields must be indented with exactly four spaces. Specify each column with two case-sensitive fields, the table name and the column name, followed by the encoder to use:

encoder_overrides:
    <table_name>.<column_name>: <encoder>

Example

encoder_overrides:
    MOVIES.overview: GloVe(model_name="glove.6B", embedding_dim=50)
    MOVIES.tag_line: Null()
    MOVIES.budget: Numerical(scaler="minmax", na_strategy="zero")
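The formatting rules above (four-space indentation and case-sensitive <table_name>.<column_name> keys) can also be applied programmatically. The following is a minimal Python sketch, not part of Kumo: the helper name render_encoder_overrides is hypothetical, and it only assembles text in the layout shown in the example.

```python
def render_encoder_overrides(overrides):
    """Render a {(table, column): encoder} mapping as an encoder_overrides block.

    Hypothetical helper for illustration only; it is not a Kumo API.
    """
    lines = ["encoder_overrides:"]
    for (table, column), encoder in overrides.items():
        # Nested entries use exactly four spaces. Table and column names are
        # written verbatim because they are case sensitive.
        lines.append(f"    {table}.{column}: {encoder}")
    return "\n".join(lines)


block = render_encoder_overrides({
    ("MOVIES", "overview"): 'GloVe(model_name="glove.6B", embedding_dim=50)',
    ("MOVIES", "tag_line"): "Null()",
    ("MOVIES", "budget"): 'Numerical(scaler="minmax", na_strategy="zero")',
})
print(block)
```

Printing block reproduces the example configuration shown above.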

Supported Encoders

Column Type | Encoder | Argument | Default | Supported Values | Description
--- | --- | --- | --- | --- | ---
Categorical, ID | Index | na_strategy | "separate" | "zero", "separate", "most_frequent" | When set to "zero", embeddings for missing values are represented as zero vectors. When set to "separate", missing values are treated as a distinct category. When set to "most_frequent", missing values are assigned to the most frequent category.
Categorical, ID | Index | min_occ | 1 | positive integer | The minimum number of occurrences required for a category. If a category occurs fewer than min_occ times, Kumo treats it as N/A.
Categorical, ID | Hash | na_strategy | "separate" | "zero", "separate", "most_frequent" | When set to "zero", embeddings for missing values are represented as zero vectors. When set to "separate", missing values are treated as a distinct category. When set to "most_frequent", missing values are assigned to the most frequent category.
Categorical, ID | Hash | num_components | Depends on the cardinality of the column | positive integer | The capacity of the hash table.
Categorical, ID | Hash | min_occ | Depends on the cardinality of the column | positive integer | The minimum number of occurrences required for a category. If a category occurs fewer than min_occ times, Kumo treats it as N/A.
Categorical, ID | Hash | na_strategy | "zero" | "zero", "separate", "most_frequent" | When set to "zero", embeddings for missing values are represented as zero vectors. When set to "separate", missing values are treated as a distinct category. When set to "most_frequent", missing values are assigned to the most frequent category.
Multicategorical | MultiCategorical | min_occ | 1 | positive integer | The minimum number of occurrences required for a category. If a category occurs fewer than min_occ times, Kumo treats it as N/A.
Multicategorical | MultiCategorical | sep | Inferred by Kumo | string | The separator used to split each value into categories.
Numerical | Numerical | scaler | None | None, "standard", "minmax", "robust" | When set to None, no transformation is applied to the column values. When set to "standard", the values are transformed to have zero mean and unit variance. When set to "minmax", the values are scaled to fall within the range [0, 1]. When set to "robust", the column median is subtracted from each value and the result is divided by the interquartile range.
Numerical | Numerical | na_strategy | "mean" | "mean", "zero" | If "mean", N/A values are replaced with the mean value of the column. If "zero", N/A values are replaced with zero.
Numerical | MaxLogNumerical | na_strategy | "mean" | "mean", "zero" | If "mean", N/A values are replaced with the mean value of the column. If "zero", N/A values are replaced with zero.
Numerical | MinLogNumerical | na_strategy | "mean" | "mean", "zero" | If "mean", N/A values are replaced with the mean value of the column. If "zero", N/A values are replaced with zero.
Embedding | NumericalList | na_strategy | "zero" | "zero" | If "zero", N/A values are replaced with zero.
Timestamp | Datetime | include_minute | true | true, false | Whether to include the minute.
Timestamp | Datetime | include_hour | true | true, false | Whether to include the hour.
Timestamp | Datetime | include_day_of_week | true | true, false | Whether to include the day of the week.
Timestamp | Datetime | include_day_of_month | true | true, false | Whether to include the day of the month.
Timestamp | Datetime | include_day_of_year | true | true, false | Whether to include the day of the year.
Timestamp | Datetime | include_year | true | true, false | Whether to include the year.
Timestamp | Datetime | num_year_periods | Depends on the difference between the minimum and maximum year in the column | positive integer | The number of periods to consider when encoding years. For example, with num_year_periods=4, the year is encoded as year % i for each i in {2, 4, 8, 16}. If set to None, the value is inferred from dataset statistics.
Text | GloVe | model_name | "glove.840B" | "glove.6B", "glove.42B", "glove.840B", "glove_twitter.27B" | The name of the pretrained model.
Text | GloVe | embedding_dim | 300 | 25, 50, 100, 200, 300 | The embedding dimension of the pretrained model. Not every model supports every embedding dimension. See the GloVe Argument Combinations table below.
Any type | Null | n/a | n/a | n/a | If Null is specified for a column, Kumo ignores that column completely.
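Two of the descriptions in the table are easier to follow with numbers: the scaler options of the Numerical encoder and the num_year_periods argument of the Datetime encoder. The Python sketch below only illustrates the arithmetic described in the table and is not Kumo's implementation. The function names are hypothetical, and the use of the population standard deviation, inclusive quartile interpolation, and powers of two for the year periods (generalized from the single num_year_periods=4 example) are all assumptions.

```python
from statistics import mean, median, pstdev, quantiles


def scale(values, scaler=None):
    """Illustrate the scaler options; assumes a non-constant column without N/A."""
    if scaler is None:
        return list(values)  # no transformation
    if scaler == "standard":
        # Zero mean and unit variance (population standard deviation assumed).
        mu, sigma = mean(values), pstdev(values)
        return [(v - mu) / sigma for v in values]
    if scaler == "minmax":
        # Rescale into the range [0, 1].
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    if scaler == "robust":
        # Subtract the median, then divide by the interquartile range.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        med = median(values)
        return [(v - med) / (q3 - q1) for v in values]
    raise ValueError(f"unsupported scaler: {scaler!r}")


def year_period_features(year, num_year_periods):
    """Return year % i for i in 2, 4, ..., 2 ** num_year_periods (assumed pattern)."""
    return [year % (2 ** k) for k in range(1, num_year_periods + 1)]


column = [1.0, 2.0, 3.0, 4.0, 5.0]
print(scale(column, "minmax"))        # [0.0, 0.25, 0.5, 0.75, 1.0]
print(scale(column, "robust"))        # [-1.0, -0.5, 0.0, 0.5, 1.0]
print(year_period_features(2023, 4))  # [1, 3, 7, 7]
```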

GloVe Argument Combinations

model_name | embedding_dim
--- | ---
"glove.6B" | 50, 100, 200, 300
"glove.42B" | 300
"glove.840B" | 300
"glove_twitter.27B" | 25, 50, 100, 200
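
Because only some model_name and embedding_dim pairs are valid, it can help to check a combination before writing it into a configuration. The following Python sketch encodes the table above as a lookup. The names GLOVE_DIMS and check_glove_args are hypothetical and not part of Kumo, and the defaults are taken from the Supported Encoders table.

```python
# Supported embedding dimensions per pretrained model, copied from the table above.
GLOVE_DIMS = {
    "glove.6B": {50, 100, 200, 300},
    "glove.42B": {300},
    "glove.840B": {300},
    "glove_twitter.27B": {25, 50, 100, 200},
}


def check_glove_args(model_name="glove.840B", embedding_dim=300):
    """Return a GloVe(...) encoder string, or raise if the pair is unsupported."""
    if model_name not in GLOVE_DIMS:
        raise ValueError(f"unknown model_name: {model_name!r}")
    if embedding_dim not in GLOVE_DIMS[model_name]:
        supported = sorted(GLOVE_DIMS[model_name])
        raise ValueError(
            f"{model_name!r} does not support embedding_dim={embedding_dim}; "
            f"supported values: {supported}"
        )
    return f'GloVe(model_name="{model_name}", embedding_dim={embedding_dim})'


print(check_glove_args("glove.6B", 50))
```

Assuming the default embedding_dim of 300 applies regardless of model_name, switching to "glove_twitter.27B" would also require setting embedding_dim explicitly, since that model does not list 300 among its dimensions.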