Column Preprocessing
Based on the type of column ingested, you can specify what kinds of preprocessing steps to apply.
Kumo supports preprocessing for the following generic types:
- Numerical: integers and floats where the ordering of numbers from lower to higher values has semantic meaning (e.g., product price, percentage discount).
- Categorical: string or boolean values, typically a single token in length, where the language content of the string has limited semantic meaning and the number of unique values is limited (up to 4000, by default). Examples include a boolean value representing whether a member has activated their premium subscription, or a product type indicating whether the product is in electronics, clothing, shoes, etc.
- Multi-categorical: comma-separated, variable-length lists of categorical values (e.g., a list of tags associated with a restaurant like "vegetarian", "italian", "pickup_only").
- ID: integer, string, or float values whose numeric ordering, if any, carries no semantic meaning (i.e., a lower or higher value does not imply anything). Typically, the key columns used to link tables are of the ID type (e.g., customer ID, product group number).
- Text: string values that are typically multiple tokens in length, where the actual language content of the value has semantic meaning and should be incorporated into any predictive modeling (e.g., product description, product review).
- Timestamp: string or format-specific date/timestamp values. If the column contains string values, they must be in a valid date/time format (preferably ISO 8601); if it contains numeric values, they should represent epoch time. If data is provided to Kumo in type-safe Parquet format, columns with dates/times should be cast to one of the DATE/TIME/TIMESTAMP data types before loading.
- Embedding: lists of floats, all of equal length; typically latent representations from another AI model.
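As an illustration of the Timestamp requirements above, the following is a minimal sketch of normalizing date values to ISO 8601 before loading. The helper names are hypothetical and not part of Kumo, and the sketch assumes that numeric epoch values are expressed in seconds, which the text above does not specify.

```python
from datetime import datetime, timezone


def epoch_to_iso8601(seconds):
    """Convert epoch time to an ISO 8601 string in UTC.

    Assumption: the value is in seconds (not milliseconds).
    """
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


def parse_to_iso8601(text, fmt):
    """Parse a date string with a known strptime format and return ISO 8601.

    The result has no UTC offset because the input carries none.
    """
    return datetime.strptime(text, fmt).isoformat()


# Hypothetical usage on raw column values
print(epoch_to_iso8601(0))
print(parse_to_iso8601("03/15/2024 14:30", "%m/%d/%Y %H:%M"))
```

If your source values are in milliseconds, divide by 1000 before calling the first helper.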
Nested schemas and complex types such as arrays and maps are not supported at this time. To use these complex column types in Kumo, first transform them to a string type (e.g., comma-separated strings).
For example, if your column is an array of strings like ["TV", "electronics", "promotion"], you should transform it to a string like "TV, electronics, promotion".
Note: Kumo will alert you if any of these unsupported column types are detected.
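The array-to-string transformation described above could be sketched as follows. The function name is hypothetical, and rejecting elements that contain a comma is an assumption made here so that the resulting string can still be split unambiguously; it is not a documented Kumo rule.

```python
def list_to_comma_separated(values, sep=", "):
    """Join a list of values into one comma-separated string.

    Assumption: elements must not contain commas themselves, since a
    comma inside an element would be read as a separator.
    """
    cleaned = []
    for value in values:
        text = str(value).strip()
        if "," in text:
            raise ValueError(f"element contains a comma: {text!r}")
        cleaned.append(text)
    return sep.join(cleaned)


# The example from the text above
print(list_to_comma_separated(["TV", "electronics", "promotion"]))
# prints: TV, electronics, promotion
```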
Furthermore, Kumo does not yet offer domain-specific preprocessing for the following data types:
- Full, Raw URLs: for preprocessing outside of Kumo, extract semantically important components of the URL and treat them as categorical values.
- Lat/Long Coordinates: for preprocessing outside of Kumo, convert them to specific geographic areas and treat those areas as categorical values.
- IP Addresses: for preprocessing outside of Kumo, remove any personally identifiable information (PII), extract high-level components (e.g., subnet), and treat them as categorical values.
- Phone Numbers: for preprocessing outside of Kumo, remove any personally identifiable information (PII), extract high-level components (e.g., area code), and treat them as categorical values.
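The suggestions above could be implemented outside of Kumo in many ways; the following is one minimal sketch using only the Python standard library. All function names are hypothetical. The choices of which URL components to keep, the 1-degree grid cell, the /24 and /48 prefix lengths, and the North American phone number layout are all assumptions made for illustration, and should be adapted to your data.

```python
import ipaddress
import math
import re
from urllib.parse import urlsplit


def url_features(url):
    """Keep only the host and the first path segment of a URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return {
        "domain": parts.hostname or "",
        "top_path": segments[0] if segments else "",
    }


def latlong_cell(lat, lon, cell_deg=1.0):
    """Bucket coordinates into a grid cell label (assumed 1-degree cells)."""
    row = math.floor(lat / cell_deg)
    col = math.floor(lon / cell_deg)
    return f"{row}_{col}"


def ip_subnet(ip, v4_prefix=24, v6_prefix=48):
    """Replace an IP address with its enclosing subnet (assumed prefixes)."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def phone_area_code(number):
    """Extract a 3-digit area code, assuming a North American number."""
    digits = re.sub(r"\D", "", number)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    return digits[:3] if len(digits) == 10 else ""


print(url_features("https://shop.example.com/electronics/tv?id=42"))
print(latlong_cell(37.77, -122.42))
print(ip_subnet("192.168.1.77"))
print(phone_area_code("+1 (415) 555-0123"))
```

Each helper returns a coarse value that can be loaded as a categorical column, while the full URL, exact coordinates, host address, and subscriber number are dropped.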