What types of data can Kumo ingest?

Kumo can ingest data from several source types. You configure a connector for any of the following:

AWS S3 (CSV or Parquet files)
Snowflake - Tables and Views
Databricks - Tables and Views
Google Cloud BigQuery - Tables

You can also upload a local file (CSV or Parquet, less than 1 GB) for ingestion into Kumo. In this case, skip connector creation and create a Kumo table directly by selecting Local File Upload.
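
Before uploading, it can help to check a file against the limits above. The following is a minimal sketch, not part of any Kumo API: the helper names are hypothetical, the format is guessed from the file extension, and "1 GB" is assumed to mean 1024^3 bytes (the exact cutoff Kumo applies is not stated here).

```python
from pathlib import Path

# Assumption: "1 GB" is read as 1024**3 bytes; Kumo's exact cutoff may differ.
MAX_UPLOAD_BYTES = 1024 ** 3
ALLOWED_SUFFIXES = {".csv", ".parquet"}


def is_uploadable(file_name: str, size_bytes: int) -> bool:
    """Return True if the name looks like CSV/Parquet and the size is under the limit."""
    suffix = Path(file_name).suffix.lower()
    return suffix in ALLOWED_SUFFIXES and size_bytes < MAX_UPLOAD_BYTES


def check_local_file(path: str) -> bool:
    """Apply the same check to a file on disk (hypothetical helper)."""
    p = Path(path)
    return p.is_file() and is_uploadable(p.name, p.stat().st_size)
```

A file that fails this check would need to go through one of the connectors listed above instead.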

Kumo automatically preprocesses several data types when creating your Kumo table columns, including:

Numerical: Integers, floats
Categorical: Boolean or string values, typically a single token in length
Text: String values, typically multiple tokens in length, where the actual language content of the value has semantic meaning
Multi-categorical: Concatenation of multiple categories under a single string representation
ID: Numerical values used to uniquely identify different entities
Timestamp: Time/date information (for extracting year/month/date/hour/minute when applicable)
Embeddings: Lists of floats, all of equal length, typically the output of another AI model

Column types are automatically detected using heuristics on the distribution of values in each column's data, and can also be manually configured.
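
To make the idea of heuristic detection concrete, here is a rough sketch of how a semantic type could be guessed from a column's values. It is an illustration only and not Kumo's actual detection logic: the function name, the `|` separator for multi-categorical values, and the token threshold are all assumptions.

```python
from datetime import datetime


def infer_semantic_type(values, separator="|", text_token_threshold=3):
    """Guess a semantic type for one column from its non-null values (hypothetical)."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"

    # Embedding: every value is a non-empty list of floats, all of equal length.
    if all(isinstance(v, list) and v and all(isinstance(x, float) for x in v)
           for v in present):
        if len({len(v) for v in present}) == 1:
            return "embedding"
        return "unknown"

    if all(isinstance(v, datetime) for v in present):
        return "timestamp"

    # bool is a subclass of int in Python, so test for it before numbers.
    if all(isinstance(v, bool) for v in present):
        return "categorical"

    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        all_ints = all(isinstance(v, int) for v in present)
        all_unique = len(set(present)) == len(present)
        # Assumption: all-unique integers look like identifiers.
        if all_ints and all_unique and len(present) > 1:
            return "ID"
        return "numerical"

    if all(isinstance(v, str) for v in present):
        # Assumption: a separator character signals several categories in one string.
        if any(separator in v for v in present):
            return "multi-categorical"
        average_tokens = sum(len(v.split()) for v in present) / len(present)
        if average_tokens > text_token_threshold:
            return "text"
        return "categorical"

    return "unknown"
```

A simple rule set like this misclassifies easily (for example, a column of distinct ages would be labeled as an ID), which is why manual configuration of column types remains useful.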

Some notes on data types and semantic types:

When the numerical semantic type is selected for a column with a NUMBER, DECIMAL, or NUMERICAL data type, the internal data is stored as int64 or float. When the ID or CATEGORICAL semantic type is selected for such a column, the internal data is stored as int64 if the column has no null values, or as float64 if it has null values. Values outside the range −2^53 to 2^53 (−9,007,199,254,740,992 to 9,007,199,254,740,992) cannot be represented exactly by float64, so precision is lost when such IDs are processed.
For date or time columns in Kumo, if a column is chosen as the created time column or end time column, rows are filtered out when the timestamp is null or outside the pandas timestamp range (earlier than 1677-09-21 00:12:43.145225).
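
The two notes above can be checked on your own data before ingestion. This is a minimal sketch using only the Python standard library. The helper names are hypothetical, timestamps are assumed to be naive `datetime` objects, and the upper bound is pandas' documented maximum timestamp (truncated to microseconds), which this page does not state explicitly.

```python
from datetime import datetime

# Integers are exactly representable in float64 only up to 2**53 in magnitude.
FLOAT64_EXACT_LIMIT = 2 ** 53  # 9,007,199,254,740,992

# Lower bound quoted in the note above.
MIN_TIMESTAMP = datetime(1677, 9, 21, 0, 12, 43, 145225)
# Assumption: pandas' maximum timestamp, truncated to microseconds.
MAX_TIMESTAMP = datetime(2262, 4, 11, 23, 47, 16, 854775)


def id_is_float64_safe(value: int) -> bool:
    """True if the integer ID lies within the exactly representable range."""
    return -FLOAT64_EXACT_LIMIT <= value <= FLOAT64_EXACT_LIMIT


def timestamp_is_kept(ts) -> bool:
    """True if a time-column value is non-null and inside the timestamp range."""
    return ts is not None and MIN_TIMESTAMP <= ts <= MAX_TIMESTAMP


def filter_rows(rows, time_column):
    """Keep only rows (dicts) whose time column passes timestamp_is_kept."""
    return [row for row in rows if timestamp_is_kept(row.get(time_column))]


# Two distinct integers just past the limit collapse to the same float64 value,
# which is how distinct IDs can become indistinguishable.
ids_collide = float(FLOAT64_EXACT_LIMIT) == float(FLOAT64_EXACT_LIMIT + 1)  # True
```

If an ID column contains nulls and values beyond the limit, one option is to supply the IDs as strings so they are not converted to float64. Whether that suits a given table depends on how the column is used elsewhere.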

Learn More: