What types of data can Kumo ingest?
Kumo ingests data through connectors, which you can configure for the following sources:
- AWS S3: CSV or Parquet files
- Snowflake: Tables and views
- Databricks: Tables and views
- Google Cloud BigQuery: Tables
You can also upload a local CSV or Parquet file (less than 1 GB) for ingestion into Kumo. In this case, skip connector creation and create a Kumo table directly by selecting Local File Upload.
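Before choosing Local File Upload, it can help to check that a file meets the constraints above. The following is a minimal sketch, not part of any Kumo API: the function name `can_upload_locally`, the suffix check, and the reading of "1 GB" as 10**9 bytes are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: "1 GB" is read as 10**9 bytes; the exact limit Kumo enforces may differ.
MAX_UPLOAD_BYTES = 1_000_000_000
# Assumption: the file format is judged by its extension only.
ALLOWED_SUFFIXES = {".csv", ".parquet"}


def can_upload_locally(path, size_bytes=None):
    """Return True if the file looks eligible for Local File Upload.

    `size_bytes` may be passed explicitly; otherwise it is read from disk.
    """
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_SUFFIXES:
        return False
    if size_bytes is None:
        size_bytes = p.stat().st_size
    return size_bytes < MAX_UPLOAD_BYTES


# Hypothetical file names, with sizes supplied directly so nothing is read from disk.
print(can_upload_locally("events.parquet", size_bytes=5_000))       # True
print(can_upload_locally("events.json", size_bytes=5_000))          # False
print(can_upload_locally("big.csv", size_bytes=2_000_000_000))      # False
```

Files that fail this check would need to go through one of the connectors instead.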
When you create Kumo table columns, Kumo automatically preprocesses several data types, including:
- Numerical: Integers, floats
- Categorical: Boolean values, or string values that are typically a single token in length
- Text: String values typically multiple tokens in length, where the actual language content of the value has semantic meaning
- Multi-categorical: Concatenation of multiple categories under a single string representation
- ID: Numerical values used to uniquely identify different entities
- Timestamp: Time/date information (for extracting year/month/date/hour/minute when applicable)
- Embeddings: Lists of floats, all of equal length, typically the output of another AI model
Kumo detects column types automatically, using heuristics on the distribution of values in each column, and you can also configure them manually.
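To give a feel for what value-based type detection can look like, here is a minimal sketch in plain Python. It is not Kumo's actual heuristic: the function `guess_column_type`, the `|` separator for multi-categorical values, the two-token threshold for text, and the "all-unique integers look like IDs" rule are assumptions chosen only for illustration.

```python
from datetime import datetime


def _is_number(v):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def _is_timestamp(v):
    if isinstance(v, datetime):
        return True
    if not isinstance(v, str):
        return False
    try:
        datetime.fromisoformat(v)
        return True
    except ValueError:
        return False


def guess_column_type(values, sep="|"):
    """Guess a column type from its values (illustrative heuristics only)."""
    vals = [v for v in values if v is not None]
    if not vals:
        return "unknown"
    # Embeddings: non-empty lists of numbers, all of equal length.
    if all(isinstance(v, (list, tuple)) and v and all(_is_number(x) for x in v)
           for v in vals):
        return "embedding" if len({len(v) for v in vals}) == 1 else "unknown"
    if all(isinstance(v, bool) for v in vals):
        return "categorical"
    if all(_is_timestamp(v) for v in vals):
        return "timestamp"
    if all(_is_number(v) for v in vals):
        # Assumption: integers that never repeat look like identifiers.
        if all(isinstance(v, int) for v in vals) and len(set(vals)) == len(vals):
            return "id"
        return "numerical"
    if all(isinstance(v, str) for v in vals):
        if any(sep in v for v in vals):
            return "multi-categorical"
        avg_tokens = sum(len(v.split()) for v in vals) / len(vals)
        # Assumption: more than two tokens on average suggests free text.
        return "text" if avg_tokens > 2 else "categorical"
    return "unknown"


print(guess_column_type([1.5, 2.0, 3.25]))                  # numerical
print(guess_column_type(["red", "blue", "red"]))            # categorical
print(guess_column_type(["a|b", "c", "a|c|d"]))             # multi-categorical
print(guess_column_type([[0.1, 0.2], [0.3, 0.4]]))          # embedding
```

Heuristics like these can misfire. For example, a short column of distinct integer ages would be guessed as an ID here, which is the kind of case where manual configuration is useful.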