What types of data can Kumo ingest?

Kumo ingests data through connectors, which you can configure for the following source types:

  • AWS S3: CSV or Parquet files
  • Snowflake: Tables and views
  • Databricks: Tables and views
  • Google Cloud BigQuery: Tables

You can also upload a local CSV or Parquet file smaller than 1 GB for ingestion into Kumo. In this case, you can skip connector creation and create a Kumo table directly by selecting Local File Upload.
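Before uploading, it can help to check a file against the two constraints stated above (CSV or Parquet, less than 1 GB). The following is a minimal sketch, not part of Kumo: the function name `check_local_upload` is hypothetical, and it assumes that "1 GB" means 10**9 bytes and that the format can be judged from the file extension.

```python
import os

# Assumption: "1 GB" is treated as 10**9 bytes. The documentation does not say
# whether the limit is decimal (10**9) or binary (2**30).
MAX_UPLOAD_BYTES = 10**9

# Assumption: the file format is inferred from the extension alone.
ALLOWED_EXTENSIONS = {".csv", ".parquet"}


def check_local_upload(path, max_bytes=MAX_UPLOAD_BYTES):
    """Return a list of problems that would likely block a local file upload.

    An empty list means the file passes both checks. This is a hypothetical
    helper for illustration, not a Kumo API.
    """
    problems = []

    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension {ext!r}; expected CSV or Parquet")

    if not os.path.isfile(path):
        problems.append("file does not exist")
    elif os.path.getsize(path) >= max_bytes:
        size = os.path.getsize(path)
        problems.append(f"file is {size} bytes; must be less than {max_bytes}")

    return problems
```

Calling `check_local_upload("transactions.parquet")` on an existing file under the limit returns an empty list. Files that fail the size check would need to go through one of the connectors listed above instead.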

Kumo automatically preprocesses several data types when creating your Kumo table columns, including:

  • Numerical: Integers, floats
  • Categorical: Boolean or string values, typically a single token in length
  • Text: String values, typically multiple tokens in length, where the actual language content of the value has semantic meaning
  • Multi-categorical: Concatenation of multiple categories under a single string representation
  • ID: Numerical values used to uniquely identify different entities
  • Timestamp: Time/date information (year, month, day, hour, and minute are extracted when applicable)
  • Embeddings: Lists of floats, all of equal length, typically the output of another AI model

Column types are automatically detected using heuristics on the distribution of values in each column’s data, and can also be manually configured.
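To make the idea of heuristic type detection concrete, here is a small illustrative sketch. It is not Kumo's actual detection logic, which is not documented here. The function `guess_column_type`, the `|` separator for multi-categorical values, the 1.5-token threshold, and the "unique integers look like IDs" rule are all assumptions chosen for the example.

```python
from datetime import datetime


def _is_timestamp(text):
    """Return True if the string parses as an ISO 8601 date or datetime."""
    try:
        datetime.fromisoformat(text)
        return True
    except ValueError:
        return False


def guess_column_type(values, sep="|"):
    """Guess a column type from sample values.

    Illustrative heuristics only; every rule below is an assumption.
    """
    values = [v for v in values if v is not None]
    if not values:
        return "unknown"

    # Embeddings: lists of floats, all of equal length.
    if all(isinstance(v, (list, tuple)) for v in values):
        same_length = len({len(v) for v in values}) == 1
        all_floats = all(isinstance(x, float) for v in values for x in v)
        return "embedding" if same_length and all_floats else "unknown"

    # Booleans are checked before numbers because bool is a subclass of int.
    if all(isinstance(v, bool) for v in values):
        return "categorical"

    if all(isinstance(v, (int, float)) for v in values):
        # Crude rule: integers that never repeat look like identifiers.
        all_ints = all(isinstance(v, int) for v in values)
        if all_ints and len(set(values)) == len(values):
            return "ID"
        return "numerical"

    if all(isinstance(v, str) for v in values):
        if all(_is_timestamp(v) for v in values):
            return "timestamp"
        if any(sep in v for v in values):
            return "multi-categorical"
        # Single-token strings read as categories, longer ones as free text.
        avg_tokens = sum(len(v.split()) for v in values) / len(values)
        return "categorical" if avg_tokens <= 1.5 else "text"

    return "unknown"
```

For example, `guess_column_type(["action|comedy", "drama"])` falls into the multi-categorical branch, while `guess_column_type(["red", "blue", "red"])` is treated as categorical. A rule like the ID check can misfire on small samples (three distinct ages would also look like IDs), which is one reason manual configuration of column types is useful.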
