Feature Summary

A one-pager of Kumo's key features and benefits.

Kumo is a platform for making the deployment of state-of-the-art predictions in production dramatically faster and easier for users across the enterprise.

Using the latest Graph Machine Learning (ML) techniques and Kumo’s easy-to-use declarative Predictive Query (pQuery) language, users can execute pQueries that automate the major steps of a typical ML pipeline: target label engineering, feature engineering, model architecture selection, hyperparameter search, and model deployment. Each step is optimized for Graph Neural Network (GNN) approaches. At every step, ML engineers and practitioners can access more granular controls and configurations to customize the ML pipeline to their domain-specific requirements.

Deployment Options

Kumo offers the following benefits as a managed SaaS platform:

  • No virtual or physical hardware to install or manage
  • No software to install or manage
  • Kumo handles ongoing maintenance, management, upgrades, and tuning

Kumo also offers an early access program with options to deploy within your Databricks or Snowflake environments. Please reach out to your Kumo Success Manager for more details.

This feature summary covers features available in the Kumo graphical web UI, as well as the REST API (details under Production Use and Support below).

Connecting Data Sources

To start ingesting your data into Kumo, you will first configure connectors to your data sources, including any required access credentials. This will allow you to create tables in Kumo for building your pQueries.

Kumo connectors are currently available for:

  • AWS S3 - CSV or Parquet
  • Snowflake - Tables and Views
  • Google Cloud BigQuery - Tables

You also have the option of uploading a local file (CSV or Parquet files less than 1 GB) for ingestion into Kumo. In this case, you can skip connector creation and create a Kumo table directly by selecting Local File Upload.

Validating Connected Tables

Kumo offers a full range of visualizations and statistics for your connected tables, including the following:

  • Number of entries
  • Number of unique values
  • Percentage of missing values
  • Mean, minimum, maximum, and 25th/50th/75th percentiles (numeric only)
  • Visualization of the value distribution for categorical, numerical, and timestamp data
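
As a rough illustration of the statistics listed above, the following Python sketch computes them for a single column using only the standard library. It is not Kumo's implementation: the function name summarize_column, the use of None for missing values, the choice to exclude missing values from the unique count, and the "inclusive" percentile method are all assumptions made for this example.

```python
import statistics


def summarize_column(values):
    """Summarize one column; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    summary = {
        "num_entries": len(values),
        "num_unique": len(set(present)),
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values) if values else 0.0,
    }
    # Treat the column as numeric only if every present value is an int or float.
    numeric = [v for v in present if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if numeric and len(numeric) == len(present):
        summary["mean"] = statistics.mean(numeric)
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
        if len(numeric) >= 2:
            # Quartiles; other percentile conventions give slightly different values.
            p25, p50, p75 = statistics.quantiles(numeric, n=4, method="inclusive")
            summary["p25"], summary["p50"], summary["p75"] = p25, p50, p75
    return summary


prices = [10.0, 20.0, None, 30.0, 40.0, 50.0]
stats = summarize_column(prices)
print(stats)
```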

Data Preparation

Kumo automatically preprocesses several data types when creating your Kumo table columns, including:

  • Numerical: Integers, floats
  • Categorical: Boolean or string values typically a single token in length
  • Text: String values typically multiple tokens in length, where the actual language content of the value has semantic meaning
  • Multi-categorical: Concatenation of multiple categories under a single string representation
  • ID: Numerical values used to uniquely identify different entities
  • Timestamp: Time/date information (for extracting year/month/day/hour/minute when applicable)
  • Embeddings: Lists of floats, all of equal length, typically the output of another AI model

Column types are automatically detected using heuristics on the distribution of values in each column’s data, and can also be manually configured.
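
Kumo's actual detection heuristics are not described here. Purely to make the idea concrete, the sketch below guesses a column type from its values using a few simple rules. Every rule, the function name guess_column_type, and the token threshold are hypothetical, and multi-categorical detection is left out.

```python
from datetime import datetime


def _is_timestamp(text):
    """Return True if the string parses as an ISO 8601 date or datetime."""
    try:
        datetime.fromisoformat(text)
        return True
    except ValueError:
        return False


def guess_column_type(values, text_token_threshold=3):
    """Guess a column type from its values (hypothetical rules, not Kumo's)."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    if all(isinstance(v, bool) for v in present):
        return "categorical"
    if all(isinstance(v, list) for v in present):
        # Embeddings: lists of floats that all share one length.
        same_length = len({len(v) for v in present}) == 1
        all_floats = all(isinstance(x, float) for v in present for x in v)
        return "embedding" if same_length and all_floats else "unknown"
    if all(isinstance(v, (int, float)) for v in present):
        # Integers that never repeat look like identifiers.
        all_ints = all(isinstance(v, int) for v in present)
        if all_ints and len(set(present)) == len(present):
            return "ID"
        return "numerical"
    if all(isinstance(v, str) for v in present):
        if all(_is_timestamp(v) for v in present):
            return "timestamp"
        # Long strings are treated as free text, short ones as categories.
        mean_tokens = sum(len(v.split()) for v in present) / len(present)
        return "text" if mean_tokens > text_token_threshold else "categorical"
    return "unknown"


print(guess_column_type(["red", "blue", "red"]))
print(guess_column_type(["2024-01-05", "2024-02-01"]))
```

Rules like these misfire on small or unusual samples, which is one reason manual configuration of column types matters.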

Kumo Views

If you need more fundamental data preparation within Kumo (e.g., creating columns of custom aggregations that are difficult to compute in upstream ETL pipelines), you can create SQL Views on top of the tables you have already connected.

Kumo Graph Creation

Kumo lets you link tables from the same or different connected data sources into a unified Kumo graph, based on exact-match links between primary and foreign keys. Kumo suggests links automatically by exact-matching column names across the selected tables.

Upon graph creation, Kumo provides approximate statistics on linkage health, based on the percentage/count of times the keys in one table appear in the other linked table.
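
One way to picture a linkage health statistic is as the share of foreign key values that find a match among the primary keys of the linked table. The sketch below computes that share exactly for small in-memory lists. The function name linkage_health and the handling of missing keys are assumptions for illustration. Kumo reports approximate statistics, and its exact method is not shown here.

```python
def linkage_health(foreign_keys, primary_keys):
    """Count how many non-missing foreign keys appear among the primary keys."""
    primary = set(primary_keys)
    present = [k for k in foreign_keys if k is not None]
    matched = sum(1 for k in present if k in primary)
    pct = 100.0 * matched / len(present) if present else 0.0
    return {"matched": matched, "total": len(present), "pct_matched": pct}


user_ids = [1, 2, 3]                  # primary key column of a users table
purchase_user_ids = [1, 1, 2, 4]      # foreign key column of a purchases table
health = linkage_health(purchase_user_ids, user_ids)
print(health)
```

A low match percentage usually points to a wrong link, mismatched key formats, or missing rows in the primary table.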

Kumo supports graphs of up to 50 billion rows and 838 TB in total size across all tables. Please refer to the quotas page for more detailed information.

Kumo pQueries

The heart of the Kumo platform is the pQuery, a declarative interface for defining your predictions on your Kumo graph.

At minimum, a pQuery requires two components:

  1. A PREDICT clause that defines the target you want to predict (e.g., future sales), and
  2. A FOR EACH clause that specifies the entity of your prediction (e.g., customers, stores).

Possible targets include:

  • A static attribute of an entity that happens to be missing in the graph (like dietary preference of a customer)
  • An aggregation over future facts tied to an entity (e.g., sum of purchase values)
  • Future interactions between this entity and other entities (e.g., items a user might buy)
  • Relational operators applied to the above (e.g., whether sum of sales is greater than zero)

For example, the following pQuery predicts user lifetime value (LTV) based on a 30-day sum of purchases:

PREDICT SUM(Purchases.price, 0, 30)
FOR EACH Users.ID

You can also create additional pQuery filters to make the predictions more specific to your downstream use cases. The following are some of the possibilities:

  • Filters to narrow down which set of entities you want to make predictions on
  • Filters to narrow down which fact rows you want to aggregate over for your target calculation
  • Imposed assumptions on events that may happen in the future for what-if scenario planning

For example, a pQuery that predicts the LTV of active users (defined as having at least one view in the last 60 days), assuming a coupon is given in the next week, would look like the following:

PREDICT SUM(Purchases.price, 0, 30)
FOR EACH Users.ID WHERE COUNT(Views.*, -60, 0) > 0
ASSUMING COUNT(Coupons.*, 0, 7) > 0
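
To make the time windows in these pQueries concrete, the following Python sketch computes, for one anchor date, the value that SUM(Purchases.price, 0, 30) would take for each user who passes the COUNT(Views.*, -60, 0) > 0 filter. It illustrates the semantics only and is not how Kumo evaluates PQL. The row layout, the helper names, and the choice to treat each window as open at the start and closed at the end are assumptions, and the ASSUMING clause is not modeled.

```python
from datetime import datetime, timedelta


def window_rows(rows, user_id, anchor, start_days, end_days):
    """Rows for one user whose time falls in (anchor+start, anchor+end]."""
    lo = anchor + timedelta(days=start_days)
    hi = anchor + timedelta(days=end_days)
    return [r for r in rows if r["user_id"] == user_id and lo < r["time"] <= hi]


def sum_target(purchases, user_id, anchor):
    """Analogue of SUM(Purchases.price, 0, 30) at the given anchor time."""
    return sum(r["price"] for r in window_rows(purchases, user_id, anchor, 0, 30))


def is_active(views, user_id, anchor):
    """Analogue of COUNT(Views.*, -60, 0) > 0 at the given anchor time."""
    return len(window_rows(views, user_id, anchor, -60, 0)) > 0


anchor = datetime(2024, 1, 1)
users = [1, 2]
purchases = [
    {"user_id": 1, "time": datetime(2024, 1, 10), "price": 20.0},
    {"user_id": 1, "time": datetime(2024, 2, 20), "price": 99.0},  # after the 30-day window
    {"user_id": 2, "time": datetime(2024, 1, 5), "price": 5.0},
]
views = [{"user_id": 1, "time": datetime(2023, 12, 15)}]  # user 2 has no recent views

labels = {u: sum_target(purchases, u, anchor) for u in users if is_active(views, u, anchor)}
print(labels)
```

User 2 has a purchase inside the target window but is dropped by the filter, because the filter looks only at the 60 days before the anchor.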

If the available PQL features are not expressive enough for your use case, you can also use Kumo's SQL View functionality to craft more bespoke target definitions.

Please see Kumo's pQuery Reference for more about PQL and writing pQueries.


pQuery Creation

Once your pQuery is defined, the following ML processes are automatically taken care of for you:

  • Training example and associated target label creation
  • Defining appropriate portions of training examples to hold out from training data to use in final model evaluation
  • Experiment orchestration for determining the right GNN architectures, hyperparameters, and features to use across graphs for ML model training
  • Point-in-time correctness verification and information leakage prevention

Kumo trains your pQuery over many historical slices of data by generating training examples over chunks of historical time frames. These examples are learned one time frame at a time, with statistics for each time frame available during processing so that you can validate the target label distribution.
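
The idea of slicing history into time frames while keeping point-in-time correctness can be sketched as follows. For each anchor time, only events at or before the anchor may feed the model's inputs, and only events in the window after the anchor may feed the target label. The function names, the fixed non-overlapping step, and the event layout are assumptions for illustration. Kumo's actual slicing strategy is not documented here.

```python
from datetime import datetime, timedelta


def anchor_times(first_event, last_event, horizon_days):
    """Anchors spaced one horizon apart whose label window fits inside the data."""
    horizon = timedelta(days=horizon_days)
    anchors = []
    anchor = first_event
    while anchor + horizon <= last_event:
        anchors.append(anchor)
        anchor += horizon
    return anchors


def split_at_anchor(events, anchor, horizon_days):
    """Separate events into model inputs (history) and label inputs (future)."""
    end = anchor + timedelta(days=horizon_days)
    history = [e for e in events if e["time"] <= anchor]
    future = [e for e in events if anchor < e["time"] <= end]
    return history, future


events = [
    {"time": datetime(2024, 1, 10), "price": 20.0},
    {"time": datetime(2024, 2, 10), "price": 35.0},
    {"time": datetime(2024, 3, 10), "price": 15.0},
]
anchors = anchor_times(datetime(2024, 1, 1), datetime(2024, 4, 1), horizon_days=30)
for anchor in anchors:
    history, future = split_at_anchor(events, anchor, horizon_days=30)
    print(anchor.date(), len(history), "history events,", len(future), "label events")
```

Because no event can appear in both lists for the same anchor, information from the label window cannot leak into the inputs.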

To find the best set of parameters for your pQuery, Kumo uses AutoML to search over a wide space of GNN options, including neural network layer type and count, normalization, and hidden dimension size. Kumo improves the efficiency of this search by leveraging aggregated and anonymized learnings across all Kumo-trained ML pipelines.

Model Planning

Kumo's Model Planner provides advanced ML practitioners with fine-grained control over encoders, training strategy, and the AutoML search space. Possible AutoML search space customizations include:

  • Training budget/early stopping aggressiveness
  • Sampling of training examples of different classes (to address imbalance)
  • The evaluation metric to tune on
  • How to encode individual columns (for example normalization of numeric values, one hot versus indexed representation of classes, which word embedding to use for text, etc.)
  • How many hops of neighbors to sample input from
  • Whether to support prediction on entities and targets that were not seen at training time
  • Relevant model architecture and optimizer hyperparameters
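
The column encoding choices mentioned in the list above (normalizing numeric values, and one-hot versus indexed representations of classes) can be illustrated with a few lines of plain Python. These are generic textbook encodings written for this example, not Kumo's encoders or their configuration names.

```python
import statistics


def standardize(values):
    """Scale numeric values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


def index_encode(values):
    """Map each class to an integer index; returns the codes and the vocabulary."""
    vocab = {c: i for i, c in enumerate(sorted(set(values)))}
    return [vocab[v] for v in values], vocab


def one_hot_encode(values):
    """Map each class to a vector with a single 1 at the class index."""
    codes, vocab = index_encode(values)
    return [[1 if j == code else 0 for j in range(len(vocab))] for code in codes]


colors = ["b", "a", "b"]
codes, vocab = index_encode(colors)
one_hot = one_hot_encode(colors)
z_scores = standardize([1.0, 2.0, 3.0])
print(codes, vocab, one_hot, z_scores)
```

An indexed representation stays compact when a column has many classes and is typically paired with a learned embedding, while one-hot vectors grow with the number of classes.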

These features leverage the capabilities of PyTorch Geometric (PyG), the most popular open source framework for creating GNNs, founded by Kumo members.

Model Training

Advanced ML practitioners can view training data statistics and track the progress of each experiment (e.g., loss and validation metric curves) in real time, as well as view the hyperparameters used for each experiment. After training, experiment results and Model Planner details for each experiment are available on the pQuery's Training Summary page.

pQuery Evaluation & Understanding

The metrics Kumo uses to measure your pQuery's accuracy on the evaluation data split vary depending on the target type. The following metrics are currently included:

  • Receiver Operating Characteristic chart & AUROC
  • Precision-recall Curve chart & AUPRC
  • Confusion matrix
  • Cumulative Gain chart
  • MAE/MSE/RMSE/SMAPE
  • A histogram of predicted values alongside actual target labels
  • F1@K, MAP@K, Precision@K, Recall@K; for K = 1, 10, and 100
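
For readers less familiar with the ranking metrics in the last item, the sketch below computes Precision@K, Recall@K, F1@K, and average precision at K for a single entity. MAP@K is the mean of the average precision values across entities. Definitions vary between libraries, especially the denominator used for average precision. The convention used here, min(number of relevant items, K), is one common choice and is not confirmed to be the one Kumo uses.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def f1_at_k(ranked, relevant, k):
    """Harmonic mean of Precision@K and Recall@K."""
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision_at_k(ranked, relevant, k):
    """Average of the precision values at each rank where a hit occurs."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(len(relevant), k)
    return total / denom if denom else 0.0


ranked_items = ["a", "b", "c", "d"]   # model's ranking, best first
relevant_items = {"a", "c", "e"}      # items the user actually interacted with
for name, fn in [("P@3", precision_at_k), ("R@3", recall_at_k),
                 ("F1@3", f1_at_k), ("AP@3", average_precision_at_k)]:
    print(name, round(fn(ranked_items, relevant_items, 3), 4))
```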

To help you understand the behavior of your pQuery, Kumo also allows you to explore (in visual charts):

  • Contribution Score: The (absolute) contribution of each table, and of each column within each table, to the final predictions
  • Column Analysis: The directional effect (positive or negative) that the range of values within each column has on the final predictions

Batch Prediction Outputs

Batch predictions can either output the prediction values or the numerical embeddings representing the entities in the predictive query.

You have the option to apply filters while generating new batch predictions in order to get predictions for a subset of data that is relevant to your downstream pipeline and business use case. In addition, Kumo allows specifying the number of parallel workers (up to 4) that work on your prediction task, though the actual number of workers that generate the predictions is decided based on the dataset size.

All outputs for classification tasks are calibrated using Platt scaling so that prediction scores match the actual probabilities of the predicted outcome. Kumo automatically refreshes underlying table and graph data when a model is trained, re-trained, or when a batch prediction is created.
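
Platt scaling fits a sigmoid that maps raw model scores to probabilities using held-out labeled examples. The sketch below is a simplified version that fits p = sigmoid(a * score + b) by plain gradient descent on the log loss. Platt's original method also smooths the target labels and uses a more robust optimizer. The function names, learning rate, step count, and toy data are assumptions, and this is not Kumo's calibration code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit a and b so that sigmoid(a * score + b) approximates P(label = 1)."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s   # derivative of the log loss with respect to a
            grad_b += (p - y)       # derivative of the log loss with respect to b
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy held-out set: raw scores and true binary labels (not perfectly separable).
raw_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
true_labels = [0, 0, 1, 0, 1, 1]

a, b = fit_platt(raw_scores, true_labels)
calibrated = [sigmoid(a * s + b) for s in raw_scores]
print([round(p, 3) for p in calibrated])
```

Because the fitted slope is positive, calibration preserves the ranking of the scores and changes only how they read as probabilities.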

Production Use and Support

Kumo supports automated production use cases through its REST API, which provides:

  • CRUD operations for connectors, tables, graphs, and predictive queries
  • Retraining an existing pQuery
  • Starting a new batch prediction job
  • Listing, getting the status of, canceling, and deleting training and batch prediction jobs

You can kick off up to 10 asynchronous jobs (training or batch prediction); these are queued and run sequentially as older jobs complete.

Batch predictions can be configured to output to an S3 bucket, or downloaded to your local machine from the batch prediction's detail page.

Finally, to help you ensure the resilience of your production deployments, we help with:

  • Detecting stale pQueries
  • Detecting when upstream fact table pipelines have broken

Security & Compliance

Kumo is SOC 2 Type 1 and Type 2 compliant.