
Business Operations

Query the Future

Data science teams deal with a huge variety of predictive tasks that improve efficiency and operations, saving millions of dollars in time and waste. Opportunities to drive impact include supply chain optimization, demand forecasting, pricing, revenue recapture, customer claim classification, inventory optimization, and predictive maintenance. With so many burning needs and competing priorities, data scientists can easily get stretched too thin.

Kumo comes to the rescue of data science teams by eliminating the hassle of traditional predictive modeling. With no feature engineering or model architecture development required, data science teams can use Kumo to multiply the impact of each individual.

The key innovation of Kumo is the predictive query language, which translates declarative problem statements into Graph Neural Networks (GNNs) that run directly on relational data. Kumo also works with Large Language Models (LLMs), enabling teams to work with unstructured text data, and build explainable, chat-powered interfaces for predictions.

Finally, since nobody likes maintaining complex infrastructure and data pipelines, Kumo can work directly on the relational data in data warehouses. Data Warehouse Native deployments process data within your Snowflake or Databricks accounts, minimizing the need for time-consuming security reviews and ensuring your data remains secure and compliant.

Solutions

Kumo’s predictive query language (PQL) enables data scientists to specify the entity, target, filters, and optimization goal of a predictive task in a purely declarative manner.

In practice, this enables teams to quickly obtain a wide spectrum of predictive results that help drive more efficient operations.
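For example, a churn-style task can be expressed in a single statement that names the entity, the target, and the prediction window. This is a sketch; the table and column names are illustrative, not from a real schema:

PREDICT COUNT(orders.order_id, 0, 30) = 0
FOR EACH customer.customer_id

Read aloud: for each customer, predict whether they will place zero orders in the next 30 days.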

Here are a few of the solutions that Kumo supports.

Kumo is a good fit for…

  • Data science and machine learning teams looking to build automated business operation pipelines based on predicted outcomes
  • Business analysts and operations managers who want to obtain the best-performing predictive outcomes flexibly, without the help of data scientists or ML engineers
  • Predictive tasks that must consider many dependencies across various entities and draw on multiple datasets
  • Businesses with unique requirements that need to reflect their own business processes and constraints beyond industry standards
  • Engineering teams looking to improve their existing automated systems with better models and more data
  • Organizations looking to invest in a general machine learning platform to uplevel a wide variety of solutions beyond predictive analytics, including recommendations and fraud detection

Stand-out Features

  • PQL provides a SQL-like language for predictive tasks, eliminating the need for feature engineering and model development pipelines.
  • Kumo connectors enable you to connect multiple datasets you own, while you retain full control over the data.
  • The Kumo graph represents the dependencies among the various entities of your business, reflected directly from your relational data.
  • Multiple predictive queries can share the same underlying Kumo graph and data.
  • GNNs trained on your data, combined with LLMs trained on public data, deliver strong performance by leveraging both your domain knowledge (backed by GNNs and your data) and general knowledge (backed by LLMs).
  • Prediction filters enable predictive tasks to incorporate business logic flexibly. For example, Kumo can predict demand only for items released in the last month.
  • Kumo explainability provides insights that make predictions understandable and actionable for humans.
  • The GNN model planner enables data scientists to tune the model architecture for the dataset, including the training split, neighborhood sampling strategy, and model hyperparameters.
  • Support for visual signals (image-based) and LLM-powered text understanding.
  • The REST API and SDK allow data scientists and ML engineers to develop, test, and deploy models directly from notebooks or as part of an automated workflow.
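As an illustration of prediction filters, the demand example mentioned above, restricting forecasts to recently released items, might look like this in PQL (a sketch; the table names, column names, and date are illustrative):

PREDICT SUM(sales.quantity, 0, 7)
FOR EACH item.item_id WHERE item.release_date >= '2024-06-01'

The WHERE clause on the FOR EACH entity restricts which items receive predictions, without changing the model's training target.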

Data Requirements

Kumo does not require data to be transformed to fit a prescribed schema.

Instead, Kumo produces predictions directly from the raw data that already exists in the data warehouse. The Kumo graph builder makes it easy to stitch together data from many different sources. Just connect the tables to the graph and go.

For the best predictive results, Kumo encourages using data such as:

  • Customer profile information
  • Operation outcome (such as on-time vs. late) history
  • Operation status change history (such as in-transit, submitted)
  • Backbone network data (e.g. supply-chain route)
  • Promotion and discount history
  • External data, such as weather forecasts, that can impact prediction performance

By using all of your data, Kumo can achieve better prediction quality than solutions that can only use a subset.

Data Connectivity

Kumo reads and writes data directly in the client's data lakehouse, supporting cloud-first data science workflows. Supported lakehouses include Snowflake, Databricks, AWS S3, GCP BigQuery, and Azure Synapse (coming soon). For example, users have found success using Kumo as part of a dbt-based development environment, with Airflow for orchestration and Streamlit for consumption.

Data Warehouse Native

Additionally, Kumo provides data warehouse native deployment options that keep your data secure by performing data processing within your Snowflake or Databricks account. This makes Kumo suitable for use in highly regulated environments, including banking, healthcare, and government.

Scale

Kumo uses a distributed GNN training system that can handle multi-terabyte datasets with tens of billions of rows. This system currently supports users making daily recommendations for more than 100M active users or more than 10M inventory items. Kumo is also well suited to small datasets containing thousands of users and only tens of items, as GNNs excel at discovering complex patterns in sparse data.

Operational Serving

While Kumo supports a variety of serving methods for prediction outcomes, batch export to the data warehouse is the most common for predictive operations:

  • Batch Export: Export bulk predictions to the warehouse via UI or API. This produces all prediction scores in a form that is easily fed into other automated pipelines or data analytics.
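As a rough sketch of how a batch export might be scripted, the snippet below constructs a hypothetical export-job request. The endpoint shape, field names, and identifiers are all illustrative assumptions, not the actual Kumo REST API schema:

```python
import json

# Hypothetical batch-export job: every field name below is illustrative,
# not the real Kumo REST API schema.
export_job = {
    "prediction_job_id": "demand_forecast_v2",  # a completed batch prediction job
    "destination": {
        "connector": "snowflake",               # target warehouse connector
        "table": "ANALYTICS.PREDICTIONS.DEMAND_NEXT_14D",
    },
    "columns": ["entity_id", "score"],          # prediction columns to export
}

# Serialize for an HTTP POST to the (hypothetical) export endpoint.
payload = json.dumps(export_job)
```

Once landed in the warehouse table, the scores can be joined into downstream operational pipelines like any other table.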

Model Architecture

Kumo predictions are powered by a GNN architecture inspired by several recent academic papers. Data scientists benefit from these advances in model architecture without needing to code them up manually.

Here is some of the research that is used by Kumo AI:

  • GraphSAGE does both transductive and inductive representation learning to deliver great predictions for entities with various dependencies and histories
  • ID-GNN is a training process that enables the model to learn patterns such as repeat purchase or brand affinity
  • PNA introduces a variety of aggregation operators which are explored by Kumo AutoML
  • GCN describes mean-pooling aggregation, which captures similarity between entities with similar historical patterns
  • GIN captures frequency signal to learn more complex behavior like frequent operation events vs. rare events
  • NBF networks reduce the computational cost of models, by providing an efficient way to capture paths between nodes
  • GraphMixer uses temporal representation learning, to interpret sequences of operation events
  • RDL describes temporal sampling, which learns from past sequences of events to predict the future.

Data Encoding

Kumo also uses a powerful data encoding stack to convert multi-modal data into representations for deep learning.

  • PyTorch Frame finds the best encoding for a variety of tabular data types.
  • LLM foundation models can be used for understanding rich text data
  • Absolute and relative time encodings learn historical and seasonal patterns
  • Image Pixel Data can be used via image encoders such as CLIP

Model Planner

The Kumo model planner empowers data scientists to quickly iterate and apply their domain knowledge to the model.

Specifically, the Kumo model planner gives control over:

  • Training table splits
  • Neighborhood sampling
  • Column encoding
  • Training process
  • GNN architecture
  • Optimization goals

Predictive Query Language

PQL is a declarative syntax for defining machine learning problems. It is highly flexible and easy to learn, supporting inline filters, boolean expressions, and aggregation functions.

Data scientists can quickly experiment with many different and complex predictive formulations of a machine learning problem in very few lines of code.

For example, the following PQL statement predicts the number of winter clearance items sold in the US over the next two weeks.

PREDICT COUNT(
  purchases.item_id
    WHERE purchases.markdown > 0.2 AND purchases.country = 'US', 0, 14)
FOR EACH item.item_id WHERE item.season = 'winter'
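Reformulating the problem is a matter of editing a clause. For instance, appending a comparison turns the same regression query into a binary classification task (will an item sell more than 100 units?); the threshold value here is illustrative:

PREDICT COUNT(
  purchases.item_id
    WHERE purchases.markdown > 0.2 AND purchases.country = 'US', 0, 14) > 100
FOR EACH item.item_id WHERE item.season = 'winter'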

Evaluation and Explainability

As part of the training process, Kumo automatically computes data visualizations and metrics to help understand the model’s strengths and weaknesses.

  • Learning Curves and Distribution: Detects underfitting and overfitting by monitoring convergence rates. Tracks the distribution of training data over time to check for balance.
  • Backtesting on Holdout: All models are back-tested on a configurable holdout dataset. The holdout data set may be downloaded for customer analysis.
  • Standard Evaluation Metrics and Charts: Including: ROC and PRC curve, cumulative gain chart, AUPRC, AUROC, predicted vs actual scatter plot and histogram, MAE, MSE, RMSE, SMAPE, average precision, per-category recall, F1, MAP
  • Baseline Comparison: Models are benchmarked against an automatically generated analytic baseline.
  • Column Explainability: A visualization highlighting which columns have the greatest predictive power, helping to confirm that the model is free of data leakage.
  • Row Level Explainability: Users can understand the reason for individual predictions by seeing which rows contributed most to the result.

MLOps

To support ongoing validation of model correctness, Kumo provides the following MLOps features:

  • Data Source Snapshotting: During each job, data source statistics are snapshotted including size, time range, and import time, to enable faster root cause analysis.
  • Drift Detection: Distributions of features and predictions are monitored for drift. This enables early detection of issues, preventing bad predictions from being published to production.
  • Champion / Challenger: A champion/challenger approach can be adopted to validate key metrics of a newly trained model when orchestrating automatic retraining through the REST API.
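The champion/challenger gate can be as simple as comparing a holdout metric before promotion. The sketch below is an illustrative orchestration-side check under assumed metric names; it is not part of the Kumo API:

```python
def should_promote(champion_metric: float, challenger_metric: float,
                   min_gain: float = 0.005) -> bool:
    """Promote the challenger model only if it beats the current champion
    on the holdout metric (e.g. AUROC) by at least min_gain."""
    return challenger_metric >= champion_metric + min_gain

# Example: a retraining workflow would fetch both metrics from evaluation
# results, then publish the challenger's predictions only on promotion.
promote = should_promote(champion_metric=0.912, challenger_metric=0.921)
```

The margin guards against promoting a model whose improvement is within noise; the appropriate value depends on the metric and holdout size.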