Why Kumo for Fraud Detection?

Graph neural networks (GNNs) have recently gained prominence in the fraud research community because they learn relationships between entities and events in massive networks, detecting patterns such as "fraud rings" or "money mules". Large language models (LLMs) have gained prominence as well, because they learn about the world at large and can use that knowledge to reason about individual entities in the data. Kumo is well suited to combating fraud because it brings both GNNs and LLMs to fraud detection.

Kumo makes it easy to develop custom GNN+LLM fraud detection solutions for your particular datasets and problem definitions. From money laundering detection, credit card fraud and insurance fraud prevention, to return fraud detection, Kumo can help your organization address a wide range of fraud and abuse problems. Kumo’s flexible model architecture works particularly well for teams with unique business requirements, including highly imbalanced data, massive scale, data sparsity, complex heterogeneous data (including images and text), and complex business logic. For financial institutions with strict data security requirements, Kumo's secure architecture and data warehouse native deployments allow you to perform data processing within your Snowflake or Databricks accounts.

Fraud Solutions

Kumo’s predictive query language (PQL) provides the flexibility to build a wide variety of fraud detection models. PQL allows data scientists to declare the entity, target, filters, and optimization goal of a predictive task, in a purely declarative manner.

In practice, this enables teams to quickly build GNN-powered fraud and abuse detection models, which can be used as internal-facing tools (e.g. risk dashboards or insurance claim risk scoring), user-facing solutions (e.g. merchant/seller risk scores), and automated solutions (e.g. credit card fraud or fraudulent transaction detection workflows).

Here are a few of the fraud patterns that Kumo can help identify:

Kumo is a good fit for…

  • Data science and machine learning teams looking to build internal and external fraud detection services.
  • Fraud analytics teams that want to use the best-performing fraud detection algorithms in academia and industry (GNNs) to enhance their understanding and detection of fraudulent behavior.
  • Businesses with complex data requirements that typical fraud vendors do not support, such as combining many sources and modalities (image, text, sound, etc.) of data at large scales.
  • Engineering teams looking to improve their existing fraud detection and flagging systems by using graph embeddings or GNN-powered candidate sources.
  • Organizations looking to invest in a general machine learning platform to uplevel a wide variety of solutions beyond fraud, including use cases in personalization, LTV/conversion prediction for user segmentation, demand forecasting, and price optimization.

Stand-out Features

  • A Kumo GNN + LLM fraud detection system gives a 12.7% boost over the best reported GNN method and a 19.6% boost over XGBoost in AUPRC on a popular large-scale anomaly detection dataset.
  • Solutions for both batch and real-time decisions, enabling a single platform to power both automated flagging and manual deep dives.
  • Prediction filters, enabling decisions to be contextualized for specific users or events and to comply with pre-existing business rules.
  • Strong handling of highly imbalanced datasets: GNNs use complex data relationships together with features to make high-quality predictions. Because of this additional signal, they require less data and can achieve a lower false positive rate.
  • The GNN model planner enables data scientists to tune the model architecture for the dataset, including the training split, neighborhood sampling strategy, and model hyperparameters.
  • Support for visual (image-based) signals and NLP-powered text understanding.
  • Ability to export graph embeddings for all entities/transactions/merchants, which can be used as features in your existing fraud detection models or for deep-dive analysis and user segmentation.
  • REST API and SDK, enabling data scientists and ML engineers to develop, test, and deploy fraud detection models directly from notebooks or as part of an automated workflow.

Data Requirements

Kumo does not require data to be transformed to fit a prescriptive schema. Instead, predictions/embeddings are generated directly from the raw data that already exists in your data warehouse. The Kumo graph builder makes it easy to stitch together data from many different sources. Just connect the tables to the graph and go.

The best data to use for training a fraud detection system depends on the type of fraud; however, in general Kumo recommends the following types of data:

  • Customer profile information
  • Transaction information
  • Payment data
  • Merchant data
  • Fraud report data
  • App stream data
  • Auth/unauth browsing history
  • Second and third party data, risk scores, etc.

By using all of your data, Kumo can yield better prediction results compared to solutions that can only use a subset of your data.

Data Connectivity

Kumo reads data directly from, and writes data directly to, the client's data lakehouse, supporting cloud-first data science workflows. Supported lakehouses include Snowflake, Databricks, AWS S3, GCP BigQuery, and Azure Synapse (coming soon). For example, users have found success using Kumo as part of a dbt-based development environment, with Airflow for orchestration and Streamlit for consumption.

Data Warehouse Native

Additionally, Kumo provides data warehouse native deployment options, which keep your data secure by performing data processing within your Snowflake or Databricks account. This makes Kumo suitable for use in highly regulated environments, including banking, healthcare, and government.

Scale

Kumo uses a distributed GNN training system, written in C++, which can handle multi-terabyte datasets with tens of billions of rows. Some Kumo customers make daily predictions for hundreds of millions of users.

Because GNNs are great at discovering complex patterns in sparse data, Kumo is also a good fit for small datasets containing thousands of users and only tens of items.

Production Serving

To cover both in-product and out-of-product fraud detection and analytics workflows, Kumo supports a variety of serving methods:

  • Batch Export: Export bulk predictions via UI or API. This is an easy way to power applications without strict latency requirements, or solutions which require human review.
  • Online Serving: Make predictions as part of transactional workflows, leveraging real-time signals and Kumo’s GNN artifacts. This can be provided as a portable code snippet, or served by Kumo behind a REST API.
  • Embedding Export: Export graph embeddings for use in existing fraud detection systems or to supercharge fraud analysis workflows.
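As a rough illustration of how exported embeddings might support fraud analysis, the sketch below ranks entities by cosine similarity to a known fraudulent account. The embedding values, the entity IDs, and the `most_similar` helper are all hypothetical; the actual export format and dimensionality are not described here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(embeddings, query_id, top_k=2):
    """Return the IDs of the top_k entities closest to query_id."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vector), entity_id)
        for entity_id, vector in embeddings.items()
        if entity_id != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [entity_id for _, entity_id in scored[:top_k]]

# Hypothetical 3-dimensional embeddings; real exports would be much wider.
embeddings = {
    "acct_fraud": [0.9, 0.1, 0.0],
    "acct_1": [0.8, 0.2, 0.1],
    "acct_2": [0.0, 1.0, 0.0],
    "acct_3": [0.1, 0.0, 0.9],
}

# Accounts that sit closest to a confirmed fraud case are candidates for review.
review_queue = most_similar(embeddings, "acct_fraud", top_k=1)
```

The same vectors could instead be appended as extra feature columns to an existing model's input.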

Model Architecture

Kumo models are powered by a GNN architecture inspired by several recent academic papers. Data scientists can benefit from these advances in model architecture without needing to code them up manually.

Here is some of the research that is used by Kumo AI:

  • GraphSAGE does inductive representation learning to deliver great representations for users with very little interaction data
  • ID-GNN is a training process that enables the model to learn patterns such as repeat transactions or merchant affinity
  • PNA introduces a variety of aggregation operators which are explored by Kumo AutoML
  • GCN describes mean-pooling aggregation, which captures similarity between users with similar transaction histories
  • GIN captures frequency signal to learn more complex user behavior
  • NBF networks reduce the computational cost of models, by providing an efficient way to capture paths between nodes
  • GraphMixer uses temporal representation learning, to interpret sequences of user actions such as transaction/order history
  • RDL describes temporal sampling, which learns from past sequences of user actions to predict the future.
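The difference between the aggregation styles mentioned above (mean-style pooling as in GCN, sum-style aggregation as in GIN, and the multiple aggregators combined by PNA) can be seen on a toy example. This is only a sketch of the general idea with made-up transaction amounts, not Kumo's implementation.

```python
def aggregate(neighbor_values, how):
    """Combine a node's neighbor features into a single value."""
    if not neighbor_values:
        return 0.0
    if how == "mean":
        return sum(neighbor_values) / len(neighbor_values)
    if how == "sum":
        return float(sum(neighbor_values))
    if how == "max":
        return float(max(neighbor_values))
    raise ValueError(f"unknown aggregator: {how}")

# Hypothetical transaction amounts attached to two users.
occasional_user = [50.0, 50.0]
frequent_user = [50.0] * 20

# Mean pooling sees two users with the same typical transaction size...
mean_gap = aggregate(frequent_user, "mean") - aggregate(occasional_user, "mean")
# ...while sum aggregation also reflects how often each user transacts.
sum_gap = aggregate(frequent_user, "sum") - aggregate(occasional_user, "sum")
```

Here `mean_gap` is zero while `sum_gap` is large, which is why combining several aggregators can separate behaviors that any single one would conflate.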

Data Encoding

Kumo also uses a powerful data encoding stack to convert multi-modal data into representations for deep learning.

  • PyTorch Frame finds the best encoding for a variety of tabular data types.
  • LLM foundation models can be used for understanding rich text data
  • Absolute and relative time encodings learn historical and seasonal patterns
  • Image Pixel Data can be used via image encoders such as CLIP
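To make the time-encoding bullet concrete, here is a minimal sketch of one common approach: sine/cosine features for seasonal position plus a relative offset from the prediction time. How Kumo actually encodes time is not specified here, so the function names and the choice of features are assumptions made for illustration.

```python
import math
from datetime import datetime

def cyclical_encoding(value, period):
    """Map a periodic value (e.g. hour of day) onto the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return (math.sin(angle), math.cos(angle))

def encode_timestamp(event_time, anchor_time):
    """Seasonal (cyclical) features plus a relative offset from anchor_time."""
    days_before_anchor = (anchor_time - event_time).total_seconds() / 86400.0
    return {
        "hour_of_day": cyclical_encoding(event_time.hour, 24),
        "day_of_week": cyclical_encoding(event_time.weekday(), 7),
        "days_before_anchor": days_before_anchor,
    }

# Hypothetical event: a transaction two days before the prediction time.
features = encode_timestamp(datetime(2024, 1, 1, 6), datetime(2024, 1, 3, 6))
```

The cyclical form keeps 23:00 and 00:00 close together, which a raw hour value would not.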

Benchmarks

On the public DGraph anomaly detection dataset, Kumo gives a 12.7% boost over the best reported GNN method and a 19.6% boost over XGBoost in AUPRC.

You can reproduce this benchmark on Kumo by downloading the DGraph dataset and running the following Predictive Query:

PREDICT users.LABEL == 1
FOR EACH users.USER_ID
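For readers less familiar with the AUPRC metric quoted in this benchmark, the sketch below computes it as average precision over a ranked list of scores. It is a plain-Python illustration with made-up labels and scores, it ignores tied scores, and it is not the evaluation code used to produce the numbers above.

```python
def average_precision(labels, scores):
    """Area under the precision-recall curve, computed as average precision."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive label")
    # Rank examples from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at the rank where this positive is recovered.
            precision_sum += true_positives / rank
    return precision_sum / total_positives

# Hypothetical labels (1 = fraud) and model scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auprc = average_precision(labels, scores)  # (1/1 + 2/3) / 2, about 0.833
```

Because the metric focuses on how well the rare positive class is ranked, it is usually more informative than accuracy on highly imbalanced fraud data.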

Model Planner

The Kumo model planner empowers data scientists to quickly iterate and apply their domain knowledge to the model.

Specifically, the Kumo model planner gives control over:

  • training table splits
  • neighborhood sampling
  • column encoding
  • training process
  • GNN architecture
  • optimization goals

For example, when dealing with complex graph datasets, you can increase the number of hops considered and customize the number of neighbors sampled at each hop, which allows the model to capture more complex relationships. Additionally, using more powerful aggregations allows the model to distinguish between different temporal user activities.
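The effect of the hop count and per-hop sample size can be sketched with a small neighborhood sampler. This is a generic illustration on a made-up graph; the `sample_neighborhood` function and its `fanouts` parameter are hypothetical and do not reflect the model planner's actual option names.

```python
import random

def sample_neighborhood(graph, seed, fanouts, rng):
    """Sample a multi-hop neighborhood around `seed`.

    fanouts[i] is the maximum number of new neighbors sampled per node
    at hop i + 1. Returns one list of sampled nodes per hop.
    """
    visited = {seed}
    frontier = [seed]
    hops = []
    for fanout in fanouts:
        next_frontier = []
        for node in frontier:
            candidates = [n for n in graph.get(node, []) if n not in visited]
            picked = rng.sample(candidates, min(fanout, len(candidates)))
            visited.update(picked)
            next_frontier.extend(picked)
        hops.append(next_frontier)
        frontier = next_frontier
    return hops

# Hypothetical graph: users and merchants linked through transactions.
graph = {
    "user_a": ["txn_1", "txn_2", "txn_3"],
    "txn_1": ["user_a", "merchant_x"],
    "txn_2": ["user_a", "merchant_x"],
    "txn_3": ["user_a", "merchant_y"],
    "merchant_x": ["txn_1", "txn_2", "txn_4"],
    "merchant_y": ["txn_3"],
    "txn_4": ["merchant_x", "user_b"],
    "user_b": ["txn_4"],
}

rng = random.Random(0)
two_hops = sample_neighborhood(graph, "user_a", [3, 1], rng)
four_hops = sample_neighborhood(graph, "user_a", [3, 1, 1, 1], rng)
```

With two hops the sample stops at the merchants; with four hops it reaches `user_b`, who shares a merchant with `user_a`, which is the kind of longer-range link a fraud ring might leave behind.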

Predictive Query

Predictive Query Language (PQL) is a declarative syntax for defining machine learning problems. It is highly flexible and easy to learn, supporting inline filters, boolean expressions, and aggregation functions.

Data scientists can quickly experiment with many different and complex predictive formulations of a machine learning problem in very few lines of code.

Let’s say we’re trying to predict if the user intends to pay off an order in the following 90 days.

The example predictive query below predicts whether the last payment in a 90-day window will leave the outstanding balance at 0, for each order placed in the US with a value of over $200:

PREDICT LAST(payments.amount, 0, 90, days) == 0
FOR EACH order.order_id
WHERE order.location == 'US' AND order.amount > 200
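For comparison, here is a rough hand-written Python reading of the label this query declares. It rests on several assumptions that are not confirmed here: the window is measured forward from a caller-supplied `anchor_time` and covers (anchor, anchor + 90 days], `LAST` picks the most recent payment in that window, and orders that fail the filter or have no payments in the window are skipped. The record layout is hypothetical.

```python
from datetime import datetime, timedelta

def label_for_order(order, payments, anchor_time, window_days=90):
    """Return True/False for the target, or None if the order is skipped."""
    # WHERE order.location == 'US' AND order.amount > 200
    if order["location"] != "US" or order["amount"] <= 200:
        return None
    window_end = anchor_time + timedelta(days=window_days)
    in_window = [
        p for p in payments
        if p["order_id"] == order["order_id"]
        and anchor_time < p["time"] <= window_end
    ]
    if not in_window:
        return None  # assumption: no payments in the window means no label
    # LAST(payments.amount, 0, 90, days) == 0
    last_payment = max(in_window, key=lambda p: p["time"])
    return last_payment["amount"] == 0

# Hypothetical data for a single order.
order = {"order_id": 1, "location": "US", "amount": 250}
payments = [
    {"order_id": 1, "time": datetime(2024, 1, 11), "amount": 100},
    {"order_id": 1, "time": datetime(2024, 3, 1), "amount": 0},
    {"order_id": 1, "time": datetime(2024, 6, 1), "amount": 50},  # outside window
]
label = label_for_order(order, payments, anchor_time=datetime(2024, 1, 1))
```

The three-line query replaces this filtering, windowing, and labeling logic, which is what makes it quick to try alternative problem formulations.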

Evaluation & Explainability

As part of the training process, Kumo automatically computes data visualizations and metrics to help understand the model’s strengths and weaknesses.

  • Learning Curves and Distribution: Detects underfitting and overfitting by monitoring convergence rates. Tracks the distribution of training data over time for balance.
  • Backtesting on Holdout: All models are back-tested on a configurable holdout dataset. The holdout dataset may be downloaded for customer analysis.
  • Standard Evaluation Metrics and Charts: ROC and PRC curves, cumulative gain chart, AUPRC, AUROC, predicted vs. actual scatter plot and histogram, MAE, MSE, RMSE, SMAPE, average precision, per-category recall, F1, and MAP.
  • Baseline Comparison: Models are benchmarked against an automatically generated analytic baseline.
  • Column Explainability: A visualization highlighting which columns have the greatest predictive power helps verify that the model has no data leakage.
  • Row Level Explainability: Users can understand the reason for individual predictions by seeing which rows contributed most to the result.

ML Ops

In order to support ongoing validation of model correctness, Kumo has the following features related to MLOps:

  • Data Source Snapshotting: During each job, data source statistics are snapshotted including size, time range, and import time, to enable faster root cause analysis.
  • Drift Detection: Distributions of features and predictions are monitored for drift. This enables early detection of issues, preventing bad predictions from being published to production.
  • Champion / Challenger: A champion / challenger approach can be adopted to validate key metrics of a newly trained model when orchestrating automatic retraining jobs through the REST API.
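As an illustration of what monitoring a distribution for drift can look like, the sketch below computes the population stability index (PSI) between a reference set of prediction scores and a current one. PSI is a commonly used drift statistic chosen here for illustration; the drift measure Kumo actually uses is not stated, and the scores and bin edges are made up.

```python
import math

def bin_fractions(values, bin_edges, eps=1e-6):
    """Fraction of values per bin; the last bin includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    for value in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            below_upper = value <= bin_edges[i + 1] if is_last else value < bin_edges[i + 1]
            if bin_edges[i] <= value and below_upper:
                counts[i] += 1
                break
    total = max(sum(counts), 1)
    # Floor each fraction at eps so the logarithm below is always defined.
    return [max(count / total, eps) for count in counts]

def population_stability_index(reference, current, bin_edges):
    """PSI = sum over bins of (cur - ref) * ln(cur / ref)."""
    ref = bin_fractions(reference, bin_edges)
    cur = bin_fractions(current, bin_edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

# Hypothetical prediction scores from two scoring runs.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
last_week = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
this_week = [0.8, 0.85, 0.9, 0.95, 0.9, 0.7, 0.3, 0.1]

psi = population_stability_index(last_week, this_week, edges)
```

A PSI near zero means the two distributions match; larger values indicate a shift worth investigating before predictions are published.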

System Integrations

Because Kumo writes directly to the data lakehouse, it is easy to connect with other cloud software or internal services. Here are just a few examples:

  • Internal tools: Use Kumo predictions and/or embeddings to power internal tools and dashboards that aid decision making.
  • Embedding Visualizer: Explore the fraud landscape by exporting user or other entity embeddings to find and explain complex fraud patterns.
  • Chatbot/LLM: Enhance internal chatbots or other LLM tools with Kumo predictions, embeddings, and XAI outputs.