
How can I compare a predictive query to an external model?

Best practices for comparing the performance of a Predictive Query against a historical baseline, heuristic, or machine learning model.

When developing a new predictive query for decision-making in your organization, you may need to answer a question for your stakeholders: "how much benefit does this predictive query bring, compared to the alternatives?" The alternative could be as simple as a business rule or heuristic, or as complex as an existing machine learning model.

Kumo provides a variety of tools and best practices to make this comparison as painless as possible.

Comparing Against Historical Baselines

Often, you can deliver the most impact by going from 0 to 1 (i.e., by deploying machine learning to a business process where no machine learning was used before). For example, you might be developing an entirely new email notification for your customers, and you would like to recommend a set of products for them to buy.

In this situation, it is recommended to compare the performance of your machine learning model against a simple historical baseline. For example, in the case of product recommendation, a historical baseline could be the "top 10 most popular products on your platform" or "the last 5 items that the user viewed".

To make your job easier, Kumo can automatically generate many simple historical baselines for your predictive query, and automatically compare its performance against the historical baseline using metrics like Accuracy and Recall. In situations that require comparisons against more complex historical baselines, you can quickly download the holdout dataset from Kumo's user interface and load it into a notebook or spreadsheet for custom analysis.
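If you do pull the holdout data into a notebook, the comparison itself takes only a few lines. As a minimal sketch, the Python snippet below scores a "top 10 most popular products" baseline against a model's recommendations using recall@10; the file name, column names, and product IDs are hypothetical and assume the export contains per-user lists of actual and predicted items.

```python
import ast
import pandas as pd

# Hypothetical export of the holdout dataset downloaded from the Kumo UI.
# Assumed columns: user_id, purchased_item_ids (ground-truth purchases),
# and predicted_item_ids (the model's ranked recommendations).
holdout = pd.read_csv(
    "kumo_holdout.csv",
    converters={"purchased_item_ids": ast.literal_eval,
                "predicted_item_ids": ast.literal_eval},
)

# Baseline: recommend the same top-10 most popular products to every user.
# (Placeholder IDs; in practice, compute this list from your training window.)
top_10_popular = ["P001", "P002", "P003", "P004", "P005",
                  "P006", "P007", "P008", "P009", "P010"]

def recall_at_k(recommended, actual, k=10):
    """Fraction of a user's actual purchases captured by the top-k recommendations."""
    if not actual:
        return float("nan")
    return len(set(recommended[:k]) & set(actual)) / len(actual)

holdout["recall_baseline"] = holdout["purchased_item_ids"].apply(
    lambda actual: recall_at_k(top_10_popular, actual))
holdout["recall_model"] = holdout.apply(
    lambda row: recall_at_k(row["predicted_item_ids"], row["purchased_item_ids"]),
    axis=1)

print("Popularity baseline recall@10:", holdout["recall_baseline"].mean())
print("Model recall@10:              ", holdout["recall_model"].mean())
```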

Comparing Against Existing Models

In some situations, you may be trying to replace an existing production model with a predictive query. You might be doing this to reduce maintenance cost, or possibly to reap the benefits of deep learning to improve model performance. Regardless of your ultimate goal, you will likely need to compare the performance of the existing model with the Kumo-trained model before proceeding.

In these situations, you ultimately need to compare the Kumo model and your existing model on a common holdout dataset. In other words, you need to hold out a portion of your labeled data, generate predictions using both models, and then compare standard evaluation metrics (e.g., AUROC). While this may seem straightforward at a high level, the guidelines below can help you avoid evaluation-setup mistakes that cause incorrect or confusing results.
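The guidelines below walk through the holdout setup itself; the sketch here covers only the final comparison step, assuming both models' scores have already been joined to the same labeled holdout rows (the file and column names are hypothetical).

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical file with one row per holdout entity:
#   label          - ground-truth binary outcome
#   score_existing - probability from the existing production model
#   score_kumo     - probability from the Kumo batch prediction output
holdout = pd.read_csv("common_holdout_scores.csv")

auroc_existing = roc_auc_score(holdout["label"], holdout["score_existing"])
auroc_kumo = roc_auc_score(holdout["label"], holdout["score_kumo"])

print(f"Existing model AUROC: {auroc_existing:.4f}")
print(f"Kumo model AUROC:     {auroc_kumo:.4f}")
```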

  • Use Kumo's model planner to define the holdout time split: The simplest way to define a common holdout dataset is to specify a time range. For example, with a single line of code inside the Kumo model planner, you can tell Kumo to train on data up to a specific date and hold out the data after that date. Once the model is trained, Kumo will automatically generate model evaluation metrics on the held-out dataset. Assuming that you have correctly specified the time_col for all tables, Kumo's trainer will ignore all held-out data during training, so you don't have to worry about leaking data from the future. Remember to also double-check your comparison model for data leakage.
  • Bring your own holdout dataset: If you don't want to use Kumo's built-in time split feature, you can physically separate your training data from your holdout data prior to Kumo ingestion. In this case, you would train your model in Kumo on the training dataset, replace the graph with the training + holdout dataset, and then generate your batch predictions. At that point, you can calculate the model evaluation metrics manually by comparing your batch predictions with your source-of-truth labels. This approach can give you more confidence that the Kumo model is not "cheating" by accidentally leaking information from the future (which could be caused by a misconfigured graph). However, it is also more prone to human error: you need to hold out the data properly across all tables in the graph, as mistakes can throw off the model results. If you are doing this for the first time, share the full holdout dataset with your Kumo counterpart so they can verify that it was done correctly.
  • Perform sanity checks on the holdout data: This seemingly common-sense step is often overlooked, yet if your holdout dataset contains bad data, your results are completely invalidated. The following are some typical holdout data issues to look for (a minimal sketch of these checks follows this list):
    • Duplicate rows in the holdout dataset
    • Major distribution shifts from train to holdout
    • Outliers that cause significant variance in the evaluation metrics
  • Remember that Kumo needs raw data: GNNs need access to raw data to perform well. If you run Kumo on pre-engineered features, you will at best match the performance of classical models (e.g., GBDT), because a lot of information gets dropped during the feature engineering process and there is nothing more for Kumo to learn.
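As referenced above, here is a minimal sketch of the holdout sanity checks; the file names, column names, and specific statistics are illustrative only.

```python
import pandas as pd

# Hypothetical file and column names; adapt them to your own exports.
train = pd.read_csv("train_labels.csv")
holdout = pd.read_csv("holdout_labels.csv")

# 1. Duplicate rows in the holdout dataset
n_dupes = holdout.duplicated(subset=["entity_id", "timestamp"]).sum()
print(f"Duplicate holdout rows: {n_dupes}")

# 2. Major distribution shift from train to holdout
#    (compare summary statistics of the target; large gaps warrant investigation)
print(train["target"].describe())
print(holdout["target"].describe())

# 3. Outliers that cause significant variance in the evaluation metrics
#    (inspect the extreme tails of the holdout target)
print(holdout["target"].quantile([0.001, 0.01, 0.99, 0.999]))
```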
