HomeDocsAPI Reference
Kumo.ai
Docs

How can I improve the quality of my data?

General tips for ensuring data quality when creating graphs and executing pQueries and batch predictions.

Clean and Indicative Data

To ensure data quality, you should verify that the columns included in your Kumo tables contain accurate information with meaningful correlations to the prediction target.

Some tips on how to select columns:

  • Start by identifying all columns (from one or multiple tables) that might be related to or provide signals for predicting the target. This step marks an upper bound for the amount of information to ingest into Kumo.

  • As a quick first-level filter, you can remove redundant columns—columns that contain the same underlying information, but exist as multiple columns under one or multiple tables. For example, if you are a retailer with item sales data stored in two columns, one for per pound price values and the other for per kilogram price values, you should only include one of these columns to prevent redundancy.

  • Filter out columns where data is not actively maintained. For example, your company might have stopped collecting one type of data, but the column that stores historically collected data of that type still exists in your data store; or the definition of a column might have changed over time. Consult with the data owners/managers to confirm whether column names accurately describe their contents and are continuously updated.

  • Do not include columns that gets updated over time in a dimension table.

  • Check for formatting errors.

    • Categorical data: check whether there are typos, extra spaces, inconsistent capitalization, etc.
    • Numerical data: make sure that all values are entered as numbers instead of an English word (e.g., use “2” instead of “two”).
    • Multi-categorical data: separators should be consistently used across relevant values. For example, if using "," as the separator in one entry, refrain from using ";" (or some other symbol) as a separator for other entries.
    • Missing values: always leave fields with missing values completely blank—do not use strings or numbers like “null”, “N/A”, “-1”, etc.

Preventing Data Leakage

Data leakage occurs when information used to train the model is unavailable in the same manner at inference time. This may lead to high levels of training accuracy that are not reproducible during inference. The following are common causes of data leakage and how to prevent them from occuring.

Data Leakage from Fact Tables: The most obvious form of data leakage occurs when the target value is directly or indirectly added as a feature column in the Fact Table. For example, if predicting churn rate based on future customer interactions and you include a column describing the “amount of money this customer spends in the next 30 days," you are providing your model with predictive power over the target; this will lead to overly optimistic training prediction scores. In addition, having such a column in your fact table breaks the assumption that information is known at the marked timestamp, since you are making a prediction that—using this pQuery—you won’t have this data available as input for another 30 days!

Another common data leakage type occurs when an attribute in a fact table cannot be known at the timestamp. For example, if you have a column containing the "amount of items sold by the end of the week", but the timestamp marks the beginning of the week, including this column during training would constitute substantial data leakage. Again, this scenario breaks the fact table assumption—the timestamp column should be changed to mark the end of the week.

Data Leakage from Dimension Tables: Data leakage may occur in a Dimension Table if you include columns that update over time (or only become known after a certain time). For example, you should not include a column like "time of first purchase", since this is only known after the time of first purchase.


Missing Data

Kumo identifies missing values as entries that are completely left blank, and will not treat columns marked by special strings such as "NaN" nor "none" as missing values. Missing values in numeric columns are also commonly filled with invalid numbers (e.g., -1); these entries should be left blank to be considered missing data.

If one feature column has a high percentage of data missing, you may want to find out whether it's possible to retrieve the missing values. If not, you should decide whether the percentage present is indicative enough to overcome the potential noise introduced by the missing values.

You can view the percentage of missing data per column by clicking on Column Statistics under your Table summary and looking at the "Missing" column.

In some cases, you may want to keep a column with an abundance of missing values if the presence (or lack thereof) of a value in the column is an informative signal that the model can use for learning.

In general, a column should not be included if greater than 40% of data is missing at random. An exception to this is when the prediction target is a column in a dimension table (i.e., static); in this case, Kumo automatically views data points with missing target values as points for performing inference on. The percentage of missing values in the target column determines the training set and batch prediction output size, rather than being an indicator of data quality.

Learn More: