Preventing data leakage and handling time correctness

Data leakage occurs when information used to train the model is not available in the same form at inference time. This can produce high training accuracy that cannot be reproduced at inference. The following are common causes of data leakage and ways to prevent them.

Data Leakage from Fact Tables: The most obvious form of data leakage occurs when the target value is directly or indirectly added as a feature column in the <<glossary:Fact Table>>. For example, if you are predicting churn from future customer interactions and you include a column describing the "amount of money this customer spends in the next 30 days," you are handing the model direct information about the target, which leads to overly optimistic training scores. Such a column also breaks the assumption that every value in a fact table row is known at the row's timestamp: when you make a prediction with this pQuery, that data will not be available as input for another 30 days.
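As a rough illustration of the difference between a safe feature and a leaky one, the sketch below computes customer spend over two windows around a prediction time. The table layout, the `spend_in_window` helper, and all values are hypothetical and are not part of any product API.

```python
from datetime import datetime, timedelta

# Hypothetical transaction fact rows: (customer_id, timestamp, amount).
transactions = [
    ("c1", datetime(2024, 1, 5), 20.0),
    ("c1", datetime(2024, 1, 20), 35.0),
    ("c1", datetime(2024, 2, 10), 50.0),
]

def spend_in_window(rows, customer_id, start, end):
    """Sum amounts for one customer where start <= timestamp < end."""
    return sum(
        amount
        for cid, ts, amount in rows
        if cid == customer_id and start <= ts < end
    )

prediction_time = datetime(2024, 2, 1)
window = timedelta(days=30)

# Safe feature: built only from rows stamped before the prediction time.
past_spend = spend_in_window(
    transactions, "c1", prediction_time - window, prediction_time
)

# Leaky column: looks 30 days ahead of the prediction time.
# This belongs in the target definition, never among the features.
future_spend = spend_in_window(
    transactions, "c1", prediction_time, prediction_time + window
)
```

Only `past_spend` could be computed at inference time; `future_spend` depends on transactions that have not happened yet at `prediction_time`.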

Another common type of data leakage occurs when an attribute in a fact table cannot be known at the row's timestamp. For example, if a column contains the "number of items sold by the end of the week" but the timestamp marks the beginning of the week, including this column during training would constitute substantial data leakage. This scenario also breaks the fact table assumption; the fix is to change the timestamp column so that it marks the end of the week.
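The timestamp fix for the weekly example could be sketched as follows, assuming each row is currently stamped with the start of a seven-day week. The column names and the `restamp_to_week_end` helper are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical weekly fact rows, currently stamped with the START of the week.
weekly_sales = [
    {"store_id": "s1", "timestamp": datetime(2024, 3, 4), "items_sold_by_week_end": 120},
    {"store_id": "s1", "timestamp": datetime(2024, 3, 11), "items_sold_by_week_end": 95},
]

def restamp_to_week_end(rows, days=7):
    """Return copies of the rows with the timestamp moved to the end of the week."""
    return [
        {**row, "timestamp": row["timestamp"] + timedelta(days=days)}
        for row in rows
    ]

# After re-stamping, each total is attached to the moment it becomes known.
corrected = restamp_to_week_end(weekly_sales)
```

The function returns new rows rather than mutating the input, so the original table stays available for comparison.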

Data Leakage from Dimension Tables: Data leakage may occur in a <<glossary:Dimension Table>> if you include columns that update over time (or that only become known after a certain time). For example, you should not include a column like "time of first purchase", since its value is only known once the first purchase has taken place.
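One possible way to handle such a column, shown here only as a sketch, is to drop it from the dimension table and record it as timestamped event rows instead. This restructuring is an assumption on our part rather than a documented procedure, and the table layout, the `split_dimension` helper, and the column names are hypothetical.

```python
from datetime import datetime

# Hypothetical customer dimension rows with one time-dependent column.
customers = [
    {"customer_id": "c1", "country": "DE", "first_purchase_time": datetime(2024, 1, 5)},
    {"customer_id": "c2", "country": "US", "first_purchase_time": None},
]

def split_dimension(rows, key, time_dependent):
    """Separate static attributes from columns that only become known later.

    Returns (static_rows, events): static_rows keep only time-independent
    columns; events holds one timestamped row per known time-dependent value.
    """
    static_rows, events = [], []
    for row in rows:
        static_rows.append(
            {k: v for k, v in row.items() if k not in time_dependent}
        )
        for col in sorted(time_dependent):
            if row.get(col) is not None:
                events.append(
                    {key: row[key], "event": col, "timestamp": row[col]}
                )
    return static_rows, events

static_rows, events = split_dimension(
    customers, key="customer_id", time_dependent={"first_purchase_time"}
)
```

The static rows can then be used as a dimension table, while each event row carries the time at which its information became known.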

Learn More: