How do I handle time correctness to prevent data leakage?
Data leakage can be mitigated in several ways, depending on its cause. The most prevalent form of leakage occurs when the target value is directly or indirectly included as a feature column in the fact table. For example, if you are predicting churn rate and you include a column describing the "amount of money this customer spends in the next 30 days", you are handing the model information it is supposed to predict, which leads to overly optimistic training scores. Such a column also breaks the assumption that all information in a fact table is known at the marked timestamp: when this pQuery runs at prediction time, that value will not be available for another 30 days!
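As a rough illustration (not a Kumo feature), the sketch below uses pandas with hypothetical column names to screen candidate feature columns for near-perfect correlation with the target, which is often a sign that a future-looking column such as "spend_next_30d" has leaked into the fact table.

```python
import pandas as pd

# Hypothetical fact table: one row per customer snapshot, with a churn target
# and candidate feature columns.
facts = pd.DataFrame({
    "customer_id":         [1, 2, 3, 4],
    "spend_last_30d":      [120.0, 0.0, 45.0, 300.0],   # known at the timestamp
    "spend_next_30d":      [0.0, 5.0, 180.0, 210.0],    # future information -> leakage
    "churned_in_next_30d": [1, 1, 0, 0],                # target
})

target = "churned_in_next_30d"
features = [c for c in facts.columns if c not in (target, "customer_id")]

# A simple screen: features that correlate almost perfectly with the target
# are worth inspecting for leakage before training.
correlations = facts[features].corrwith(facts[target]).abs().sort_values(ascending=False)
suspects = correlations[correlations > 0.95]
print(suspects)  # flags "spend_next_30d" in this toy example
```

A correlation screen like this is only a heuristic; the reliable fix is to remove any column whose value is computed over a window that starts after the row's timestamp.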
Another common type of data leakage occurs when an attribute in a fact table cannot actually be known at the marked timestamp. For example, if a fact table contains a column for the "number of items sold by the end of the week" but its timestamp marks the beginning of the week, training on this column would constitute substantial data leakage. This scenario again breaks the fact table assumption; the timestamp column should be changed to mark the end of the week.
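The following sketch, again using pandas and hypothetical table and column names, shows one way to restore the "known at the timestamp" assumption: move the timestamp to the moment the value actually becomes known, or drop the offending column.

```python
import pandas as pd

# Hypothetical fact table where "items_sold_this_week" is only known at the
# end of the week, but the timestamp marks the beginning of the week.
facts = pd.DataFrame({
    "store_id": [1, 1, 2],
    "week_start": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-01"]),
    "items_sold_this_week": [130, 95, 210],
})

# Option 1: move the timestamp to when the value actually becomes known,
# so the "known at the timestamp" assumption holds again.
facts["timestamp"] = facts["week_start"] + pd.Timedelta(days=7)

# Option 2: if the timestamp cannot be moved, drop the column instead of
# letting the model train on information from the future.
# facts = facts.drop(columns=["items_sold_this_week"])

print(facts)
```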
Data leakage can also occur in a dimension table if it includes columns that update over time or only become known after a certain point. For example, a column like "time of first purchase" should not be included, since its value is not known until the first purchase actually occurs.
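Rather than storing such a value as a static dimension column, it can be derived from a timestamped fact table as of a chosen cutoff. The pandas sketch below, with hypothetical table and column names, illustrates this point-in-time approach.

```python
import pandas as pd

# Hypothetical tables: a customers dimension table and a purchases fact table.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "purchase_time": pd.to_datetime(["2024-02-10", "2024-03-01", "2024-04-15"]),
})

# Derive "time of first purchase" from timestamped facts as of a cutoff,
# so the attribute only reflects what was known at that point in time.
cutoff = pd.Timestamp("2024-03-15")
first_purchase = (
    purchases[purchases["purchase_time"] <= cutoff]
    .groupby("customer_id")["purchase_time"]
    .min()
    .rename("first_purchase_as_of_cutoff")
    .reset_index()
)

snapshot = customers.merge(first_purchase, on="customer_id", how="left")
print(snapshot)  # customers 2 and 3 stay NaT: no purchase known before the cutoff
```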
Learn More:
- How can I improve the quality of my data?
- What mechanisms does Kumo provide to detect data leakage?
- How can I troubleshoot data quality issues or pQuery problems?
- Column Analysis