Data leakage occurs when information from outside your training dataset infiltrates the model building process. The danger is that this leaked information gives your model an unrealistic advantage because in production, it won't have that helping hand. Here's why it's problematic:
- Overly Optimistic Results: Your model learns patterns or connections that appear artificially strong because of the leak. The result is misleadingly high performance during training and validation.
- Poor Real-World Performance: When deployed, your model faces data it has never truly 'seen' before. Without the leaked information, the model's performance often crumbles in production.
A Typical Example of Data Leakage
Imagine this scenario:
- Dataset: Raw customer data, including spending habits, website interactions, time of day they make purchases, etc.
- Goal: Predict if a customer will convert (make a purchase).
- Mistake: You normalize the 'spending habits' feature by scaling it across the entire dataset (both past and future customers) before splitting the data.
- The Leak: Your normalization included not just existing training customers but also customers yet to be observed (in your test and validation sets). Your model indirectly learns hints about conversion behavior from future data.
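The scenario above can be sketched in a few lines of plain Python. The numbers here are hypothetical, and the last two customers stand in for the test set:

```python
# Hypothetical 'spending habits' values; the last two rows will later
# become the test set (the "future" customers).
spending = [120.0, 80.0, 200.0, 150.0, 500.0, 30.0]

# LEAKY: min and max are computed over ALL rows, including the two
# test-set customers, before any split happens.
lo, hi = min(spending), max(spending)
scaled_all = [(x - lo) / (hi - lo) for x in spending]

# Splitting afterwards doesn't undo the damage: every training value
# was scaled using 500.0 and 30.0, which belong to the test set.
train, test = scaled_all[:4], scaled_all[4:]
```

Note that the extremes (30.0 and 500.0) sit entirely in the test split, yet they determined how every training value was scaled.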
Why Preprocessing Must Happen AFTER Splitting
The goal of train/test/validation splits is to mimic how your model handles unseen data in real life. Preprocessing steps that use information from the entire dataset before splitting violate this principle:
- Normalization/Scaling: Minimum and maximum values from your whole dataset become part of the calculation. Your model then 'peeks' into test and validation data during training.
- Imputing Missing Values: Using global statistics (mean, median) derived from the entire dataset bakes information from other splits into your training process.
- Feature Engineering: If you create features, like the frequency of certain user actions, across the whole dataset before splitting, you leak information the same way.
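The imputation case is easy to see with a toy example. This is a minimal sketch with made-up values, where `None` marks a missing entry:

```python
from statistics import mean

# Hypothetical feature with one missing value in the training portion.
train = [10.0, None, 30.0]
test = [100.0]

# LEAKY: the fill value is the mean over the COMBINED data, so the
# test-set value (100.0) pulls the imputed number upward.
all_known = [x for x in train + test if x is not None]
leaky_fill = mean(all_known)   # (10 + 30 + 100) / 3

# CORRECT: the fill value comes from the training split only.
train_known = [x for x in train if x is not None]
safe_fill = mean(train_known)  # (10 + 30) / 2 = 20.0
```

A single test-set outlier more than doubles the imputed value here, which is exactly the kind of hint the model should never receive.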
The Right Order
1. Split Your Data: Do this first, into train, validation, and test sets.
2. Fit Preprocessing on the Training Set: Compute any statistics your preprocessing needs (means, standard deviations, min/max values, etc.) from the training data alone.
3. Transform the Training Set: Apply the transformations using those training-set parameters.
4. Crucially, Reuse the Training-Set Parameters: Apply the same transformations, with the same values derived from the training set, to the validation and test sets.
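The steps above can be sketched in plain Python. The data and the positional split are hypothetical, chosen only to keep the example short:

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn the mean and (population) std from the training split only."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Standardize values using previously fitted parameters."""
    return [(x - mu) / sigma for x in values]

# 1. Split first (a simple positional split, for illustration).
data = [4.0, 8.0, 6.0, 2.0, 10.0, 9.0]
train, test = data[:4], data[4:]

# 2. Fit parameters on the training set alone.
mu, sigma = fit_standardizer(train)

# 3. Transform the training set with those parameters.
train_scaled = transform(train, mu, sigma)

# 4. Reuse the SAME training-set parameters on the test set --
#    never refit on test data.
test_scaled = transform(test, mu, sigma)
```

In a scikit-learn workflow the same discipline is what `scaler.fit(X_train)` followed by `scaler.transform(X_test)` expresses: `fit` only ever sees the training split.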
Key Benefits