Data leakage occurs when information from outside the training dataset infiltrates the model-building process. The danger is that this leaked information gives your model an unrealistic advantage: in production it won't have that helping hand, so your offline evaluation metrics overstate how the model will actually perform.

A Typical Example of Data Leakage

Imagine this scenario:

  1. Dataset: Raw customer data, including spending habits, website interactions, time of day they make purchases, etc.
  2. Goal: Predict if a customer will convert (make a purchase).
  3. Mistake: You normalize the 'spending habits' feature by fitting the scaler across the entire dataset (both past and future customers) before splitting the data, so the scaling parameters encode information about customers who will later land in the test set (see the sketch after this list).
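
A minimal sketch of that mistake, assuming scikit-learn; the data is synthetic and the lognormal draw simply stands in for a skewed 'spending habits' feature:

```python
# Leaky pipeline: the scaler is fitted BEFORE the split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
spending = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 1))  # synthetic feature

# The scaler's mean and std are computed over ALL customers, including
# the ones that will later form the test set.
scaler = StandardScaler().fit(spending)
spending_scaled = scaler.transform(spending)

# Only now is the data split -- every training row was scaled using
# statistics that the test rows helped compute.
X_train, X_test = train_test_split(spending_scaled, test_size=0.2, random_state=42)
```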

Why Preprocessing Must Happen AFTER Splitting

The goal of train/test/validation splits is to mimic how your model handles unseen data in real life. Preprocessing steps that use information from the entire dataset before splitting violate this principle: statistics such as means and standard deviations absorb information from rows that are supposed to be unseen, so test scores no longer reflect performance on genuinely new data.
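
One way to see the violation concretely, again on synthetic data: the statistics a scaler learns from the full dataset differ from those it learns from the training rows alone, which means the test rows are influencing the transform.

```python
# The full-data mean includes the test rows; the train-only mean is the
# only statistic a production model could legitimately know.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
spending = rng.lognormal(mean=3.0, sigma=1.0, size=1000)
train, test = train_test_split(spending, test_size=0.2, random_state=42)

print(f"full-data mean:  {spending.mean():.3f}")  # leaks test information
print(f"train-only mean: {train.mean():.3f}")     # legitimate to use
```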

The Right Order

  1. Split Your Data: Do this first into train, test, and validation sets.
  2. Fit Preprocessing on the Training Set: Compute the statistics your preprocessing needs (means, standard deviations, etc.) from the training data alone.
  3. Apply Transformations: With parameters calculated from the training set alone, transform the training set.
  4. Crucially, Reuse Training-Set Parameters: Apply the same transformations, with values derived from the training set, to the validation and test sets, as in the sketch below.
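
Putting the four steps together, a minimal sketch using scikit-learn's StandardScaler (the split sizes and synthetic data are illustrative):

```python
# Correct order: split first, fit the scaler on the training set only,
# then reuse that fitted scaler everywhere else.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 1))  # synthetic feature

# 1. Split first: 60% train, 20% validation, 20% test.
X_train, X_tmp = train_test_split(X, test_size=0.4, random_state=42)
X_val, X_test = train_test_split(X_tmp, test_size=0.5, random_state=42)

# 2-3. Compute scaling parameters from the training set alone, then
#      transform the training set with them.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# 4. Apply the SAME fitted scaler to validation and test -- never call
#    fit() on anything but the training set.
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
```

In practice, wrapping the scaler and model in a scikit-learn Pipeline enforces this discipline automatically, including inside cross-validation.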

Key Benefits