The Data Pipeline
How to prepare data for machine learning — feature engineering, train-test splitting, cross-validation, and the critical mistakes that ruin models.
Data In, Predictions Out
🔄 Lesson 3 covered the algorithms. But algorithms are only as good as the data they learn from. This lesson covers the data pipeline — how raw data becomes a trained model, and the critical mistakes that can ruin everything.
Here’s the pipeline:
RAW DATA → Clean → Engineer Features → Split → Scale → Train → Evaluate
Each step matters. Skip one, and your model either fails to learn or learns the wrong things.
Step 1: Data Cleaning
Real-world data is messy. Before ML can work, you need to handle:
Missing values: Some rows have blank fields. Options: remove those rows (if few), fill with the column average/median (if numeric), or fill with the most common value (if categorical).
Outliers: A house listed at $1 or $999,999,999 will distort the model. Identify extreme values and decide whether they’re real (keep) or errors (fix or remove).
Inconsistent formatting: “New York,” “new york,” “NY,” and “N.Y.” are all the same city. Standardize text fields before they confuse the algorithm.
Duplicate records: Same data point counted twice artificially inflates that pattern’s influence. Remove exact duplicates.
✅ Quick Check: Your dataset has 50,000 customer records. 200 records are missing the “income” field. Should you delete those 200 rows? It depends. If income is critical to your prediction and 200 out of 50,000 is tiny (0.4%), deletion is fine. But if missing income correlates with something meaningful (maybe lower-income customers skip the field), deleting them introduces bias. A safer approach: fill missing income with the median value, or create a binary feature “income_missing” that lets the model learn whether the absence of income information is itself predictive.
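The cleaning steps above can be sketched in pandas. This is a minimal illustration on made-up data (the column names and values are hypothetical), showing duplicate removal, text standardization, and the median-fill plus "income_missing" flag from the Quick Check:

```python
import numpy as np
import pandas as pd

# Hypothetical customer records, including a duplicate row and missing income
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "income": [52000.0, np.nan, np.nan, 48000.0],
    "city": ["New York", "new york", "new york", "NY"],
})

# Remove exact duplicate records
df = df.drop_duplicates()

# Standardize inconsistent text formatting
df["city"] = df["city"].str.strip().str.lower().replace({"ny": "new york"})

# Flag missingness BEFORE imputing, so the model can learn from the absence itself
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
```

After these steps the duplicate row is gone, all three city spellings collapse to one value, and the missing income is filled with the median while the flag preserves the fact that it was missing.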
Step 2: Feature Engineering
Features are the input variables your model learns from. Feature engineering is the art of creating, selecting, and transforming features to help the model learn better.
Creating new features:
- From a “date of birth” column, create an “age” feature
- From “order date” and “delivery date,” create “delivery time in days”
- From “city” and “state,” create “region” (Northeast, Southeast, West, etc.)
Encoding categorical data: Algorithms work with numbers, not text. “Red,” “Blue,” “Green” becomes three binary columns: is_red (0/1), is_blue (0/1), is_green (0/1). This is called one-hot encoding.
Feature selection: Not all features help. Some are irrelevant (customer’s favorite color probably doesn’t predict loan default). Some are redundant (square footage and square meters carry the same information). Removing unhelpful features can actually improve model accuracy by reducing noise.
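Two of the feature-engineering moves above — deriving a numeric feature from dates and one-hot encoding a categorical column — look like this in pandas (the column names here are illustrative, not from a real dataset):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-02", "2024-01-05"]),
    "delivery_date": pd.to_datetime(["2024-01-05", "2024-01-11"]),
    "color": ["Red", "Blue"],
})

# Create a new numeric feature from two date columns
orders["delivery_days"] = (orders["delivery_date"] - orders["order_date"]).dt.days

# One-hot encode the categorical column into binary indicator columns
orders = pd.get_dummies(orders, columns=["color"], prefix="is")
```

`pd.get_dummies` replaces the single `color` column with one indicator column per category, exactly the is_red/is_blue/is_green pattern described above.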
Step 3: The Train-Test Split
This is where most beginners make their first serious mistake.
The rule: ALWAYS split your data BEFORE any preprocessing that uses statistics from the data (normalization, feature scaling, imputation with means).
Why this matters — data leakage:
If you normalize your data and then split, the normalization used information from the test set. Your model has indirectly “seen” future data. Its test performance will look better than real-world performance.
Correct order:
- Split data into training set (70-80%) and test set (20-30%)
- Calculate statistics (mean, min, max, etc.) from the training set only
- Apply those statistics to normalize/scale both training and test sets
- Train the model on the training set
- Evaluate on the test set
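The correct order can be sketched with scikit-learn. The synthetic data here is a stand-in for a real dataset; the point is the sequence — split first, fit the scaler on the training set only, then apply it to both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features standing in for a real dataset
X = np.random.default_rng(0).normal(loc=50, scale=10, size=(1000, 3))
y = (X[:, 0] > 50).astype(int)

# 1. Split FIRST, before computing any statistics from the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training set only...
scaler = StandardScaler().fit(X_train)

# 3. ...then apply those same training-set statistics to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler before the split would be the leakage described above: the scaler's mean and standard deviation would quietly include test-set information.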
The validation set: For tuning model parameters, you often split further: 70% training, 15% validation (for tuning), 15% test (for final evaluation). The test set should only be used ONCE — at the very end — to get an honest performance estimate.
Step 4: Cross-Validation
A single train-test split might get lucky (or unlucky) with which data ends up in the test set. Cross-validation gives a more reliable estimate.
K-fold cross-validation (the standard approach):
- Divide data into K equal parts (folds) — typically K=5 or K=10
- Train on K-1 folds, test on the remaining fold
- Repeat K times, each time using a different fold as the test set
- Average the K scores for your final performance estimate
Example with 5-fold CV on 10,000 rows:
- Round 1: Train on rows 1-8,000, test on 8,001-10,000 → 92% accuracy
- Round 2: Train on rows 1-6,000 + 8,001-10,000, test on 6,001-8,000 → 94% accuracy
- Round 3-5: Similar rotations → 91%, 93%, 95%
- Final estimate: 93% ± 1.4% (mean ± standard deviation)
The standard deviation tells you how consistent the model is. Low std (±1-2%) means reliable performance. High std (±5-10%) means the model is sensitive to which data it trains on — a red flag.
✅ Quick Check: You run 5-fold cross-validation and get scores of 95%, 94%, 96%, 93%, 94%. Your colleague runs a single train-test split and gets 97%. Whose result is more trustworthy? Yours — the 5-fold result (94.4% ± 1.0%) is based on five evaluations, each on a different held-out fold, and shows consistent performance. Your colleague’s 97% might be a lucky split. If they reran their single split 5 times with different random seeds, they’d likely see scores ranging from roughly 91% to 97% — and the average would land close to your 94.4%.
Step 5: Feature Scaling
Many algorithms are sensitive to the scale of features. If “income” ranges from 20,000 to 500,000 and “age” ranges from 18 to 90, the algorithm might over-weight income simply because its numbers are larger.
Two common approaches:
| Method | Formula | Result | Best For |
|---|---|---|---|
| Normalization | (x - min) / (max - min) | 0 to 1 | When data has a known range |
| Standardization | (x - mean) / std_dev | Mean 0, std 1 | When data follows a bell curve |
Which algorithms need scaling? Neural networks, SVM, K-means, linear regression with regularization. Which don’t? Decision trees and random forests (they split on values, so scale doesn’t matter).
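Both formulas from the table are one line of NumPy each. A small sketch on a made-up "age" column:

```python
import numpy as np

age = np.array([18.0, 30.0, 45.0, 90.0])

# Normalization (min-max): maps values into the 0-1 range
normalized = (age - age.min()) / (age.max() - age.min())

# Standardization (z-score): shifts to mean 0, rescales to std 1
standardized = (age - age.mean()) / age.std()
```

In practice you would fit these statistics on the training set only (via something like scikit-learn's `MinMaxScaler` or `StandardScaler`), for exactly the leakage reasons covered in Step 3.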
Key Takeaways
- The data pipeline: clean → engineer features → split → scale → train → evaluate
- Data leakage: ALWAYS split before preprocessing — normalize using only training set statistics
- Feature engineering creates useful inputs the model can learn from — age from date of birth, delivery time from dates
- Cross-validation (k-fold, typically k=5) gives more reliable performance estimates than a single train-test split
- Feature scaling matters for some algorithms (neural networks, SVM) but not others (decision trees, random forests)
- The quality of your data pipeline directly determines the quality of your model’s predictions
Up Next
Your model is trained. But how do you know if it’s actually good? Lesson 5 covers model evaluation — accuracy, precision, recall, F1 score, and the overfitting problem that trips up every ML practitioner.