Data Cleaning and Transformation
Use AI to clean messy data — fix inconsistencies, handle duplicates, standardize formats, and build transformation pipelines that turn raw data into reliable datasets.
🔄 Quick Recall: In the last lesson, you designed database schemas with AI — from business requirements to normalized tables with proper types, constraints, and indexes. Now comes the real challenge: the data going INTO those tables is rarely clean.
The Data Quality Problem
Every database professional knows this truth: data is messy. Customer names have inconsistent capitalization. Phone numbers come in 5 different formats. Dates mix MM/DD/YYYY with YYYY-MM-DD. Duplicates lurk everywhere.
AI transforms data cleaning from hours of manual scripting into a guided, systematic process. You describe the mess, AI generates the cleanup scripts.
Step 1: Profile Before You Clean
Never start cleaning without understanding the data first:
I have a table called raw_customers with these columns:
- name (TEXT)
- email (TEXT)
- phone (TEXT)
- city (TEXT)
- state (TEXT)
- signup_date (TEXT)
- total_purchases (TEXT)
Write SQL queries to profile this data:
1. Count of NULL/empty values per column
2. Count of distinct values per column (to spot inconsistencies)
3. Sample of unique values for city and state (to see format issues)
4. Check if signup_date has consistent formatting
5. Check if total_purchases can be safely cast to NUMERIC
6. Identify potential duplicates (same email, different names)
This profiling step typically reveals 5-10 issues you wouldn’t have noticed otherwise.
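The exact queries AI returns will depend on your schema, but a minimal sketch of a few of these checks, assuming PostgreSQL and the raw_customers columns listed above, looks like this:
-- NULL/empty counts and distinct counts (repeat the pattern for each column)
SELECT
  COUNT(*) FILTER (WHERE email IS NULL OR email = '') AS empty_emails,
  COUNT(*) FILTER (WHERE phone IS NULL OR phone = '') AS empty_phones,
  COUNT(DISTINCT state) AS distinct_states
FROM raw_customers;
-- Sample the distinct values that may need standardizing
SELECT state, COUNT(*) FROM raw_customers GROUP BY state ORDER BY COUNT(*) DESC;
-- Rows where total_purchases will not cast cleanly to NUMERIC
SELECT total_purchases FROM raw_customers WHERE total_purchases !~ '^\$?\d+(\.\d+)?$';
-- Potential duplicates: same email, different names
SELECT email, COUNT(DISTINCT name) AS name_variants
FROM raw_customers
GROUP BY email
HAVING COUNT(DISTINCT name) > 1;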
✅ Quick Check: Why does the profiling query check “distinct values for city and state” specifically?
Because city and state columns are notorious for inconsistencies. “New York”, “new york”, “NY City”, “New York City”, and “NYC” might all appear in the same column. Seeing the distinct values reveals exactly how messy the data is and helps you plan the standardization approach. You might find that 80% of your cleaning work is in just 2-3 columns.
Step 2: Build the Cleaning Pipeline
Once you know the issues, have AI generate the cleaning scripts:
Based on profiling, I found these issues in raw_customers:
1. Names have inconsistent capitalization ("john smith", "JANE DOE", "Bob Jones")
2. Phone numbers mix formats: "(555) 123-4567", "555-123-4567", "5551234567"
3. State column has both abbreviations and full names ("CA", "California", "calif")
4. signup_date has 3 formats: "2024-01-15", "01/15/2024", "January 15, 2024"
5. total_purchases is TEXT but should be NUMERIC; some have "$" prefix
6. 47 duplicate emails
Database: PostgreSQL
Write a cleaning script that:
1. Creates a cleaned_customers table with proper data types
2. Standardizes each column (title case names, E.164 phone format, state abbreviations)
3. Parses all date formats into proper DATE type
4. Converts total_purchases to NUMERIC
5. Deduplicates by email (keep the most recent signup)
6. Makes the script idempotent (safe to rerun)
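The prompt above asks for the whole pipeline; as a reference point, here is a hedged sketch of just the phone and state standardization steps in PostgreSQL. The "+1" US prefix and the state_lookup mapping table are illustrative assumptions, not something the prompt guarantees:
-- Phone to E.164: strip non-digits, prepend +1 (assumes 10-digit US numbers)
UPDATE cleaned_customers
SET phone = '+1' || regexp_replace(phone, '\D', '', 'g')
WHERE regexp_replace(phone, '\D', '', 'g') ~ '^\d{10}$';
-- Map state names and variants to two-letter abbreviations
-- using a small hypothetical lookup table (variant -> abbrev)
UPDATE cleaned_customers c
SET state = l.abbrev
FROM state_lookup l
WHERE LOWER(TRIM(c.state)) = LOWER(l.variant);
Keeping the state mapping in a lookup table instead of one giant CASE expression makes it easy to add new variants as profiling uncovers them.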
Common Cleaning Patterns
AI handles these standard cleaning tasks reliably:
Standardizing text:
-- Title case names (INITCAP uppercases the first letter of each word and lowercases the rest)
UPDATE customers SET name = INITCAP(name);
-- Standardize email (lowercase, trim whitespace)
UPDATE customers SET email = LOWER(TRIM(email));
Handling duplicates:
-- Keep the most recent entry per email
DELETE FROM customers
WHERE id NOT IN (
  SELECT DISTINCT ON (email) id
  FROM customers
  ORDER BY email, created_at DESC
);
Fixing dates:
-- Convert mixed date formats (AI generates the CASE logic)
UPDATE raw_data SET
  clean_date = CASE
    WHEN date_col ~ '^\d{4}-\d{2}-\d{2}$' THEN date_col::DATE
    WHEN date_col ~ '^\d{2}/\d{2}/\d{4}$' THEN TO_DATE(date_col, 'MM/DD/YYYY')
    WHEN date_col ~ '^[A-Za-z]+ \d{1,2}, \d{4}$' THEN TO_DATE(date_col, 'Month DD, YYYY')
    ELSE NULL
  END;
Step 3: Validate the Results
After cleaning, verify that the transformation worked:
I just ran my cleaning script on raw_customers → cleaned_customers.
Write validation queries that check:
1. No NULL values in required columns (name, email)
2. All emails match a basic email pattern
3. All phone numbers are in consistent format
4. All dates fall within reasonable range (not in the future, not before 2000)
5. No duplicate emails remain
6. Row count comparison: raw vs. cleaned (how many rows were dropped and why?)
The row count check is critical. If you started with 10,000 raw rows and ended with 8,500 cleaned rows, you need to account for the 1,500 that were dropped or merged. Were they all duplicates? Did some fail validation? Unexplained data loss is a serious red flag.
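A minimal sketch of these validation queries, assuming PostgreSQL and the raw_customers / cleaned_customers tables from this lesson:
-- Required columns must not be NULL or empty
SELECT COUNT(*) AS bad_names FROM cleaned_customers WHERE name IS NULL OR name = '';
-- Emails must match a basic pattern
SELECT COUNT(*) AS bad_emails
FROM cleaned_customers
WHERE email !~ '^[^@\s]+@[^@\s]+\.[^@\s]+$';
-- No duplicate emails should remain
SELECT email, COUNT(*) FROM cleaned_customers GROUP BY email HAVING COUNT(*) > 1;
-- Dates must fall within a reasonable range
SELECT COUNT(*) AS bad_dates
FROM cleaned_customers
WHERE signup_date > CURRENT_DATE OR signup_date < DATE '2000-01-01';
-- Row count comparison: every dropped row should be explainable
SELECT
  (SELECT COUNT(*) FROM raw_customers)     AS raw_rows,
  (SELECT COUNT(*) FROM cleaned_customers) AS cleaned_rows;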
ETL Patterns with AI
For recurring data loads, build Extract-Transform-Load pipelines:
I receive a CSV file weekly from [source] with these columns: [list columns].
The data needs to be cleaned and loaded into my [database] table.
Create an ETL script that:
1. Loads the CSV into a staging table
2. Applies cleaning transformations (standardize formats, handle nulls)
3. Validates the cleaned data (reject rows that fail checks)
4. Merges into the production table (insert new rows, update existing ones)
5. Logs the results (rows processed, inserted, updated, rejected)
6. Is idempotent and safe to rerun
✅ Quick Check: Why does the ETL pipeline use a staging table instead of loading directly into the production table?
Because a staging table lets you clean and validate data before it touches production. If the CSV contains corrupt data, you catch it in staging — the production table stays clean. It’s a safety buffer: load raw data in, clean it, validate it, then merge only the good data into production. This prevents dirty data from ever reaching your application.
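To make that concrete, here is a minimal sketch of the staging-and-merge steps in PostgreSQL. The staging_customers table, the unique constraint on customers.email, and the column list are all assumptions for illustration:
-- 1. Reset the staging table each run so reloads are idempotent
TRUNCATE staging_customers;
-- (load the weekly CSV into staging_customers here, e.g. with psql's \copy or your loader)
-- 2./3. Apply the cleaning and validation patterns from earlier in the lesson
-- 4. Merge only rows that pass basic checks into production; upsert on email
WITH deduped AS (
  SELECT DISTINCT ON (email)
         email, name, phone, state, signup_date, total_purchases
  FROM staging_customers
  WHERE email ~ '^[^@\s]+@[^@\s]+\.[^@\s]+$'
  ORDER BY email, signup_date DESC
)
INSERT INTO customers (email, name, phone, state, signup_date, total_purchases)
SELECT * FROM deduped
ON CONFLICT (email) DO UPDATE
SET name            = EXCLUDED.name,
    phone           = EXCLUDED.phone,
    state           = EXCLUDED.state,
    signup_date     = EXCLUDED.signup_date,
    total_purchases = EXCLUDED.total_purchases;
The logging step from the prompt can be as simple as recording the staging and merged row counts in a separate log table after each run.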
Exercise: Clean a Real Dataset
Take a messy dataset from your own work (or download a practice dataset from Kaggle) and:
- Profile it with AI-generated queries
- Identify the top 5 data quality issues
- Have AI generate a cleaning script
- Run the script and validate results
- Check the row count: raw vs. cleaned
Key Takeaways
- Always profile data before cleaning — understanding the mess helps you plan a systematic approach
- AI generates cleaning scripts from descriptions of your data issues, turning hours of manual work into minutes
- Make cleaning scripts idempotent (safe to rerun) for recurring data loads
- Use SQL for bulk standardization (dates, formats, deduplication) and Python for complex logic (fuzzy matching, API lookups)
- Always validate after cleaning: check for nulls, format consistency, reasonable ranges, and account for every dropped row
- ETL pipelines with staging tables protect production data from dirty imports
Up Next: In the next lesson, you’ll learn query optimization and performance — how to use AI to find and fix slow queries that bog down your database.