Data Cleaning and Preparation with AI
Learn AI-powered techniques to clean messy spreadsheet data: fix dates, remove duplicates, standardize formats, and handle missing values in minutes.
You just received a spreadsheet from a client. The dates are in three different formats. Names are sometimes “John Smith,” sometimes “smith, john,” sometimes “JOHN SMITH.” There are 47 duplicate rows, and random blank cells scattered throughout.
This is reality. And it’s where AI saves you the most time.
🔄 Quick Recall: In the previous lesson, you learned the CONTEXT → TASK → CONSTRAINTS → OUTPUT prompt framework. You’ll use that same framework here — but instead of generating calculation formulas, you’ll generate cleaning formulas.
The Data Cleaning Workflow
Before diving into AI prompts, here’s the order that works best:
- Remove duplicates — Shrink the dataset first
- Standardize formats — Dates, names, categories, phone numbers
- Handle missing values — Fill, flag, or remove blank cells
- Validate — Check cleaned data against known values
Let’s tackle each step with AI.
Step 1: Finding and Removing Duplicates
The problem: Your dataset has 2,000 rows but some entries appear multiple times with slight variations (“John Smith” vs “John Smith” with an extra space).
AI prompt:
“In Google Sheets, I have data in columns A through E (Name, Email, Phone, Date, Amount). Create a formula for column F that flags TRUE if this row is a duplicate based on the Email in column B (case-insensitive). Mark the first occurrence as FALSE and subsequent duplicates as TRUE.”
AI will generate something like:
=COUNTIF(B$1:B1, LOWER(B2)) > 0
After flagging duplicates, you can filter and review before deleting.
Tip: Always flag duplicates first instead of auto-deleting. Some “duplicates” are legitimate (e.g., two orders from the same customer).
✅ Quick Check: Why should you flag duplicates before deleting them? (Answer: Some apparent duplicates may be legitimate entries — like repeat orders from the same customer — that you don’t want to remove.)
Step 2: Standardizing Formats
Format inconsistencies are the most common data problem. Here are AI prompts for the big three:
Dates
Prompt:
“Column A has dates in mixed formats: some are MM/DD/YYYY, some are DD-Mon-YYYY (like 15-Jan-2026), and some are just text like ‘January 15, 2026’. Create a formula for column B that converts all of these to YYYY-MM-DD format.”
Names
Prompt:
“Column A has names in inconsistent formats: ‘john smith’, ‘SMITH, JOHN’, ‘John Smith’, ‘smith,john’. Create a formula that standardizes all names to ‘First Last’ format with proper capitalization.”
AI typically uses a combination of PROPER(), TRIM(), SUBSTITUTE(), and text extraction functions.
Phone Numbers
Prompt:
“Column C has US phone numbers in mixed formats: (555) 123-4567, 555-123-4567, 5551234567, +1 555 123 4567. Create a formula that standardizes all to (XXX) XXX-XXXX format.”
✅ Quick Check: When standardizing names, why is specifying the inconsistent formats in your prompt important? (Answer: AI needs to know what variations exist so it can generate logic that handles each case — a formula for fixing “SMITH, JOHN” is different from one fixing “john smith”.)
Step 3: Handling Missing Values
Empty cells break formulas. Here’s how AI helps you deal with them:
Option A — Fill with defaults:
“Column D has some blank cells. Create a formula for column E that copies the value from D if it exists, or shows ‘N/A’ if D is blank.”
Option B — Fill with calculated values:
“Column C has sales figures but some cells are blank. Create a formula that fills blanks with the average of the non-blank values in column C.”
Option C — Flag for review:
“Create a formula for column F that shows ‘MISSING’ if any cell in columns A through E for that row is blank, and ‘COMPLETE’ if all cells have values.”
| Strategy | When to Use | AI Prompt Pattern |
|---|---|---|
| Fill with default | Non-critical fields (notes, secondary info) | “If blank, show [default value]” |
| Fill with calculation | Numeric fields where estimation is acceptable | “If blank, use the average/median of the column” |
| Flag for review | Critical fields (IDs, amounts, dates) | “Mark rows with any blank as INCOMPLETE” |
Step 4: Validation
After cleaning, verify your work:
Prompt:
“I’ve cleaned data in columns A through E. Create formulas that check: (1) all dates in column A are valid dates in 2026, (2) all emails in column B contain an @ symbol, (3) all amounts in column C are positive numbers. Show PASS or FAIL for each check.”
This gives you a quality dashboard for your cleaned data.
Batch Cleaning with AI Chat
For one-off cleaning jobs, you don’t even need formulas. Copy the messy data directly into an AI chat:
Prompt:
“Here’s a list of 50 company names with inconsistencies. Clean them up: standardize capitalization, remove extra spaces, fix obvious typos, and return the cleaned list in the same order.”
Then paste the cleaned data back. This works great for datasets under a few hundred rows.
Practice Exercise
Try this with your own data or create a test dataset:
- Create 20 rows with intentional problems: mixed date formats, duplicate emails, inconsistent names, blank cells
- Ask AI to generate a duplicate-detection formula
- Ask AI for a date standardization formula
- Ask AI for a validation formula that checks all rows
- Compare the time it takes vs. fixing manually
Key Takeaways
- Follow the cleaning order: duplicates → formats → blanks → validation
- Use the same CONTEXT → TASK → CONSTRAINTS → OUTPUT framework for cleaning prompts
- Always specify the exact inconsistencies you see — AI handles them better with examples
- Flag issues before deleting — review first, then remove
- For small datasets, paste directly into AI chat for instant cleaning
Up Next
In the next lesson, you’ll master the formulas that terrify most spreadsheet users: VLOOKUP, INDEX/MATCH, and XLOOKUP. With AI, these become as easy as describing what you want to find.
Knowledge Check
Complete the quiz above first
Lesson completed!