Lesson 3 15 min

Data Cleaning and Preparation with AI

Learn AI-powered techniques to clean messy spreadsheet data: fix dates, remove duplicates, standardize formats, and handle missing values in minutes.

You just received a spreadsheet from a client. The dates are in three different formats. Names are sometimes “John Smith,” sometimes “smith, john,” sometimes “JOHN SMITH.” There are 47 duplicate rows, and random blank cells scattered throughout.

This is reality. And it’s where AI saves you the most time.

🔄 Quick Recall: In the previous lesson, you learned the CONTEXT → TASK → CONSTRAINTS → OUTPUT prompt framework. You’ll use that same framework here — but instead of generating calculation formulas, you’ll generate cleaning formulas.

The Data Cleaning Workflow

Before diving into AI prompts, here’s the order that works best:

  1. Remove duplicates — Shrink the dataset first
  2. Standardize formats — Dates, names, categories, phone numbers
  3. Handle missing values — Fill, flag, or remove blank cells
  4. Validate — Check cleaned data against known values

Let’s tackle each step with AI.

Step 1: Finding and Removing Duplicates

The problem: Your dataset has 2,000 rows but some entries appear multiple times with slight variations (“John Smith” vs “John Smith” with an extra space).

AI prompt:

“In Google Sheets, I have data in columns A through E (Name, Email, Phone, Date, Amount). Create a formula for column F that flags TRUE if this row is a duplicate based on the Email in column B (case-insensitive). Mark the first occurrence as FALSE and subsequent duplicates as TRUE.”

AI will generate something like:

=COUNTIF(B$1:B1, LOWER(B2)) > 0

After flagging duplicates, you can filter and review before deleting.

Tip: Always flag duplicates first instead of auto-deleting. Some “duplicates” are legitimate (e.g., two orders from the same customer).

Quick Check: Why should you flag duplicates before deleting them? (Answer: Some apparent duplicates may be legitimate entries — like repeat orders from the same customer — that you don’t want to remove.)

Step 2: Standardizing Formats

Format inconsistencies are the most common data problem. Here are AI prompts for the big three:

Dates

Prompt:

“Column A has dates in mixed formats: some are MM/DD/YYYY, some are DD-Mon-YYYY (like 15-Jan-2026), and some are just text like ‘January 15, 2026’. Create a formula for column B that converts all of these to YYYY-MM-DD format.”

Names

Prompt:

“Column A has names in inconsistent formats: ‘john smith’, ‘SMITH, JOHN’, ‘John Smith’, ‘smith,john’. Create a formula that standardizes all names to ‘First Last’ format with proper capitalization.”

AI typically uses a combination of PROPER(), TRIM(), SUBSTITUTE(), and text extraction functions.

Phone Numbers

Prompt:

“Column C has US phone numbers in mixed formats: (555) 123-4567, 555-123-4567, 5551234567, +1 555 123 4567. Create a formula that standardizes all to (XXX) XXX-XXXX format.”

Quick Check: When standardizing names, why is specifying the inconsistent formats in your prompt important? (Answer: AI needs to know what variations exist so it can generate logic that handles each case — a formula for fixing “SMITH, JOHN” is different from one fixing “john smith”.)

Step 3: Handling Missing Values

Empty cells break formulas. Here’s how AI helps you deal with them:

Option A — Fill with defaults:

“Column D has some blank cells. Create a formula for column E that copies the value from D if it exists, or shows ‘N/A’ if D is blank.”

Option B — Fill with calculated values:

“Column C has sales figures but some cells are blank. Create a formula that fills blanks with the average of the non-blank values in column C.”

Option C — Flag for review:

“Create a formula for column F that shows ‘MISSING’ if any cell in columns A through E for that row is blank, and ‘COMPLETE’ if all cells have values.”

StrategyWhen to UseAI Prompt Pattern
Fill with defaultNon-critical fields (notes, secondary info)“If blank, show [default value]”
Fill with calculationNumeric fields where estimation is acceptable“If blank, use the average/median of the column”
Flag for reviewCritical fields (IDs, amounts, dates)“Mark rows with any blank as INCOMPLETE”

Step 4: Validation

After cleaning, verify your work:

Prompt:

“I’ve cleaned data in columns A through E. Create formulas that check: (1) all dates in column A are valid dates in 2026, (2) all emails in column B contain an @ symbol, (3) all amounts in column C are positive numbers. Show PASS or FAIL for each check.”

This gives you a quality dashboard for your cleaned data.

Batch Cleaning with AI Chat

For one-off cleaning jobs, you don’t even need formulas. Copy the messy data directly into an AI chat:

Prompt:

“Here’s a list of 50 company names with inconsistencies. Clean them up: standardize capitalization, remove extra spaces, fix obvious typos, and return the cleaned list in the same order.”

Then paste the cleaned data back. This works great for datasets under a few hundred rows.

Practice Exercise

Try this with your own data or create a test dataset:

  1. Create 20 rows with intentional problems: mixed date formats, duplicate emails, inconsistent names, blank cells
  2. Ask AI to generate a duplicate-detection formula
  3. Ask AI for a date standardization formula
  4. Ask AI for a validation formula that checks all rows
  5. Compare the time it takes vs. fixing manually

Key Takeaways

  • Follow the cleaning order: duplicates → formats → blanks → validation
  • Use the same CONTEXT → TASK → CONSTRAINTS → OUTPUT framework for cleaning prompts
  • Always specify the exact inconsistencies you see — AI handles them better with examples
  • Flag issues before deleting — review first, then remove
  • For small datasets, paste directly into AI chat for instant cleaning

Up Next

In the next lesson, you’ll master the formulas that terrify most spreadsheet users: VLOOKUP, INDEX/MATCH, and XLOOKUP. With AI, these become as easy as describing what you want to find.

Knowledge Check

1. You have a column with dates in mixed formats (01/15/2026, January 15, 2026, 2026-01-15). What's the best AI prompt to fix this?

2. What's the correct order for a data cleaning workflow?

3. Which AI prompt would best help you find and fix inconsistent category names?

Answer all questions to check

Complete the quiz above first

Related Skills