Lesson 3 12 min

Data Processing with pandas

Learn to automate data processing with Python and pandas — reading CSVs and Excel files, cleaning messy data, transforming columns, and generating reports with AI.

🔄 Recall Bridge: In the previous lesson, you built file automation scripts — organizers, renamers, and backup tools. Now let’s tackle the data inside those files: CSVs, Excel workbooks, and JSON that need cleaning, transforming, and reporting.

Data processing is where Python automation has the biggest impact. The same spreadsheet manipulation you do manually in 20 minutes every week can be automated to run in seconds — and it’ll be more accurate because scripts don’t get tired or make typos.

pandas: Your Data Automation Engine

pandas is the Python library for data manipulation. Install it:

pip install pandas openpyxl

Core pandas operations for automation:

Task	pandas Code	What It Does
Read CSV	`pd.read_csv("data.csv")`	Load CSV into a DataFrame
Read Excel	`pd.read_excel("data.xlsx", sheet_name="Sheet1")`	Load specific sheet
Filter rows	`df[df["status"] == "active"]`	Keep only matching rows
Rename columns	`df.rename(columns={"old": "new"})`	Clean up column names
Remove duplicates	`df.drop_duplicates(subset=["email"])`	Deduplicate by column
Fill missing	`df["email"].fillna("unknown")`	Handle missing values
Group & aggregate	`df.groupby("category")["amount"].sum()`	Summary statistics
Save to Excel	`df.to_excel("output.xlsx", index=False)`	Export formatted

Script 1: Monthly Report Processor

AI prompt:

Write a Python script using pandas that processes a monthly sales CSV: (1) Read the CSV, skip metadata rows (first 3 lines), (2) Clean column names: lowercase, replace spaces with underscores, (3) Convert the “amount” column: strip “$” and “,” and convert to float, (4) Parse the “date” column into datetime, (5) Filter out rows where “status” is “cancelled” or “test”, (6) Add a “month” column extracted from the date, (7) Generate a summary: total revenue, number of orders, average order value, top 5 customers by total spend, (8) Save cleaned data and summary to separate sheets in an Excel file. Accept input/output paths as arguments.

Script 2: Multi-File Data Merger

AI prompt:

Write a pandas script that merges data from multiple Excel files in a folder. (1) Read all .xlsx files in a specified folder, (2) Each file has the same columns but data from different months/regions, (3) Concatenate all files into one DataFrame, (4) Add a “source_file” column tracking which file each row came from, (5) Remove duplicate rows (same order_id across files), (6) Sort by date, (7) Save the combined data as both CSV and Excel. Print a summary: files processed, total rows, duplicates removed.

Common data merging scenarios:

Scenario	pandas Function	Example
Stack same-format files	`pd.concat([df1, df2, df3])`	Monthly reports into one
Join on shared column	`pd.merge(df1, df2, on="customer_id")`	Customers + Orders
Lookup values	`df1.merge(df2[["id", "name"]], on="id")`	Add names from a reference table

Script 3: Data Quality Report

AI prompt:

Write a pandas script that generates a data quality report for any CSV file: (1) For each column: count null values, count unique values, data type, sample values, (2) Flag potential issues: columns with >10% missing values, columns that look like dates but are stored as strings, numeric columns stored as strings (contain “$” or “,”), duplicate rows, (3) Generate an HTML report with a table showing column stats and a list of issues found. This script should work on ANY CSV — don’t assume specific column names.

✅ Quick Check: Your script processes a CSV where some rows have commas inside quoted fields: "Smith, John" in the name column. Will pd.read_csv() handle this correctly? (Answer: Yes — pandas handles quoted fields correctly by default. The standard CSV format uses quotes to escape commas within fields. However, if your CSV uses a non-standard delimiter or quoting, you may need to specify delimiter and quotechar parameters. AI tip: if your CSV looks weird, paste the first 5 lines into AI and ask “What read_csv parameters do I need for this format?”)

Working with Excel Formatting

pandas can read and write Excel, but for formatted output (bold headers, conditional formatting), use openpyxl:

AI prompt for formatted Excel output:

Enhance my pandas script to save formatted Excel output using openpyxl: (1) Bold, blue headers with frozen panes, (2) Currency format ($#,##0.00) on amount columns, (3) Date format (YYYY-MM-DD) on date columns, (4) Auto-adjust column widths to fit content, (5) Add conditional formatting: red background for negative amounts, green for amounts over $1,000, (6) Add a summary row at the bottom with totals.

Key Takeaways

pandas turns 20 minutes of manual Excel work into a 2-second script: reading, cleaning, transforming, and exporting data are all one-liners in pandas, and AI generates the complete pipeline when you describe your specific data and desired output
Data cleaning is the highest-value automation target because the same messy data patterns appear in every report: missing values, inconsistent date formats, string-encoded numbers, and duplicate rows — build the cleaning pipeline once and it runs perfectly every time
Be specific with AI about your data format: paste sample rows, describe column names, specify what “clean” means to you, and include edge cases (quoted commas, mixed date formats, special characters) — this produces scripts that work on your first try

Up Next

In the next lesson, you’ll learn web scraping — extracting structured data from websites using Python and AI to write the parsing logic.

Knowledge Check

1. You receive a monthly sales report as a CSV file. Every month, you manually open it in Excel, delete the first 3 header rows, rename columns, filter out test orders, and save it as a formatted Excel file. This takes 20 minutes. You want to automate it. What's the most effective AI prompt?

'Write a Python script to clean a CSV file' — keep it simple and let AI figure out the details Be specific about every step: 'Write a Python script using pandas that: (1) Reads a CSV file, skipping the first 3 rows (header=3), (2) Renames columns: "Order ID" → "order_id", "Customer Name" → "customer", "Total $" → "amount", (3) Removes rows where the "customer" column contains "TEST" or "test", (4) Converts the "amount" column to float (remove $ signs and commas), (5) Saves to Excel with formatting: bold headers, currency format on amount column, auto-adjusted column widths. Accept input CSV path and output Excel path as command-line arguments.' This prompt produces a working script on the first try because AI knows exactly what you need 'Automate my monthly sales report processing' — AI should understand what processing means

2. Your script reads a CSV with 50,000 rows. Some rows have missing values in the 'email' column, some have dates in different formats ('2026-01-15', '01/15/2026', 'Jan 15, 2026'), and some prices have dollar signs. How should you handle this with AI?

Ask AI to handle each problem separately in different scripts Describe all the data quality issues in one prompt and let AI build a cleaning pipeline. Prompt: 'Add data cleaning to my script: (1) For missing emails: fill with "no-email@unknown.com" and flag these rows in a new column "email_missing" = True. (2) For dates: parse all three formats (YYYY-MM-DD, MM/DD/YYYY, Mon DD, YYYY) into a standard YYYY-MM-DD format using pd.to_datetime with infer_datetime_format. (3) For prices: strip $ and commas, convert to float. (4) Add a data quality summary at the end: how many rows had missing emails, unparseable dates, or invalid prices.' Building one cleaning pipeline is more maintainable than separate scripts, and pandas is designed to chain these operations efficiently Clean the data manually in Excel first, then feed the clean data to your script

3. You need to merge data from 3 Excel files — one has customer info, one has orders, one has shipping data. Each file uses a different column name for the customer ID: 'CustomerID', 'cust_id', and 'customer_number'. How do you merge them?

Manually rename the columns in Excel first to make them consistent, then merge Use pandas to rename and merge in one script. AI prompt: 'Write a pandas script that: (1) Reads three Excel files: customers.xlsx, orders.xlsx, shipping.xlsx, (2) Renames the join column to a consistent name — customers.xlsx has "CustomerID", orders.xlsx has "cust_id", shipping.xlsx has "customer_number" — rename all to "customer_id", (3) Merges all three on "customer_id" using left joins (keep all customers even if they have no orders), (4) Flags unmatched records: add columns "has_orders" and "has_shipping" (True/False), (5) Saves the merged result with a summary sheet showing: total customers, customers with orders, customers with shipping data.' Pandas merge operations are fast (50,000 rows in seconds) and the script runs identically every month Import all three files into a database and use SQL to join them

Answer all questions to check

Complete the quiz above first

Related Skills

Workflow Automator Python Packaging