Lesson 3 12 min

Data Processing with pandas

Learn to automate data processing with Python and pandas — reading CSVs and Excel files, cleaning messy data, transforming columns, and generating reports with AI.

🔄 Recall Bridge: In the previous lesson, you built file automation scripts — organizers, renamers, and backup tools. Now let’s tackle the data inside those files: CSVs, Excel workbooks, and JSON that need cleaning, transforming, and reporting.

Data processing is where Python automation has the biggest impact. The same spreadsheet manipulation you do manually in 20 minutes every week can be automated to run in seconds — and it’ll be more accurate because scripts don’t get tired or make typos.

pandas: Your Data Automation Engine

pandas is the Python library for data manipulation. Install it:

pip install pandas openpyxl

Core pandas operations for automation:

Taskpandas CodeWhat It Does
Read CSVpd.read_csv("data.csv")Load CSV into a DataFrame
Read Excelpd.read_excel("data.xlsx", sheet_name="Sheet1")Load specific sheet
Filter rowsdf[df["status"] == "active"]Keep only matching rows
Rename columnsdf.rename(columns={"old": "new"})Clean up column names
Remove duplicatesdf.drop_duplicates(subset=["email"])Deduplicate by column
Fill missingdf["email"].fillna("unknown")Handle missing values
Group & aggregatedf.groupby("category")["amount"].sum()Summary statistics
Save to Exceldf.to_excel("output.xlsx", index=False)Export formatted

Script 1: Monthly Report Processor

AI prompt:

Write a Python script using pandas that processes a monthly sales CSV: (1) Read the CSV, skip metadata rows (first 3 lines), (2) Clean column names: lowercase, replace spaces with underscores, (3) Convert the “amount” column: strip “$” and “,” and convert to float, (4) Parse the “date” column into datetime, (5) Filter out rows where “status” is “cancelled” or “test”, (6) Add a “month” column extracted from the date, (7) Generate a summary: total revenue, number of orders, average order value, top 5 customers by total spend, (8) Save cleaned data and summary to separate sheets in an Excel file. Accept input/output paths as arguments.

Script 2: Multi-File Data Merger

AI prompt:

Write a pandas script that merges data from multiple Excel files in a folder. (1) Read all .xlsx files in a specified folder, (2) Each file has the same columns but data from different months/regions, (3) Concatenate all files into one DataFrame, (4) Add a “source_file” column tracking which file each row came from, (5) Remove duplicate rows (same order_id across files), (6) Sort by date, (7) Save the combined data as both CSV and Excel. Print a summary: files processed, total rows, duplicates removed.

Common data merging scenarios:

Scenariopandas FunctionExample
Stack same-format filespd.concat([df1, df2, df3])Monthly reports into one
Join on shared columnpd.merge(df1, df2, on="customer_id")Customers + Orders
Lookup valuesdf1.merge(df2[["id", "name"]], on="id")Add names from a reference table

Script 3: Data Quality Report

AI prompt:

Write a pandas script that generates a data quality report for any CSV file: (1) For each column: count null values, count unique values, data type, sample values, (2) Flag potential issues: columns with >10% missing values, columns that look like dates but are stored as strings, numeric columns stored as strings (contain “$” or “,”), duplicate rows, (3) Generate an HTML report with a table showing column stats and a list of issues found. This script should work on ANY CSV — don’t assume specific column names.

Quick Check: Your script processes a CSV where some rows have commas inside quoted fields: "Smith, John" in the name column. Will pd.read_csv() handle this correctly? (Answer: Yes — pandas handles quoted fields correctly by default. The standard CSV format uses quotes to escape commas within fields. However, if your CSV uses a non-standard delimiter or quoting, you may need to specify delimiter and quotechar parameters. AI tip: if your CSV looks weird, paste the first 5 lines into AI and ask “What read_csv parameters do I need for this format?”)

Working with Excel Formatting

pandas can read and write Excel, but for formatted output (bold headers, conditional formatting), use openpyxl:

AI prompt for formatted Excel output:

Enhance my pandas script to save formatted Excel output using openpyxl: (1) Bold, blue headers with frozen panes, (2) Currency format ($#,##0.00) on amount columns, (3) Date format (YYYY-MM-DD) on date columns, (4) Auto-adjust column widths to fit content, (5) Add conditional formatting: red background for negative amounts, green for amounts over $1,000, (6) Add a summary row at the bottom with totals.

Key Takeaways

  • pandas turns 20 minutes of manual Excel work into a 2-second script: reading, cleaning, transforming, and exporting data are all one-liners in pandas, and AI generates the complete pipeline when you describe your specific data and desired output
  • Data cleaning is the highest-value automation target because the same messy data patterns appear in every report: missing values, inconsistent date formats, string-encoded numbers, and duplicate rows — build the cleaning pipeline once and it runs perfectly every time
  • Be specific with AI about your data format: paste sample rows, describe column names, specify what “clean” means to you, and include edge cases (quoted commas, mixed date formats, special characters) — this produces scripts that work on your first try

Up Next

In the next lesson, you’ll learn web scraping — extracting structured data from websites using Python and AI to write the parsing logic.

Knowledge Check

1. You receive a monthly sales report as a CSV file. Every month, you manually open it in Excel, delete the first 3 header rows, rename columns, filter out test orders, and save it as a formatted Excel file. This takes 20 minutes. You want to automate it. What's the most effective AI prompt?

2. Your script reads a CSV with 50,000 rows. Some rows have missing values in the 'email' column, some have dates in different formats ('2026-01-15', '01/15/2026', 'Jan 15, 2026'), and some prices have dollar signs. How should you handle this with AI?

3. You need to merge data from 3 Excel files — one has customer info, one has orders, one has shipping data. Each file uses a different column name for the customer ID: 'CustomerID', 'cust_id', and 'customer_number'. How do you merge them?

Answer all questions to check

Complete the quiz above first

Related Skills