Data Processing with pandas
Learn to automate data processing with Python and pandas — reading CSVs and Excel files, cleaning messy data, transforming columns, and generating reports with AI.
🔄 Recall Bridge: In the previous lesson, you built file automation scripts — organizers, renamers, and backup tools. Now let’s tackle the data inside those files: CSVs, Excel workbooks, and JSON that need cleaning, transforming, and reporting.
Data processing is where Python automation has the biggest impact. The same spreadsheet manipulation you do manually in 20 minutes every week can be automated to run in seconds — and it’ll be more accurate because scripts don’t get tired or make typos.
pandas: Your Data Automation Engine
pandas is the Python library for data manipulation. Install it:
pip install pandas openpyxl
Core pandas operations for automation:
| Task | pandas Code | What It Does |
|---|---|---|
| Read CSV | pd.read_csv("data.csv") | Load CSV into a DataFrame |
| Read Excel | pd.read_excel("data.xlsx", sheet_name="Sheet1") | Load specific sheet |
| Filter rows | df[df["status"] == "active"] | Keep only matching rows |
| Rename columns | df.rename(columns={"old": "new"}) | Clean up column names |
| Remove duplicates | df.drop_duplicates(subset=["email"]) | Deduplicate by column |
| Fill missing | df["email"].fillna("unknown") | Handle missing values |
| Group & aggregate | df.groupby("category")["amount"].sum() | Summary statistics |
| Save to Excel | df.to_excel("output.xlsx", index=False) | Export formatted |
Script 1: Monthly Report Processor
AI prompt:
Write a Python script using pandas that processes a monthly sales CSV: (1) Read the CSV, skip metadata rows (first 3 lines), (2) Clean column names: lowercase, replace spaces with underscores, (3) Convert the “amount” column: strip “$” and “,” and convert to float, (4) Parse the “date” column into datetime, (5) Filter out rows where “status” is “cancelled” or “test”, (6) Add a “month” column extracted from the date, (7) Generate a summary: total revenue, number of orders, average order value, top 5 customers by total spend, (8) Save cleaned data and summary to separate sheets in an Excel file. Accept input/output paths as arguments.
Script 2: Multi-File Data Merger
AI prompt:
Write a pandas script that merges data from multiple Excel files in a folder. (1) Read all .xlsx files in a specified folder, (2) Each file has the same columns but data from different months/regions, (3) Concatenate all files into one DataFrame, (4) Add a “source_file” column tracking which file each row came from, (5) Remove duplicate rows (same order_id across files), (6) Sort by date, (7) Save the combined data as both CSV and Excel. Print a summary: files processed, total rows, duplicates removed.
Common data merging scenarios:
| Scenario | pandas Function | Example |
|---|---|---|
| Stack same-format files | pd.concat([df1, df2, df3]) | Monthly reports into one |
| Join on shared column | pd.merge(df1, df2, on="customer_id") | Customers + Orders |
| Lookup values | df1.merge(df2[["id", "name"]], on="id") | Add names from a reference table |
Script 3: Data Quality Report
AI prompt:
Write a pandas script that generates a data quality report for any CSV file: (1) For each column: count null values, count unique values, data type, sample values, (2) Flag potential issues: columns with >10% missing values, columns that look like dates but are stored as strings, numeric columns stored as strings (contain “$” or “,”), duplicate rows, (3) Generate an HTML report with a table showing column stats and a list of issues found. This script should work on ANY CSV — don’t assume specific column names.
✅ Quick Check: Your script processes a CSV where some rows have commas inside quoted fields:
"Smith, John"in the name column. Willpd.read_csv()handle this correctly? (Answer: Yes — pandas handles quoted fields correctly by default. The standard CSV format uses quotes to escape commas within fields. However, if your CSV uses a non-standard delimiter or quoting, you may need to specifydelimiterandquotecharparameters. AI tip: if your CSV looks weird, paste the first 5 lines into AI and ask “What read_csv parameters do I need for this format?”)
Working with Excel Formatting
pandas can read and write Excel, but for formatted output (bold headers, conditional formatting), use openpyxl:
AI prompt for formatted Excel output:
Enhance my pandas script to save formatted Excel output using openpyxl: (1) Bold, blue headers with frozen panes, (2) Currency format ($#,##0.00) on amount columns, (3) Date format (YYYY-MM-DD) on date columns, (4) Auto-adjust column widths to fit content, (5) Add conditional formatting: red background for negative amounts, green for amounts over $1,000, (6) Add a summary row at the bottom with totals.
Key Takeaways
- pandas turns 20 minutes of manual Excel work into a 2-second script: reading, cleaning, transforming, and exporting data are all one-liners in pandas, and AI generates the complete pipeline when you describe your specific data and desired output
- Data cleaning is the highest-value automation target because the same messy data patterns appear in every report: missing values, inconsistent date formats, string-encoded numbers, and duplicate rows — build the cleaning pipeline once and it runs perfectly every time
- Be specific with AI about your data format: paste sample rows, describe column names, specify what “clean” means to you, and include edge cases (quoted commas, mixed date formats, special characters) — this produces scripts that work on your first try
Up Next
In the next lesson, you’ll learn web scraping — extracting structured data from websites using Python and AI to write the parsing logic.
Knowledge Check
Complete the quiz above first
Lesson completed!