Query Optimization and Performance
Find and fix slow queries with AI — read execution plans, add strategic indexes, rewrite inefficient queries, and keep your database running fast under load.
🔄 Quick Recall: In the last lesson, you cleaned messy data with AI — profiling, standardizing, deduplicating, and validating. Now let’s make sure the queries running against that clean data are fast.
When Queries Get Slow
Slow queries usually announce themselves: a dashboard that takes 30 seconds to load, a report that times out, or an API endpoint that makes users wait. The cause is rarely the database itself — it’s almost always the query or missing indexes.
AI is exceptionally good at optimization because it can read execution plans and spot patterns that take humans years of experience to recognize.
Reading Execution Plans with AI
The EXPLAIN ANALYZE command is your diagnostic tool:
```sql
-- PostgreSQL
EXPLAIN ANALYZE
SELECT c.name, COUNT(o.id) AS order_count, SUM(o.total) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.order_date >= '2025-01-01'  -- note: filtering the right-hand table here makes the LEFT JOIN behave like an inner join
GROUP BY c.name
ORDER BY total_spent DESC
LIMIT 20;
```
The output looks cryptic. Paste it into AI:
```
Here's the EXPLAIN ANALYZE output for a slow query on PostgreSQL:

[paste the full execution plan]

1. What's the biggest performance bottleneck?
2. Is the database doing any sequential scans it shouldn't?
3. Are the row estimates accurate or wildly off?
4. What indexes would help this query?
5. Can you rewrite the query to be faster?
```
✅ Quick Check: Why does it matter if the database’s row estimates are “wildly off” in the execution plan?
Because the query planner chooses its strategy based on estimated row counts. If it estimates 100 rows but the actual result is 100,000, it might choose a nested loop join (efficient for small sets) when a hash join (efficient for large sets) would be dramatically faster. Inaccurate estimates often mean your table statistics are stale — running ANALYZE on the table fixes this.
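Refreshing stale statistics is a one-liner. A quick sketch, using this lesson's `orders` table as the example:

```sql
-- Recompute planner statistics for one table
ANALYZE orders;

-- Check when statistics were last refreshed (manually or by autovacuum)
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
```

Re-run EXPLAIN ANALYZE afterward and compare the planner's `rows=` estimates against the `actual rows=` counts; they should now be much closer.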
Strategic Indexing
Not all indexes are equal. AI can help you choose wisely:
```
Database: PostgreSQL
Table: orders (15 million rows)
Columns: id, customer_id, product_id, total, order_date, status, region, created_at

Most frequent queries:
1. SELECT * FROM orders WHERE customer_id = ? AND order_date >= ?
2. SELECT region, SUM(total) FROM orders WHERE status = 'completed' GROUP BY region
3. SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at
4. SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id HAVING COUNT(*) > 10

Currently no indexes except the primary key.

Recommend indexes for these queries. For each:
1. The CREATE INDEX statement
2. Which query it helps and why
3. The trade-off (how much it slows writes)
4. Whether a partial index or composite index would be better
```
Key indexing principles:
- Composite indexes: Put the equality column first, range column second. `(customer_id, order_date)` helps query #1; `(order_date, customer_id)` does not.
- Partial indexes: If you only query pending orders, `CREATE INDEX ON orders (created_at) WHERE status = 'pending'` is smaller and faster than indexing all statuses.
- Don't over-index: Each index slows INSERT/UPDATE. If a table gets 10,000 writes per second, adding 5 indexes creates serious overhead.
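Applying these principles to the example workload above might look like the following sketch (index names are illustrative, and the `INCLUDE` clause requires PostgreSQL 11+):

```sql
-- Query #1: equality on customer_id first, range on order_date second
CREATE INDEX idx_orders_customer_date
    ON orders (customer_id, order_date);

-- Query #3: partial index covering only pending orders, pre-sorted by created_at
CREATE INDEX idx_orders_pending_created
    ON orders (created_at)
    WHERE status = 'pending';

-- Query #2: partial index on completed orders; INCLUDE (total) lets
-- SUM(total) be answered from the index alone (an index-only scan)
CREATE INDEX idx_orders_completed_region
    ON orders (region) INCLUDE (total)
    WHERE status = 'completed';
```

Note how each index mirrors one query's WHERE and GROUP BY shape rather than indexing columns in isolation.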
Common Optimization Patterns
AI consistently suggests these proven patterns:
Replace SELECT * with specific columns:
```sql
-- Slow (reads all columns from disk)
SELECT * FROM orders WHERE customer_id = 123;

-- Fast (reads only needed columns, can use covering index)
SELECT id, total, order_date FROM orders WHERE customer_id = 123;
```
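The "covering index" mentioned above can be made explicit. A sketch, assuming PostgreSQL 11+ and an illustrative index name:

```sql
-- The index itself supplies total and order_date, so PostgreSQL can
-- answer the query with an index-only scan and never touch the table heap
CREATE INDEX idx_orders_customer_cover
    ON orders (customer_id) INCLUDE (total, order_date);
```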
Move filtering from HAVING to WHERE:
```sql
-- Slow (groups all rows, then filters)
SELECT customer_id, SUM(total) FROM orders
GROUP BY customer_id HAVING customer_id IN (1, 2, 3);

-- Fast (filters first, then groups fewer rows)
SELECT customer_id, SUM(total) FROM orders
WHERE customer_id IN (1, 2, 3) GROUP BY customer_id;
```
Use EXISTS instead of IN for subqueries:
```sql
-- Slow with large subquery results
SELECT * FROM customers WHERE id IN (SELECT customer_id FROM orders);

-- Fast (stops at first match per row)
SELECT * FROM customers c WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
```
✅ Quick Check: Why is EXISTS faster than IN when the subquery returns many rows?
Because IN materializes the full subquery result (potentially millions of values) and checks each against the outer row. EXISTS stops at the first match — once it finds one order for a customer, it moves on. For the question “does this customer have any orders?” EXISTS does minimal work per row, while IN does maximum work.
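The same short-circuit logic works in reverse for anti-joins. A sketch using this lesson's schema:

```sql
-- Customers with no orders: NOT EXISTS stops scanning the
-- subquery as soon as a single matching order is found
SELECT c.name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
```

Prefer NOT EXISTS over NOT IN here: if the subquery can return a NULL `customer_id`, `NOT IN` returns no rows at all, which is a classic silent bug.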
Monitoring Ongoing Performance
Optimization isn’t a one-time task. Set up ongoing monitoring:
```
I'm using PostgreSQL. Help me set up performance monitoring:

1. A query to find the slowest queries in the last 24 hours (using pg_stat_statements)
2. A query to identify tables that need VACUUM or ANALYZE
3. A query to find unused indexes (wasting write performance)
4. A query to find missing indexes (tables with high sequential scan ratios)
5. A weekly maintenance routine I should follow
```
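As a starting point, items 1 and 3 might look like the sketch below. It assumes the `pg_stat_statements` extension is installed and uses PostgreSQL 13+ column names (older versions call the column `total_time`):

```sql
-- 1. Top 10 queries by total execution time
SELECT query,
       calls,
       round((total_exec_time / calls)::numeric, 2) AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- 3. Indexes never used since the last statistics reset
SELECT relname, indexrelname
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
```

Treat the "unused index" list with care: statistics reset on restart, so confirm an index is truly idle over a long window before dropping it.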
Exercise: Optimize a Slow Query
Take a slow query from your own work (or create one using the orders schema from this lesson):
- Run EXPLAIN ANALYZE and paste the output into AI
- Get AI’s analysis: what’s the bottleneck?
- Implement the suggested optimization (index, rewrite, or both)
- Run EXPLAIN ANALYZE again and compare the results
- Document the before/after execution times
Key Takeaways
- EXPLAIN ANALYZE is your diagnostic tool — paste execution plans into AI for instant, expert-level analysis
- Strategic indexing considers query patterns, column cardinality, and write overhead — not every column needs an index
- Composite indexes must match your query patterns: equality columns first, range columns second
- Common quick wins: replace SELECT *, move filtering to WHERE, use EXISTS instead of IN for large subqueries
- Stale table statistics cause the query planner to choose wrong strategies — run ANALYZE regularly
- Ongoing monitoring catches slow queries before they become user complaints
Up Next: In the next lesson, you’ll build reports and dashboards with AI — turning your optimized queries into business intelligence that stakeholders actually use.