AI Performance Monitoring
Build AI-powered database performance monitoring — baseline establishment, slow query detection, anomaly alerts, capacity planning, and proactive optimization that prevents outages.
🔄 Quick Recall: In the previous lesson, you built safe migration workflows. Now you’ll build the monitoring systems that detect performance problems before they become outages — because the best optimization is the one you make before users notice.
Database performance degrades gradually, then suddenly. A query that worked fine at 100K rows crawls at 1M. A table that was fast to write becomes slow as indexes multiply. Connection pools that were sufficient at launch exhaust at 2× traffic. AI monitoring establishes baselines, detects deviations, and explains causes — turning reactive firefighting into proactive optimization.
Baseline Establishment
AI prompt for performance baseline:
Establish a performance baseline for my database. Database: [ENGINE AND VERSION]. Current metrics: CPU usage: [%], memory usage: [%], disk I/O: [MB/s], active connections: [NUMBER], queries per second: [QPS]. Top 10 slowest queries: [LIST WITH EXECUTION TIMES]. Table sizes: [LIST LARGEST TABLES WITH ROW COUNTS]. For each metric: (1) establish the normal range (mean ± 2 standard deviations), (2) define warning threshold (approaching concern), (3) define critical threshold (requires immediate action), (4) identify the most likely cause when this metric spikes. Create a monitoring dashboard specification with these metrics, thresholds, and alerting rules.
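The statistical core of that prompt can be sketched directly. This is a minimal illustration (the hourly CPU samples are made up, and the 2σ/3σ multipliers are illustrative assumptions, not fixed rules) of computing a normal range as mean ± 2 standard deviations and deriving warning and critical thresholds from it:

```python
import statistics

def establish_baseline(samples, warning_sigma=2, critical_sigma=3):
    """Compute a normal range and alert thresholds from historical samples.

    Normal range is mean +/- 2 standard deviations, matching the prompt above.
    The warning/critical sigma multipliers are illustrative defaults.
    """
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return {
        "mean": mean,
        "normal_range": (mean - 2 * stdev, mean + 2 * stdev),
        "warning": mean + warning_sigma * stdev,
        "critical": mean + critical_sigma * stdev,
    }

# Hypothetical hourly CPU-usage samples (percent) from a quiet week
cpu_samples = [28, 31, 30, 33, 29, 35, 27, 32, 30, 34, 31, 29]
baseline = establish_baseline(cpu_samples)
```

In practice you would feed this per-metric from your monitoring store rather than a hand-typed list, and recompute the baseline on a rolling window so it tracks gradual data growth.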
Key database metrics to monitor:
| Metric | Normal | Warning | Critical | Likely Cause |
|---|---|---|---|---|
| CPU usage | 20-40% | > 60% | > 85% | Bad query, missing index, lock contention |
| Memory usage | 60-80% | > 85% | > 95% | Buffer pool too small, memory leak, too many connections |
| Disk I/O | Baseline + 20% | Baseline + 50% | Baseline + 100% | Full table scans, backup running, large sort operations |
| Active connections | Pool size × 60% | Pool size × 80% | Pool size × 95% | Connection leak, slow queries holding connections |
| Query latency (p95) | < 100ms | > 500ms | > 2s | Missing index, data growth, lock contention |
| Replication lag | < 1s | > 5s | > 30s | Heavy writes, network issues, replica capacity |
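As a sketch, one row of the table can be encoded as an alert rule. The pool-size ratios below come straight from the active-connections row; the function name and return labels are illustrative:

```python
def classify_connections(active, pool_size):
    """Classify an active-connection count against the table's pool-size ratios."""
    ratio = active / pool_size
    if ratio >= 0.95:
        # Likely cause per the table: connection leak, or slow queries holding connections
        return "critical"
    if ratio >= 0.80:
        return "warning"
    return "normal"

assert classify_connections(50, 100) == "normal"
assert classify_connections(85, 100) == "warning"
assert classify_connections(97, 100) == "critical"
```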
Slow Query Monitoring
AI prompt for slow query analysis:
Analyze my database slow query log for the past [TIME PERIOD]. Log data: [PASTE OR DESCRIBE — queries, execution times, frequency]. Perform: (1) Top 10 queries by total time (frequency × average execution time) — these are your optimization priorities, (2) queries that recently got slower — potential plan regressions or data growth issues, (3) queries that appear in bursts — potential batch jobs or N+1 patterns, (4) queries with high variance (sometimes 50ms, sometimes 5s) — potential lock contention or resource competition. For each flagged query: the specific optimization recommendation with expected improvement.
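The first analysis in that prompt, ranking by total time, is simple enough to sketch. The log rows below are hypothetical; note how the frequent 800ms query outranks the rare 5-second one:

```python
def prioritize_queries(stats):
    """Rank queries by total daily cost (calls x average ms), highest first."""
    return sorted(stats, key=lambda q: q["calls"] * q["avg_ms"], reverse=True)

# Hypothetical slow-log summary rows
slow_log = [
    {"query": "SELECT ... FROM events ...",  "calls": 500,   "avg_ms": 800},
    {"query": "SELECT ... FROM reports ...", "calls": 2,     "avg_ms": 5000},
    {"query": "UPDATE sessions ...",         "calls": 10000, "avg_ms": 15},
]
ranked = prioritize_queries(slow_log)
# events: 400,000 ms/day; sessions: 150,000 ms/day; reports: 10,000 ms/day
```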
✅ Quick Check: Your slow query log shows one query consuming 40% of total database time:
`SELECT * FROM events WHERE user_id = ? AND created_at > ? ORDER BY created_at DESC LIMIT 50`. It runs 500 times/day averaging 800ms. What’s the first thing to check? (Answer: Does a composite index exist on (user_id, created_at)? This query filters on user_id and sorts by created_at — a composite index in that order serves both operations. Without it, the database scans all events for a user, then sorts. With it, the database reads the 50 most recent events directly from the index. Expected improvement: 800ms → 5-20ms.)
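You can verify the quick-check answer empirically. Here SQLite (in memory) stands in for the production engine — the principle that a composite index serves both the filter and the sort is the same — and the table and index names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INT, created_at TEXT)"
)

def plan(sql):
    """Return the query plan details as a single string."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = ("SELECT * FROM events WHERE user_id = 1 "
         "AND created_at > '2024-01-01' ORDER BY created_at DESC LIMIT 50")

before = plan(query)  # without the index: a table scan plus a sort step

conn.execute(
    "CREATE INDEX idx_events_user_created ON events (user_id, created_at)"
)
after = plan(query)   # with it: the index satisfies both the filter and the ordering
```

Running `EXPLAIN QUERY PLAN` before and after shows the plan change from a scan to a search on `idx_events_user_created` — the same check you would do with `EXPLAIN` on PostgreSQL or MySQL.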
Anomaly Detection
AI prompt for anomaly alerting:
Design an anomaly detection system for my database monitoring. Metrics: [LIST YOUR KEY METRICS WITH NORMAL BASELINES]. Create alerting rules that: (1) detect deviations from baseline (not just threshold breaches) — a database normally at 30% CPU alerting at 50% is more useful than a fixed 80% threshold, (2) correlate metrics — CPU spike + slow query log entries suggests a query problem, CPU spike + connection spike suggests a traffic problem, (3) suppress false positives — scheduled backups, known maintenance windows, and expected traffic patterns (Monday morning spike) shouldn’t trigger alerts, (4) escalate appropriately — warning alerts go to the on-call channel, critical alerts page the DBA. Define the alert message format: what metric deviated, the current and baseline values, correlated events, and suggested investigation steps.
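A minimal sketch of baseline-relative detection with suppression, assuming illustrative z-score thresholds (2σ warning, 3σ critical) and made-up baselines. Note how a database normally at 30% CPU fires at 50% — long before a fixed 80% threshold would:

```python
def detect_anomaly(metric, value, baselines, suppressed=()):
    """Flag a value that deviates from its baseline, skipping suppressed metrics.

    baselines maps metric name -> (mean, stdev). Deviation beyond 2 stdevs
    is a warning, beyond 3 is critical (illustrative thresholds).
    """
    if metric in suppressed:  # e.g. a scheduled backup or maintenance window
        return None
    mean, stdev = baselines[metric]
    z = (value - mean) / stdev
    if abs(z) >= 3:
        return ("critical", round(z, 1))
    if abs(z) >= 2:
        return ("warning", round(z, 1))
    return None

baselines = {"cpu_pct": (30, 5), "connections": (60, 10)}
assert detect_anomaly("cpu_pct", 50, baselines) == ("critical", 4.0)
assert detect_anomaly("cpu_pct", 41, baselines) == ("warning", 2.2)
assert detect_anomaly("cpu_pct", 50, baselines, suppressed={"cpu_pct"}) is None
```

A fuller system would also correlate flags across metrics (CPU plus slow-log entries versus CPU plus connections), as the prompt describes; this sketch covers only the per-metric deviation step.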
Capacity Planning
AI prompt for capacity forecasting:
Create a capacity plan for my database. Current state: storage: [USED/TOTAL], CPU: [AVERAGE %], memory: [USED/TOTAL], connections: [AVERAGE/MAX], QPS: [CURRENT]. Growth trends: [MONTHLY DATA GROWTH, TRAFFIC GROWTH RATE]. Forecast: (1) when each resource will hit warning and critical thresholds at current growth, (2) when each resource will hit thresholds if growth accelerates by 50% (new feature launch, marketing push), (3) recommended actions with timelines — which resources need attention first and when, (4) cost-effective scaling options — vertical (bigger server) vs. horizontal (read replicas, sharding) vs. optimization (reduce data with archiving, reduce load with caching). Present a 6-month forecast with monthly checkpoints.
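The forecasting arithmetic in steps (1) and (2) can be sketched with compound monthly growth. The figures below are hypothetical — 400 GB used on a 1 TB volume, with the warning threshold at 70% (700 GB):

```python
def months_until(current, limit, monthly_growth):
    """Months until a resource crosses a limit at compound monthly growth."""
    months = 0
    while current < limit and months < 120:  # cap at 10 years
        current *= 1 + monthly_growth
        months += 1
    return months

base = months_until(400, 700, 0.05)           # current growth: 5%/month
accelerated = months_until(400, 700, 0.075)   # growth accelerates by 50%
```

With these assumed numbers, a 50% acceleration in growth pulls the warning date in by several months — which is exactly why the prompt asks for both scenarios rather than a single linear projection.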
Key Takeaways
- Baseline-relative monitoring (alert when CPU rises from normal 30% to 50%) catches issues far earlier than fixed-threshold monitoring (alert when CPU hits 80%) — AI establishes baselines and detects deviations that indicate emerging problems
- When a query suddenly slows without code changes, the cause is usually stale optimizer statistics, data growth, or resource contention — AI checks each systematically instead of guessing, and running ANALYZE on affected tables fixes the most common cause
- Slow query prioritization by total cost (frequency × execution time) focuses optimization on the queries that consume the most database resources — one 800ms query running 500 times/day costs more than one 5-second query running twice
- Capacity planning should begin once a resource reaches 70% utilization, accounting for growth acceleration, operational overhead (15-20% for temp tables and migrations), and one-time events — AI projects realistic timelines that simple linear projections miss
- Metric correlation (CPU spike + new query pattern, or connection spike + slow queries) identifies root causes faster than looking at individual metrics — AI connects the “what” (metric anomaly) to the “why” (specific query or event)
Up Next
In the next lesson, you’ll build backup and recovery strategies — verified backup procedures, point-in-time recovery, and the disaster recovery plans that protect your data.