AI Performance Monitoring
Build AI-powered database performance monitoring — baseline establishment, slow query detection, anomaly alerts, capacity planning, and proactive optimization that prevents outages.
🔄 Quick Recall: In the previous lesson, you built safe migration workflows. Now you’ll build the monitoring systems that detect performance problems before they become outages — because the best optimization is the one you make before users notice.
Database performance degrades gradually, then suddenly. A query that worked fine at 100K rows crawls at 1M. A table that was fast to write becomes slow as indexes multiply. Connection pools that were sufficient at launch exhaust at 2× traffic. AI monitoring establishes baselines, detects deviations, and explains causes — turning reactive firefighting into proactive optimization.
Baseline Establishment
AI prompt for performance baseline:
Establish a performance baseline for my database. Database: [ENGINE AND VERSION]. Current metrics: CPU usage: [%], memory usage: [%], disk I/O: [MB/s], active connections: [NUMBER], queries per second: [QPS]. Top 10 slowest queries: [LIST WITH EXECUTION TIMES]. Table sizes: [LIST LARGEST TABLES WITH ROW COUNTS]. For each metric: (1) establish the normal range (mean ± 2 standard deviations), (2) define warning threshold (approaching concern), (3) define critical threshold (requires immediate action), (4) identify the most likely cause when this metric spikes. Create a monitoring dashboard specification with these metrics, thresholds, and alerting rules.
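The statistical core of that prompt can be sketched directly. This is a minimal illustration (the hourly CPU samples are made up, and the 2σ/3σ multipliers are illustrative assumptions, not fixed rules) of computing a normal range as mean ± 2 standard deviations and deriving warning and critical thresholds from it:

```python
import statistics

def establish_baseline(samples, warning_sigma=2, critical_sigma=3):
    """Compute a normal range and alert thresholds from historical samples.

    Normal range is mean +/- 2 standard deviations, matching the prompt above.
    The warning/critical sigma multipliers are illustrative defaults.
    """
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return {
        "mean": mean,
        "normal_range": (mean - 2 * stdev, mean + 2 * stdev),
        "warning": mean + warning_sigma * stdev,
        "critical": mean + critical_sigma * stdev,
    }

# Hypothetical hourly CPU-usage samples (percent) from a quiet week
cpu_samples = [28, 31, 30, 33, 29, 35, 27, 32, 30, 34, 31, 29]
baseline = establish_baseline(cpu_samples)
```

In practice you would feed this per-metric from your monitoring store rather than a hand-typed list, and recompute the baseline on a rolling window so it tracks gradual data growth.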
Key database metrics to monitor:
| Metric | Normal | Warning | Critical | Likely Cause |
|---|---|---|---|---|
| CPU usage | 20-40% | > 60% | > 85% | Bad query, missing index, lock contention |
| Memory usage | 60-80% | > 85% | > 95% | Buffer pool too small, memory leak, too many connections |
| Disk I/O | Baseline + 20% | Baseline + 50% | Baseline + 100% | Full table scans, backup running, large sort operations |
| Active connections | Pool size × 60% | Pool size × 80% | Pool size × 95% | Connection leak, slow queries holding connections |
| Query latency (p95) | < 100ms | > 500ms | > 2s | Missing index, data growth, lock contention |
| Replication lag | < 1s | > 5s | > 30s | Heavy writes, network issues, replica capacity |
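As a sketch, one row of the table can be encoded as an alert rule. The pool-size ratios below come straight from the active-connections row; the function name and return labels are illustrative:

```python
def classify_connections(active, pool_size):
    """Classify an active-connection count against the table's pool-size ratios."""
    ratio = active / pool_size
    if ratio >= 0.95:
        # Likely cause per the table: connection leak, or slow queries holding connections
        return "critical"
    if ratio >= 0.80:
        return "warning"
    return "normal"

assert classify_connections(50, 100) == "normal"
assert classify_connections(85, 100) == "warning"
assert classify_connections(97, 100) == "critical"
```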
Slow Query Monitoring
AI prompt for slow query analysis:
Analyze my database slow query log for the past [TIME PERIOD]. Log data: [PASTE OR DESCRIBE — queries, execution times, frequency]. Perform: (1) Top 10 queries by total time (frequency × average execution time) — these are your optimization priorities, (2) queries that recently got slower — potential plan regressions or data growth issues, (3) queries that appear in bursts — potential batch jobs or N+1 patterns, (4) queries with high variance (sometimes 50ms, sometimes 5s) — potential lock contention or resource competition. For each flagged query: the specific optimization recommendation with expected improvement.
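The first analysis in that prompt, ranking by total time, is simple enough to sketch. The log rows below are hypothetical; note how the frequent 800ms query outranks the rare 5-second one:

```python
def prioritize_queries(stats):
    """Rank queries by total daily cost (calls x average ms), highest first."""
    return sorted(stats, key=lambda q: q["calls"] * q["avg_ms"], reverse=True)

# Hypothetical slow-log summary rows
slow_log = [
    {"query": "SELECT ... FROM events ...",  "calls": 500,   "avg_ms": 800},
    {"query": "SELECT ... FROM reports ...", "calls": 2,     "avg_ms": 5000},
    {"query": "UPDATE sessions ...",         "calls": 10000, "avg_ms": 15},
]
ranked = prioritize_queries(slow_log)
# events: 400,000 ms/day; sessions: 150,000 ms/day; reports: 10,000 ms/day
```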
✅ Quick Check: Your slow query log shows one query consuming 40% of total database time:
`SELECT * FROM events WHERE user_id = ? AND created_at > ? ORDER BY created_at DESC LIMIT 50`. It runs 500 times/day averaging 800ms. What’s the first thing to check? (Answer: Does a composite index exist on (user_id, created_at)? This query filters on user_id and sorts by created_at — a composite index in that order serves both operations. Without it, the database scans all events for a user, then sorts. With it, the database reads the 50 most recent events directly from the index. Expected improvement: 800ms → 5-20ms.)
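You can verify the quick-check answer empirically. Here SQLite (in memory) stands in for the production engine — the principle that a composite index serves both the filter and the sort is the same — and the table and index names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INT, created_at TEXT)"
)

def plan(sql):
    """Return the query plan details as a single string."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = ("SELECT * FROM events WHERE user_id = 1 "
         "AND created_at > '2024-01-01' ORDER BY created_at DESC LIMIT 50")

before = plan(query)  # without the index: a table scan plus a sort step

conn.execute(
    "CREATE INDEX idx_events_user_created ON events (user_id, created_at)"
)
after = plan(query)   # with it: the index satisfies both the filter and the ordering
```

Running `EXPLAIN QUERY PLAN` before and after shows the plan change from a scan to a search on `idx_events_user_created` — the same check you would do with `EXPLAIN` on PostgreSQL or MySQL.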
Anomaly Detection
AI prompt for anomaly alerting:
Design an anomaly detection system for my database monitoring. Metrics: [LIST YOUR KEY METRICS WITH NORMAL BASELINES]. Create alerting rules that: (1) detect deviations from baseline (not just threshold breaches) — a database normally at 30% CPU alerting at 50% is more useful than a fixed 80% threshold, (2) correlate metrics — CPU spike + slow query log entries suggests a query problem, CPU spike + connection spike suggests a traffic problem, (3) suppress false positives — scheduled backups, known maintenance windows, and expected traffic patterns (Monday morning spike) shouldn’t trigger alerts, (4) escalate appropriately — warning alerts go to the on-call channel, critical alerts page the DBA. Define the alert message format: what metric deviated, the current and baseline values, correlated events, and suggested investigation steps.
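A minimal sketch of baseline-relative detection with suppression, assuming illustrative z-score thresholds (2σ warning, 3σ critical) and made-up baselines. Note how a database normally at 30% CPU fires at 50% — long before a fixed 80% threshold would:

```python
def detect_anomaly(metric, value, baselines, suppressed=()):
    """Flag a value that deviates from its baseline, skipping suppressed metrics.

    baselines maps metric name -> (mean, stdev). Deviation beyond 2 stdevs
    is a warning, beyond 3 is critical (illustrative thresholds).
    """
    if metric in suppressed:  # e.g. a scheduled backup or maintenance window
        return None
    mean, stdev = baselines[metric]
    z = (value - mean) / stdev
    if abs(z) >= 3:
        return ("critical", round(z, 1))
    if abs(z) >= 2:
        return ("warning", round(z, 1))
    return None

baselines = {"cpu_pct": (30, 5), "connections": (60, 10)}
assert detect_anomaly("cpu_pct", 50, baselines) == ("critical", 4.0)
assert detect_anomaly("cpu_pct", 41, baselines) == ("warning", 2.2)
assert detect_anomaly("cpu_pct", 50, baselines, suppressed={"cpu_pct"}) is None
```

A fuller system would also correlate flags across metrics (CPU plus slow-log entries versus CPU plus connections), as the prompt describes; this sketch covers only the per-metric deviation step.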
Capacity Planning
AI prompt for capacity forecasting:
Create a capacity plan for my database. Current state: storage: [USED/TOTAL], CPU: [AVERAGE %], memory: [USED/TOTAL], connections: [AVERAGE/MAX], QPS: [CURRENT]. Growth trends: [MONTHLY DATA GROWTH, TRAFFIC GROWTH RATE]. Forecast: (1) when each resource will hit warning and critical thresholds at current growth, (2) when each resource will hit thresholds if growth accelerates by 50% (new feature launch, marketing push), (3) recommended actions with timelines — which resources need attention first and when, (4) cost-effective scaling options — vertical (bigger server) vs. horizontal (read replicas, sharding) vs. optimization (reduce data with archiving, reduce load with caching). Present a 6-month forecast with monthly checkpoints.
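The forecasting arithmetic in steps (1) and (2) can be sketched with compound monthly growth. The figures below are hypothetical — 400 GB used on a 1 TB volume, with the warning threshold at 70% (700 GB):

```python
def months_until(current, limit, monthly_growth):
    """Months until a resource crosses a limit at compound monthly growth."""
    months = 0
    while current < limit and months < 120:  # cap at 10 years
        current *= 1 + monthly_growth
        months += 1
    return months

base = months_until(400, 700, 0.05)           # current growth: 5%/month
accelerated = months_until(400, 700, 0.075)   # growth accelerates by 50%
```

With these assumed numbers, a 50% acceleration in growth pulls the warning date in by several months — which is exactly why the prompt asks for both scenarios rather than a single linear projection.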
Key Takeaways
- Baseline-relative monitoring (alert when CPU rises from normal 30% to 50%) catches issues far earlier than fixed-threshold monitoring (alert when CPU hits 80%) — AI establishes baselines and detects deviations that indicate emerging problems
- When a query suddenly slows without code changes, the cause is usually stale optimizer statistics, data growth, or resource contention — AI checks each systematically instead of guessing, and running ANALYZE on affected tables fixes the most common cause
- Slow query prioritization by total cost (frequency × execution time) focuses optimization on the queries that consume the most database resources — one 800ms query running 500 times/day costs more than one 5-second query running twice
- Capacity planning should begin once a resource reaches 70% utilization, accounting for growth acceleration, operational overhead (15-20% for temp tables and migrations), and one-time events — AI projects realistic timelines that simple linear projections miss
- Metric correlation (CPU spike + new query pattern, or connection spike + slow queries) identifies root causes faster than looking at individual metrics — AI connects the “what” (metric anomaly) to the “why” (specific query or event)
Up Next
In the next lesson, you’ll build backup and recovery strategies — verified backup procedures, point-in-time recovery, and the disaster recovery plans that protect your data.