Backup & Disaster Recovery
Build comprehensive database backup strategies with AI — automated backups, point-in-time recovery, restore verification, disaster recovery planning, and the tested procedures that protect your data.
🔄 Quick Recall: In the previous lesson, you built performance monitoring that catches problems early. Now you’ll build the backup and recovery systems that protect your data when things go wrong — because the question isn’t whether a data disaster will happen, it’s when.
An untested backup is not a backup. This isn’t a platitude — companies have discovered during real emergencies that their backups were corrupted, incomplete, or impossible to restore in the required timeframe. AI helps design comprehensive backup strategies with tested restore procedures, so when disaster strikes, you have a verified plan.
Backup Strategy Design
AI prompt for backup strategy:
Design a comprehensive backup strategy for my database. Database: [ENGINE AND VERSION]. Size: [GB]. Growth rate: [GB/MONTH]. RPO (max acceptable data loss): [MINUTES/HOURS]. RTO (max acceptable downtime): [MINUTES/HOURS]. Generate: (1) backup schedule — full backups (frequency, time), incremental/differential backups, continuous log archiving (WAL/binlog), (2) storage plan — where backups go (local, remote, cloud), retention policy (how long to keep each type), encryption requirements, (3) point-in-time recovery setup — how to enable and configure continuous log archiving, (4) monitoring — how to verify backups complete successfully, alert on failures, (5) restore procedure — step-by-step commands for each recovery scenario. Include the cron jobs or automation scripts.
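Point (4) in the prompt — verifying that backups actually complete — is worth automating. A minimal sketch of a freshness check, assuming backups land as timestamped files in a single directory (the directory layout and RPO threshold are placeholder assumptions, not part of any particular backup tool):

```python
import time
from pathlib import Path

def newest_backup_age_seconds(backup_dir: str) -> float:
    """Age in seconds of the most recently modified file in backup_dir.

    Raises FileNotFoundError if the directory contains no backup files.
    """
    files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    if not files:
        raise FileNotFoundError(f"no backups found in {backup_dir}")
    newest = max(files, key=lambda p: p.stat().st_mtime)
    return time.time() - newest.stat().st_mtime

def backup_is_fresh(backup_dir: str, rpo_seconds: int) -> bool:
    """True if the newest backup is younger than the RPO threshold."""
    return newest_backup_age_seconds(backup_dir) <= rpo_seconds
```

Run a check like this from cron and page someone when it returns False — a backup job that silently stopped a month ago is the classic failure mode this catches.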
Backup strategy by database size:
| Size | Full Backup | Incremental | WAL/Binlog | Retention | Storage Cost Est. |
|---|---|---|---|---|---|
| < 10 GB | Daily | Not needed | Continuous | 30 days full | $5-15/mo |
| 10-100 GB | Daily | 6-hourly | Continuous | 14 days full, 30 days incremental | $20-80/mo |
| 100 GB - 1 TB | Weekly | Daily diff + 6-hourly incr | Continuous | 7 days full, 30 days diff | $100-400/mo |
| 1 TB+ | Weekly | Daily diff + hourly incr | Continuous | 4 days full, 14 days diff | $400+/mo |
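The cost column above comes from simple arithmetic: retained full backups plus retained incrementals, times whatever compression your backup format achieves. A rough estimator — the 0.5 compression ratio is an illustrative assumption, not a guarantee:

```python
def backup_storage_gb(db_size_gb: float, fulls_retained: int,
                      daily_change_gb: float, incr_days_retained: int,
                      compression_ratio: float = 0.5) -> float:
    """Approximate backup storage footprint in GB.

    Each retained full backup is assumed to be roughly the database size;
    incrementals are approximated by the daily change volume. Both are
    scaled by an assumed compression ratio.
    """
    fulls = db_size_gb * fulls_retained
    incrementals = daily_change_gb * incr_days_retained
    return (fulls + incrementals) * compression_ratio
```

For a 100 GB database keeping 7 daily fulls and 30 days of ~2 GB/day incrementals at 0.5 compression, that works out to about 380 GB of backup storage — multiply by your provider's per-GB price to sanity-check the table.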
Restore Verification
AI prompt for restore testing:
Create a monthly backup restore verification procedure for my database. Database: [ENGINE]. Backup method: [pg_dump/mysqldump/pg_basebackup/xtrabackup/etc.]. Generate: (1) a script that restores the latest full backup to a test server, (2) data integrity checks — compare row counts, check recent data, verify specific known records, (3) PITR test — restore to a specific point in time and verify the correct data is present, (4) timing measurement — record how long the full restore takes (this is your actual RTO), (5) a report template that documents: backup date, restore date, restore duration, integrity check results, any issues found. The procedure should be automated enough to run with minimal manual intervention.
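The data integrity step (2) usually starts with row-count comparison between the source and the restored copy. A minimal sketch using sqlite3 purely for illustration — in production the same logic would run against your real engine's driver (psycopg2, mysqlclient, etc.), and the table list is an assumed trusted input since table names cannot be bound as query parameters:

```python
import sqlite3

def row_counts(conn, tables):
    """Return {table: row_count} for the given tables."""
    counts = {}
    for t in tables:
        # Table names come from a trusted, hard-coded list, not user input.
        counts[t] = conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
    return counts

def verify_restore(source_conn, restored_conn, tables):
    """Compare row counts table by table; return a list of
    (table, source_count, restored_count) mismatches."""
    src = row_counts(source_conn, tables)
    dst = row_counts(restored_conn, tables)
    return [(t, src[t], dst[t]) for t in tables if src[t] != dst[t]]
```

Row counts are a cheap first pass; the prompt's other checks (recent data, specific known records) catch cases where counts match but content doesn't.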
✅ Quick Check: Your monthly restore test reveals: the backup restores successfully, data integrity checks pass, but the restore takes 4 hours. Your SLA requires 1-hour recovery. What do you do? (Answer: This is exactly why you test. Options: (1) switch to a faster backup format — pg_basebackup restores faster than pg_dump for large databases, (2) maintain a hot standby that can be promoted in minutes instead of restoring from backup, (3) use incremental restore — restore last full backup nightly to a standby server, then only apply recent WAL to reach current state. Fix this now, not during an actual emergency.)
Disaster Recovery Planning
AI prompt for DR plan:
Create a disaster recovery plan for my database. Database: [ENGINE], size: [GB], hosted on: [CLOUD PROVIDER/ON-PREMISE]. Scenarios to plan for: (1) hardware failure — single disk, entire server, (2) data corruption — accidental DELETE/UPDATE, application bug, (3) datacenter outage — entire region unavailable, (4) security breach — ransomware, unauthorized access. For each scenario: (a) detection — how do you know it happened, (b) assessment — how severe is it, (c) recovery procedure — step-by-step commands, (d) estimated recovery time, (e) communication plan — who to notify and when. Include a runbook that an on-call engineer can follow at 3am without being a database expert.
Recovery procedure by scenario:
| Scenario | Recovery Method | Typical RTO | Data Loss |
|---|---|---|---|
| Accidental DELETE | PITR to before the event | 30-60 min | None (with PITR) |
| Table dropped | PITR or restore table from backup | 30-60 min | None (with PITR) |
| Disk failure | Promote replica or restore backup | 5 min (replica) / 1-4 hrs (backup) | Seconds (replica) / RPO (backup) |
| Server failure | Promote replica or new server + restore | 5-15 min (replica) / 2-8 hrs | Seconds / RPO |
| Data corruption | Identify corruption, PITR to clean state | 1-4 hrs | Varies |
| Ransomware | Clean server + restore offsite backup | 4-12 hrs | Up to RPO |
Data Retention and Archiving
AI prompt for data lifecycle:
Design a data retention and archiving strategy for my database. Tables: [LIST MAJOR TABLES WITH ROW COUNTS AND GROWTH RATES]. Business requirements: [WHICH DATA MUST BE KEPT AND FOR HOW LONG — e.g., orders for 7 years, logs for 90 days, sessions for 30 days]. Create: (1) a retention policy per table (active storage duration, archive duration, deletion schedule), (2) an archiving process — move old data to cheaper storage while keeping it queryable if needed, (3) partition strategy — use table partitioning to make archiving efficient (drop partition instead of DELETE), (4) the SQL scripts for the archiving job, (5) estimated storage savings and timeline.
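Step (3)'s partitioning approach can be sketched as a small DDL generator for monthly range partitions. This assumes PostgreSQL declarative partitioning syntax, and the `events` parent table name is a hypothetical example:

```python
from datetime import date

def month_partition_ddl(parent: str, year: int, month: int) -> str:
    """Generate CREATE TABLE DDL for one monthly range partition
    of a PostgreSQL range-partitioned parent table."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )
```

The archiving payoff is on the other end: retiring a month of data becomes `DROP TABLE events_2023_01;` — effectively instant — instead of a `DELETE` that scans and locks millions of rows.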
Key Takeaways
- Storing backups on the same server as the database defeats the purpose — a disk failure, server crash, or ransomware attack destroys both simultaneously. Offsite storage (different server, cloud, or both) is non-negotiable
- Point-in-time recovery (PITR) via continuous WAL/binlog archiving is the most valuable backup capability because it handles the most common disaster: human error (accidental DELETE, bad UPDATE). It must be enabled before the disaster — you can’t enable it retroactively
- Monthly restore verification is the only way to confirm your backups work, measure actual recovery time, and discover issues (restore takes 4 hours but SLA requires 1 hour) before they matter in an emergency
- Disaster recovery plans should be runbooks that an on-call engineer can follow at 3am — step-by-step commands, not high-level procedures that require interpretation under pressure
- Data archiving with table partitioning reduces active database size and improves performance — dropping a partition is instant, while DELETE on millions of rows can lock the table for minutes
Up Next
In the next lesson, you’ll build database security hardening and scaling strategies — access control, encryption, read replicas, and the patterns that protect and grow your database.