Data Storage & Database Design
Choose the right database for every use case — SQL vs. NoSQL trade-offs, schema design, sharding strategies, replication, consistency models, and the data decisions that determine system scalability.
🔄 Quick Recall: In the previous lesson, you learned the building blocks — load balancers, caches, CDNs, and message queues. Now you’ll dive into data storage — the database decisions that, more than anything else, determine your system’s scalability, consistency, and reliability.
Database design is where system design interviews get deep. The interviewer wants to see that you can choose the right database for each use case, design schemas that scale, and make explicit trade-offs between consistency, availability, and performance.
SQL vs. NoSQL Decision Framework
AI prompt for database selection:
Help me choose the right database for [USE CASE]. Data characteristics: [DESCRIBE — structured/unstructured, relationships, query patterns, write/read ratio]. Scale: [TRAFFIC AND STORAGE ESTIMATES]. Consistency requirements: [STRONG/EVENTUAL]. Compare: (1) SQL (PostgreSQL/MySQL) — when it’s the right choice and why, (2) Document store (MongoDB) — when it’s the right choice, (3) Wide-column (Cassandra/HBase) — when it’s the right choice, (4) Key-value (Redis/DynamoDB) — when it’s the right choice. For each: the specific strength for my use case, the specific weakness, and the scale threshold where it becomes problematic.
Database selection matrix:
| Use Case | Best Fit | Why | Not Ideal |
|---|---|---|---|
| User profiles, orders | SQL (PostgreSQL) | ACID, relationships, complex queries | > 10TB or > 50K writes/sec |
| Product catalog | Document (MongoDB) | Flexible schema, nested attributes | Complex joins, strict consistency |
| Activity feed, time-series | Wide-column (Cassandra) | High write throughput, time-based partitioning | Complex queries, transactions |
| Session data, cache | Key-value (Redis) | Sub-ms latency, simple lookups | Complex queries, large data sets |
| Search | Elasticsearch | Full-text search, faceting, ranking | Not a primary data store |
| Relationships (social graph) | Graph (Neo4j) | Traverse connections efficiently | Scaling, general-purpose queries |
Sharding Strategies
AI prompt for sharding design:
Design a sharding strategy for [TABLE/COLLECTION] with [X] rows and [Y] QPS. Access patterns: [DESCRIBE — which columns are queried most, which queries must be fast]. Compare: (1) Hash-based sharding — on which key, how many shards, (2) Range-based sharding — on which field, partition boundaries, (3) Geography-based sharding — if applicable. For each: which queries are fast (single-shard), which are slow (scatter-gather), and how to handle cross-shard queries.
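To make the hash-vs-range trade-off concrete, here is a minimal sketch of both routing functions in Python. The shard count, date-string keys, and range boundaries are illustrative assumptions, not values from any real deployment:

```python
import hashlib

NUM_SHARDS = 8  # assumed shard count for illustration

def hash_shard(key: str) -> int:
    """Hash-based: keys spread evenly, but range scans must hit every shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Range-based: boundaries would come from the observed key distribution;
# these date strings are made up for the example. Shard i holds keys < boundary[i].
RANGE_BOUNDARIES = ["2023-01-01", "2023-07-01", "2024-01-01"]

def range_shard(key: str) -> int:
    """Range-based: a range query stays on one shard, but hot ranges skew load."""
    for i, boundary in enumerate(RANGE_BOUNDARIES):
        if key < boundary:
            return i
    return len(RANGE_BOUNDARIES)  # final shard holds the newest keys

# The same key always routes to the same shard:
assert hash_shard("order_42") == hash_shard("order_42")
```

Note the asymmetry: a query for “all orders in June 2023” lands on a single range shard, but scatter-gathers across all eight hash shards.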
✅ Quick Check: You shard an orders table by order_id. The query “show all orders for user_123” must hit ALL shards because user orders are distributed randomly. What’s the fix? (Answer: If “orders by user” is a frequent query, shard by user_id instead of order_id. All orders for user_123 are on the same shard. The trade-off: looking up a single order by order_id now requires knowing which user placed it, or hitting all shards. Solution: maintain a secondary index mapping order_id→user_id.)
Replication and Consistency
AI prompt for replication design:
Design the replication strategy for [DATABASE] in my system. Consistency requirements: [STRONG/EVENTUAL per data type]. Availability requirements: [UPTIME TARGET]. Read/write distribution: [RATIO AND GEOGRAPHIC SPREAD]. Compare: (1) Single-leader replication — one writer, multiple readers, (2) Multi-leader replication — multiple writers in different regions, (3) Leaderless replication — any node can accept writes. For each: consistency guarantees, latency impact, and failure behavior.
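For the leaderless option, the consistency guarantee reduces to a quorum inequality: with N replicas, W write acknowledgements, and R read acknowledgements, every read quorum overlaps every write quorum when R + W > N. A one-function sketch:

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Leaderless (Dynamo-style) replication: R + W > N guarantees that
    every read quorum intersects every write quorum, so a read sees
    at least one replica with the latest write."""
    return r + w > n

# N=3 replicas, W=2, R=2: reads are guaranteed to see the latest write.
assert quorums_overlap(3, 2, 2)
# W=1, R=1 favors latency and availability, but a read may miss a recent write.
assert not quorums_overlap(3, 1, 1)
```

Tuning W down buys write latency and availability at the cost of the overlap guarantee — exactly the strong-vs-eventual trade-off the prompt asks you to state per data type.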
Data Modeling
AI prompt for schema design:
Design the data model for [SYSTEM]. Entities: [LIST]. Relationships: [DESCRIBE]. Access patterns: [MOST COMMON QUERIES]. Generate: (1) the schema — tables/collections with fields, types, and indexes, (2) denormalization decisions — what to duplicate for read performance and why, (3) the trade-off between normalization (no duplication, complex joins) and denormalization (duplication, simple reads).
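Here is a minimal sketch of output (3), the normalization trade-off, using an in-memory SQLite database. The users/posts schema is an invented example, not a schema the lesson prescribes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: author data lives in one place; feed reads need a join.
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        body TEXT NOT NULL,
        author_name TEXT NOT NULL  -- denormalized copy: skips the join on reads
    );
    CREATE INDEX idx_posts_user ON posts(user_id);
""")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")
conn.execute("INSERT INTO posts VALUES (1, 1, 'Hello', 'Ada')")

# The common read path touches only one table...
row = conn.execute(
    "SELECT body, author_name FROM posts WHERE user_id = 1"
).fetchone()
# ...the cost is updating author_name in every post whenever a user renames.
```

State that cost explicitly in the interview: denormalization trades write complexity (and a window of stale duplicates) for read simplicity on the hot path.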
Key Takeaways
- Polyglot persistence (different databases for different data types) is the right answer at scale: user profiles in PostgreSQL (consistency), news feed in Redis+Cassandra (throughput), search in Elasticsearch (full-text). State the trade-off: operational complexity vs. performance optimization
- The sharding key determines which queries are fast and which are slow: choose based on your most important access patterns. Hash sharding distributes evenly but breaks range queries; range sharding enables range queries but can create hot partitions
- The Transactional Outbox pattern solves cross-system consistency without distributed transactions: write the operation and the message to the same database atomically, then a separate process publishes the message
- Always state the consistency model explicitly: “This data uses strong consistency because a user updating their name must see the change immediately” vs. “This uses eventual consistency because seeing a new post 2 seconds late is acceptable”
- AI helps you practice the database selection conversation: describe your use case and AI walks through the options with specific trade-offs, just like a senior engineer in a design review
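The Transactional Outbox pattern from the takeaways can be sketched with SQLite standing in for the application database; the table names and the `publish` callback (in production, a Kafka producer or similar) are illustrative assumptions:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")

def place_order(order_id: int) -> None:
    """Write the order and its event in ONE local transaction: both or neither."""
    with conn:  # sqlite3 context manager commits, or rolls back on error
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"event": "order_placed", "order_id": order_id}),),
        )

def relay_outbox(publish) -> None:
    """A separate poller publishes pending events, then marks them done.
    Delivery is at-least-once, so downstream consumers must be idempotent."""
    for row_id, payload in conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0"
    ).fetchall():
        publish(payload)  # e.g. send to a message broker
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
```

Because the order row and the outbox row commit atomically, there is no window where the order exists but the event was lost — the cross-system consistency problem is reduced to a single local transaction plus a retryable relay.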
Up Next
In the next lesson, you’ll learn distributed systems concepts — CAP theorem, consensus, event-driven architecture, and the patterns that make systems reliable at scale.