Architecture Decisions and System Design
Use AI to reason through architecture trade-offs, evaluate technology choices, and design systems that scale. Make better decisions, faster.
The Decision That Haunts You
In the previous lesson, we explored documentation and knowledge sharing. Now let’s build on that foundation. Every developer has one: an architecture decision that seemed right at the time but turned into a nightmare. The microservices migration that tripled complexity. The NoSQL database that couldn’t handle the reporting requirements. The “simple” event-driven architecture that created debugging hell.
These decisions are expensive to reverse. A wrong technology choice can cost months of engineering time. And the worst part? You usually don’t know it’s wrong until you’re knee-deep in production issues.
AI won’t make these decisions for you—and you shouldn’t want it to. But it can do something incredibly valuable: lay out trade-offs you hadn’t considered, surface failure modes you’d only discover in production, and help you think through the second-order effects of each option.
The Architecture Decision Framework
Here’s how to use AI for architecture decisions:
Step 1: Define the Problem Space
We need to design the notification system for our SaaS app.
Context:
- 50,000 active users currently, growing 20% monthly
- Notification types: email, in-app, push, SMS
- Users configure preferences per notification type
- Some notifications are time-critical (security alerts)
- Some are batched (daily digest)
- Must support delivery tracking and retry logic
- Team: 4 backend developers, all strong in Python
- Current stack: Django, PostgreSQL, Redis, AWS
What are the key architecture decisions we need to make?
AI identifies the decision points:
- Synchronous vs. asynchronous processing
- Queue technology choice
- Database for notification state
- Delivery service (build vs. buy)
- Retry and failure handling strategy
Step 2: Explore Options with Trade-offs
For each decision point, ask AI to present options:
For our notification queue, compare these options
given our constraints (Python team, AWS, 50K users
growing 20%/month):
1. Celery with Redis
2. AWS SQS with Lambda
3. RabbitMQ
4. Kafka
For each option, analyze:
- Complexity to implement and maintain
- Scalability ceiling
- Failure modes and recovery
- Operational cost at our scale
- Learning curve for our Python team
- What happens when we hit 500K users?
AI produces a thorough comparison:
| Factor | Celery + Redis | SQS + Lambda | RabbitMQ | Kafka |
|---|---|---|---|---|
| Complexity | Low (team knows it) | Medium (new paradigm) | Medium | High (overkill?) |
| Scale ceiling | ~100K msgs/min | Virtually unlimited | ~500K msgs/min | Millions/sec |
| Failure handling | Basic retry, manual DLQ | Built-in DLQ, auto-retry | Robust, flexible | Excellent |
| Cost at 50K users | ~$50/mo (Redis) | ~$20/mo (pay per use) | ~$100/mo (server) | ~$300/mo (cluster) |
| At 500K users | Needs rearchitecting | Scales automatically | Needs larger cluster | Still comfortable |
| Team ramp-up | 0 (already using) | 2-3 weeks | 1-2 weeks | 4-6 weeks |
This table alone saves hours of research. But the real value is in the nuance AI adds—like noting that Celery + Redis is the fastest to ship but has a known issue with message acknowledgment under heavy load, or that SQS + Lambda introduces cold start latency that matters for time-critical security alerts.
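To make the Celery + Redis option concrete, here is a minimal sketch of a notification task with automatic retries. The names (send_notification, deliver, TransientDeliveryError) are illustrative placeholders, not part of any real codebase; note acks_late, which is where the acknowledgment caveat above comes into play.

```python
from celery import Celery

app = Celery("notifications", broker="redis://localhost:6379/0")

class TransientDeliveryError(Exception):
    """Provider timed out or rate-limited us; safe to retry."""

def deliver(user_id: int, channel: str, payload: dict) -> None:
    """Stand-in for the real email/in-app/push/SMS gateway call."""
    ...

@app.task(
    autoretry_for=(TransientDeliveryError,),
    retry_backoff=True,  # exponential backoff between attempts
    max_retries=5,       # after this the task is marked failed; Celery has no
                         # built-in DLQ, so failures need a manual dead-letter table
    acks_late=True,      # don't acknowledge until the task finishes (see the caveat above)
)
def send_notification(user_id: int, channel: str, payload: dict) -> None:
    deliver(user_id, channel, payload)
```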
Quick Check: Spot the Missing Trade-off
An AI recommends: “Use MongoDB for your notification storage because it handles flexible schemas and scales horizontally.”
What trade-off is missing from this recommendation? Think about what notifications need that MongoDB isn’t optimized for.
The missing piece: notifications involve heavy query patterns (show me unread notifications for user X, sorted by time). MongoDB can handle this, but it requires careful index management. PostgreSQL with proper indexes might actually be simpler for this access pattern, and the team already knows it. “Flexible schema” sounds great, but the notification schema is actually quite stable.
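For that access pattern, a plain PostgreSQL table with one composite index is often enough. A minimal Django sketch, assuming a hypothetical Notification model (field names are placeholders):

```python
from django.db import models

class Notification(models.Model):
    user = models.ForeignKey("auth.User", on_delete=models.CASCADE)
    channel = models.CharField(max_length=20)  # email, in-app, push, sms
    payload = models.JSONField()
    is_read = models.BooleanField(default=False)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [
            # Covers "unread notifications for user X, newest first"
            models.Index(fields=["user", "is_read", "-created_at"]),
        ]
```

The hot query then stays boring: `Notification.objects.filter(user=u, is_read=False).order_by("-created_at")[:20]`.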
Architecture Decision Records (ADRs)
Every significant decision should be documented. AI makes this painless:
Help me write an Architecture Decision Record for our
notification queue decision.
Decision: We chose Celery + Redis for Phase 1, with a
migration path to SQS for Phase 2.
Context: [paste the problem description from Step 1]
Options considered:
1. Celery + Redis
2. SQS + Lambda
3. RabbitMQ
4. Kafka
Decision drivers:
- Fastest time to market (team already knows Celery)
- Adequate for current scale
- Clear migration path as we grow
Format: Follow the MADR (Markdown Any Decision Record) format.
AI produces a structured ADR that you can review, adjust, and commit to your repo. The ADR serves as institutional memory—six months from now, when someone asks “why didn’t we use Kafka?”, the answer is documented.
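A trimmed skeleton of what that ADR might look like, using the standard MADR section headings (the content here is placeholder text):

```markdown
# Use Celery + Redis for the notification queue (Phase 1)

## Context and Problem Statement
We need asynchronous delivery for email, in-app, push, and SMS notifications...

## Decision Drivers
- Fastest time to market (team already knows Celery)
- Adequate for current scale (50K users, growing 20%/month)
- Clear migration path as we grow

## Considered Options
- Celery + Redis
- AWS SQS + Lambda
- RabbitMQ
- Kafka

## Decision Outcome
Chosen option: "Celery + Redis", because it ships fastest with the current team
and the planned SQS migration covers the scale risk.

### Consequences
- Good: zero ramp-up; reuses existing Redis infrastructure
- Bad: known acknowledgment issues under heavy load; revisit before ~500K users
```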
System Design with AI
For new system design, AI works best as a thinking partner. Here’s a real scenario:
We're designing a real-time collaborative document editor
(similar concept to Google Docs). Help me think through
the architecture.
Constraints:
- Must support 50 concurrent editors per document
- Changes visible within 200ms to other editors
- Must handle conflict resolution
- Offline editing with sync on reconnect
- Document size up to 100 pages
Questions I'm wrestling with:
1. CRDTs vs. Operational Transformation for conflict resolution?
2. WebSocket architecture for real-time sync?
3. How to handle the offline/sync scenario?
4. Storage strategy for version history?
AI won’t design the entire system for you, but it’ll walk through each question with context-specific analysis. For CRDTs vs. OT, it’ll explain that CRDTs are simpler to reason about but produce larger payloads, while OT is more efficient but harder to implement correctly. It’ll reference real-world examples: Google Docs uses OT, while Figma uses a CRDT-inspired approach.
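To see why CRDTs are “simpler to reason about,” here is a toy last-writer-wins register in Python. It is far simpler than the sequence CRDTs a real editor needs, but it shows the core property: merges are commutative and deterministic, so replicas converge without coordination.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LWWRegister:
    """Last-writer-wins register: a toy CRDT holding a single value."""
    value: str
    timestamp: float  # logical or wall-clock time of the last write
    replica_id: str   # tie-breaker so concurrent writes merge deterministically

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # The later write wins; ties break on replica_id, so every replica
        # picks the same winner regardless of merge order.
        if (self.timestamp, self.replica_id) >= (other.timestamp, other.replica_id):
            return self
        return other

# Two replicas edit offline, then sync: merge order doesn't matter.
a = LWWRegister("draft from laptop", timestamp=10.0, replica_id="laptop")
b = LWWRegister("draft from phone", timestamp=12.0, replica_id="phone")
assert a.merge(b) == b.merge(a)
```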
The key phrase to use: “What am I not thinking about?”
Based on the architecture we've discussed, what failure
modes or edge cases am I not thinking about?
AI might surface:
- “What happens when two users go offline, both edit the same paragraph, then come back online simultaneously?”
- “How do you handle a user with a very slow connection who’s 30 seconds behind the live document?”
- “What’s your strategy for very long documents where loading the entire CRDT state takes seconds?”
These are the questions that prevent 2 AM production fires.
Technology Evaluation
When evaluating a new technology for your stack:
We're considering adding GraphQL to our existing REST API.
Current situation:
- 30 REST endpoints serving a React frontend
- Mobile app launching in 3 months (needs same data, different shapes)
- Team of 6, no GraphQL experience
- Using Express.js, PostgreSQL
Evaluate GraphQL for our situation:
1. What specific problems would it solve?
2. What new problems would it introduce?
3. What's the migration path from our current REST API?
4. What's the realistic timeline including learning curve?
5. Are there alternatives that solve the same problems with less disruption?
AI gives you a balanced assessment. It might point out that the “different data shapes for mobile” problem could also be solved with REST + field selection (like JSON:API sparse fieldsets), which requires zero learning curve. GraphQL is great, but it’s not the only solution—and AI helps you see alternatives.
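The field-selection idea is small enough to sketch without GraphQL. Here is a framework-agnostic Python sketch of the pattern (a hypothetical apply_sparse_fields helper; in the Express.js app the same idea would be a few lines of middleware):

```python
def apply_sparse_fields(resource: dict, fields_param: str | None) -> dict:
    """Return only the fields the client asked for, e.g. ?fields=id,title.

    Mirrors JSON:API sparse fieldsets: the client names the shape it wants,
    and the server keeps a single REST endpoint.
    """
    if not fields_param:
        return resource
    wanted = {f.strip() for f in fields_param.split(",")}
    return {k: v for k, v in resource.items() if k in wanted}

article = {"id": 7, "title": "Q3 roadmap", "body": "...", "author": "dana"}

# Web client wants everything; mobile client asks for a slimmer shape.
assert apply_sparse_fields(article, None) == article
assert apply_sparse_fields(article, "id,title") == {"id": 7, "title": "Q3 roadmap"}
```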
Practical Exercise: Document a Decision
Think about an architecture decision you’ve recently made (or need to make). Try this:
- Describe the problem and constraints to your AI assistant
- Ask for 3 options with trade-off analysis
- Ask “what am I not thinking about?” for each option
- Generate an ADR for your chosen (or preferred) option
- Review the ADR—did AI capture trade-offs you’d want future team members to know?
This exercise takes about 15 minutes and produces documentation that would have taken an hour or more.
Key Takeaways
- Use AI to lay out trade-offs, not to make decisions for you
- Always ask about failure modes and second-order effects
- Request multiple options with explicit trade-off comparison
- “What am I not thinking about?” is your most powerful architecture prompt
- Document decisions with ADRs—AI makes this painless
- Technology evaluation should include alternatives, not just the trendy option
Next up: the capstone. You’ll build a complete feature end-to-end using AI at every stage—generation, debugging, testing, review, documentation, and architecture decisions. Everything comes together.