Architecture Decisions and System Design
Use AI to reason through architecture trade-offs, evaluate technology choices, and design systems that scale. Make better decisions, faster.
The Decision That Haunts You
In the previous lesson, we explored documentation and knowledge sharing. Now let’s build on that foundation. Every developer has one: an architecture decision that seemed right at the time but turned into a nightmare. The microservices migration that tripled complexity. The NoSQL database that couldn’t handle the reporting requirements. The “simple” event-driven architecture that created debugging hell.
These decisions are expensive to reverse. A wrong technology choice can cost months of engineering time. And the worst part? You usually don’t know it’s wrong until you’re knee-deep in production issues.
AI won’t make these decisions for you—and you shouldn’t want it to. But it can do something incredibly valuable: lay out trade-offs you hadn’t considered, surface failure modes you’d only discover in production, and help you think through the second-order effects of each option.
The Architecture Decision Framework
Here’s how to use AI for architecture decisions:
Step 1: Define the Problem Space
We need to design the notification system for our SaaS app.
Context:
- 50,000 active users currently, growing 20% monthly
- Notification types: email, in-app, push, SMS
- Users configure preferences per notification type
- Some notifications are time-critical (security alerts)
- Some are batched (daily digest)
- Must support delivery tracking and retry logic
- Team: 4 backend developers, all strong in Python
- Current stack: Django, PostgreSQL, Redis, AWS
What are the key architecture decisions we need to make?
AI identifies the decision points:
- Synchronous vs. asynchronous processing
- Queue technology choice
- Database for notification state
- Delivery service (build vs. buy)
- Retry and failure handling strategy
Step 2: Explore Options with Trade-offs
For each decision point, ask AI to present options:
For our notification queue, compare these options
given our constraints (Python team, AWS, 50K users
growing 20%/month):
1. Celery with Redis
2. AWS SQS with Lambda
3. RabbitMQ
4. Kafka
For each option, analyze:
- Complexity to implement and maintain
- Scalability ceiling
- Failure modes and recovery
- Operational cost at our scale
- Learning curve for our Python team
- What happens when we hit 500K users?
AI produces a thorough comparison:
| Factor | Celery + Redis | SQS + Lambda | RabbitMQ | Kafka |
|---|---|---|---|---|
| Complexity | Low (team knows it) | Medium (new paradigm) | Medium | High (overkill?) |
| Scale ceiling | ~100K msgs/min | Virtually unlimited | ~500K msgs/min | Millions/sec |
| Failure handling | Basic retry, manual DLQ | Built-in DLQ, auto-retry | Robust, flexible | Excellent |
| Cost at 50K users | ~$50/mo (Redis) | ~$20/mo (pay per use) | ~$100/mo (server) | ~$300/mo (cluster) |
| At 500K users | Needs rearchitecting | Scales automatically | Needs larger cluster | Still comfortable |
| Team ramp-up | 0 (already using) | 2-3 weeks | 1-2 weeks | 4-6 weeks |
This table alone saves hours of research. But the real value is in the nuance AI adds—like noting that Celery + Redis is the fastest to ship but has a known issue with message acknowledgment under heavy load, or that SQS + Lambda introduces cold start latency that matters for time-critical security alerts.
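To make the Celery + Redis option concrete, here is a minimal sketch of a notification task with automatic retries. The names (send_notification, deliver, TransientDeliveryError) are illustrative placeholders, not part of any real codebase; note acks_late, which is where the acknowledgment caveat above comes into play.

```python
from celery import Celery

app = Celery("notifications", broker="redis://localhost:6379/0")

class TransientDeliveryError(Exception):
    """Provider timed out or rate-limited us; safe to retry."""

def deliver(user_id: int, channel: str, payload: dict) -> None:
    """Stand-in for the real email/in-app/push/SMS gateway call."""
    ...

@app.task(
    autoretry_for=(TransientDeliveryError,),
    retry_backoff=True,  # exponential backoff between attempts
    max_retries=5,       # after this the task is marked failed; Celery has no
                         # built-in DLQ, so failures need a manual dead-letter table
    acks_late=True,      # don't acknowledge until the task finishes (see the caveat above)
)
def send_notification(user_id: int, channel: str, payload: dict) -> None:
    deliver(user_id, channel, payload)
```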
Quick Check: Spot the Missing Trade-off
An AI recommends: “Use MongoDB for your notification storage because it handles flexible schemas and scales horizontally.”
What trade-off is missing from this recommendation? Think about what notifications need that MongoDB isn’t optimized for.
The missing piece: notifications involve heavy query patterns (show me unread notifications for user X, sorted by time). MongoDB can handle this, but it requires careful index management. PostgreSQL with proper indexes might actually be simpler for this access pattern, and the team already knows it. “Flexible schema” sounds great, but the notification schema is actually quite stable.
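For that access pattern, a plain PostgreSQL table with one composite index is often enough. A minimal Django sketch, assuming a hypothetical Notification model (field names are placeholders):

```python
from django.db import models

class Notification(models.Model):
    user = models.ForeignKey("auth.User", on_delete=models.CASCADE)
    channel = models.CharField(max_length=20)  # email, in-app, push, sms
    payload = models.JSONField()
    is_read = models.BooleanField(default=False)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [
            # Covers "unread notifications for user X, newest first"
            models.Index(fields=["user", "is_read", "-created_at"]),
        ]
```

The hot query then stays boring: `Notification.objects.filter(user=u, is_read=False).order_by("-created_at")[:20]`.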
Architecture Decision Records (ADRs)
Every significant decision should be documented. AI makes this painless:
Help me write an Architecture Decision Record for our
notification queue decision.
Decision: We chose Celery + Redis for Phase 1, with a
migration path to SQS for Phase 2.
Context: [paste the problem description from Step 1]
Options considered:
1. Celery + Redis
2. SQS + Lambda
3. RabbitMQ
4. Kafka
Decision drivers:
- Fastest time to market (team already knows Celery)
- Adequate for current scale
- Clear migration path as we grow
Format: Follow the MADR (Markdown Any Decision Record) format.
AI produces a structured ADR that you can review, adjust, and commit to your repo. The ADR serves as institutional memory—six months from now, when someone asks “why didn’t we use Kafka?”, the answer is documented.
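A trimmed skeleton of what that ADR might look like, using the standard MADR section headings (the content here is placeholder text):

```markdown
# Use Celery + Redis for the notification queue (Phase 1)

## Context and Problem Statement
We need asynchronous delivery for email, in-app, push, and SMS notifications...

## Decision Drivers
- Fastest time to market (team already knows Celery)
- Adequate for current scale (50K users, growing 20%/month)
- Clear migration path as we grow

## Considered Options
- Celery + Redis
- AWS SQS + Lambda
- RabbitMQ
- Kafka

## Decision Outcome
Chosen option: "Celery + Redis", because it ships fastest with the current team
and the planned SQS migration covers the scale risk.

### Consequences
- Good: zero ramp-up; reuses existing Redis infrastructure
- Bad: known acknowledgment issues under heavy load; revisit before ~500K users
```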
System Design with AI
For new system design, AI works best as a thinking partner. Here’s a real scenario:
We're designing a real-time collaborative document editor
(similar concept to Google Docs). Help me think through
the architecture.
Constraints:
- Must support 50 concurrent editors per document
- Changes visible within 200ms to other editors
- Must handle conflict resolution
- Offline editing with sync on reconnect
- Document size up to 100 pages
Questions I'm wrestling with:
1. CRDTs vs. Operational Transformation for conflict resolution?
2. WebSocket architecture for real-time sync?
3. How to handle the offline/sync scenario?
4. Storage strategy for version history?
AI won’t design the entire system for you, but it’ll walk through each question with context-specific analysis. For CRDTs vs. OT, it’ll explain that CRDTs are simpler to reason about but produce larger payloads, while OT is more efficient but harder to implement correctly. It’ll reference real-world examples: Google Docs uses OT, while Figma uses a CRDT-inspired approach.
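To see why CRDTs are “simpler to reason about,” here is a toy last-writer-wins register in Python. It is far simpler than the sequence CRDTs a real editor needs, but it shows the core property: merges are commutative and deterministic, so replicas converge without coordination.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LWWRegister:
    """Last-writer-wins register: a toy CRDT holding a single value."""
    value: str
    timestamp: float  # logical or wall-clock time of the last write
    replica_id: str   # tie-breaker so concurrent writes merge deterministically

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # The later write wins; ties break on replica_id, so every replica
        # picks the same winner regardless of merge order.
        if (self.timestamp, self.replica_id) >= (other.timestamp, other.replica_id):
            return self
        return other

# Two replicas edit offline, then sync: merge order doesn't matter.
a = LWWRegister("draft from laptop", timestamp=10.0, replica_id="laptop")
b = LWWRegister("draft from phone", timestamp=12.0, replica_id="phone")
assert a.merge(b) == b.merge(a)
```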
The key phrase to use: “What am I not thinking about?”
Based on the architecture we've discussed, what failure
modes or edge cases am I not thinking about?
AI might surface:
- “What happens when two users go offline, both edit the same paragraph, then come back online simultaneously?”
- “How do you handle a user with a very slow connection who’s 30 seconds behind the live document?”
- “What’s your strategy for very long documents where loading the entire CRDT state takes seconds?”
These are the questions that prevent 2 AM production fires.
Technology Evaluation
When evaluating a new technology for your stack:
We're considering adding GraphQL to our existing REST API.
Current situation:
- 30 REST endpoints serving a React frontend
- Mobile app launching in 3 months (needs same data, different shapes)
- Team of 6, no GraphQL experience
- Using Express.js, PostgreSQL
Evaluate GraphQL for our situation:
1. What specific problems would it solve?
2. What new problems would it introduce?
3. What's the migration path from our current REST API?
4. What's the realistic timeline including learning curve?
5. Are there alternatives that solve the same problems with less disruption?
AI gives you a balanced assessment. It might point out that the “different data shapes for mobile” problem could also be solved with REST + field selection (like JSON:API sparse fieldsets), which requires zero learning curve. GraphQL is great, but it’s not the only solution—and AI helps you see alternatives.
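The field-selection idea is small enough to sketch without GraphQL. Here is a framework-agnostic Python sketch of the pattern (a hypothetical apply_sparse_fields helper; in the Express.js app the same idea would be a few lines of middleware):

```python
def apply_sparse_fields(resource: dict, fields_param: str | None) -> dict:
    """Return only the fields the client asked for, e.g. ?fields=id,title.

    Mirrors JSON:API sparse fieldsets: the client names the shape it wants,
    and the server keeps a single REST endpoint.
    """
    if not fields_param:
        return resource
    wanted = {f.strip() for f in fields_param.split(",")}
    return {k: v for k, v in resource.items() if k in wanted}

article = {"id": 7, "title": "Q3 roadmap", "body": "...", "author": "dana"}

# Web client wants everything; mobile client asks for a slimmer shape.
assert apply_sparse_fields(article, None) == article
assert apply_sparse_fields(article, "id,title") == {"id": 7, "title": "Q3 roadmap"}
```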
Practical Exercise: Document a Decision
Think about an architecture decision you’ve recently made (or need to make). Try this:
- Describe the problem and constraints to your AI assistant
- Ask for 3 options with trade-off analysis
- Ask “what am I not thinking about?” for each option
- Generate an ADR for your chosen (or preferred) option
- Review the ADR—did AI capture trade-offs you’d want future team members to know?
This exercise takes about 15 minutes and produces documentation that would have taken an hour or more.
Key Takeaways
- Use AI to lay out trade-offs, not to make decisions for you
- Always ask about failure modes and second-order effects
- Request multiple options with explicit trade-off comparison
- “What am I not thinking about?” is your most powerful architecture prompt
- Document decisions with ADRs—AI makes this painless
- Technology evaluation should include alternatives, not just the trendy option
Next up: the capstone. You’ll build a complete feature end-to-end using AI at every stage—generation, debugging, testing, review, documentation, and architecture decisions. Everything comes together.