Lesson 7 12 min

Root Cause Analysis

Master root cause analysis with the 5 Whys technique — stop patching symptoms and fix the underlying problems that cause bugs to keep coming back.

🔄 Recall Bridge: In the previous lesson, you learned production debugging — structured logging, error tracking, and AI-powered log analysis. Now let’s go deeper: instead of just fixing what broke, find out WHY it broke.

Most bugs are symptoms of deeper problems. Fixing the symptom feels productive — the error goes away, the user is happy. But if you don’t fix the root cause, the bug comes back. Or worse: it comes back as a different bug that’s harder to diagnose. Root cause analysis is the skill that separates “I fixed it” from “I fixed it permanently.”

The 5 Whys Technique

Ask “Why?” repeatedly until you reach a cause you can fix permanently:

Example: Users see a blank page after login

| Why # | Question | Answer |
|---|---|---|
| 1 | Why is the page blank? | The dashboard component crashes on render |
| 2 | Why does it crash? | user.preferences is undefined |
| 3 | Why is it undefined? | The API returns null for new users |
| 4 | Why does the API return null? | New user records don’t have a preferences row in the database |
| 5 | Why is there no preferences row? | The signup flow creates the user but doesn’t initialize default preferences |

Root cause: Signup flow missing default preferences initialization

Symptom fix (temporary): Add a user.preferences || {} fallback in the dashboard so it renders even when preferences are missing

Root cause fix (permanent): Add default preferences creation to the signup flow AND backfill existing users
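The difference between the two fixes can be sketched in a few lines. This is a minimal illustration, assuming in-memory dictionaries in place of the real users and preferences tables. Every name in it (DEFAULT_PREFERENCES, users, preferences, render_dashboard, sign_up, backfill_preferences) is hypothetical and not taken from any real codebase.

```python
from copy import deepcopy

# Hypothetical defaults; a real application would define its own.
DEFAULT_PREFERENCES = {"theme": "light", "notifications": True}

# In-memory stand-ins for the users and preferences tables.
users: dict[int, dict] = {}
preferences: dict[int, dict] = {}


def render_dashboard(user_id: int) -> str:
    """Symptom fix: tolerate a missing preferences row instead of crashing."""
    prefs = preferences.get(user_id) or {}
    return f"dashboard(theme={prefs.get('theme', 'light')})"


def sign_up(user_id: int, email: str) -> None:
    """Root cause fix, part 1: signup always creates default preferences."""
    users[user_id] = {"email": email}
    preferences[user_id] = deepcopy(DEFAULT_PREFERENCES)


def backfill_preferences() -> int:
    """Root cause fix, part 2: repair users created before the fix."""
    fixed = 0
    for user_id in users:
        if preferences.get(user_id) is None:
            preferences[user_id] = deepcopy(DEFAULT_PREFERENCES)
            fixed += 1
    return fixed


# A legacy user created before the fix has no preferences row.
users[1] = {"email": "old@example.com"}
sign_up(2, "new@example.com")
print(backfill_preferences())  # number of legacy users repaired
```

The fallback in render_dashboard keeps the page from crashing today, while sign_up and backfill_preferences remove the condition that made the fallback necessary.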

When to Stop Asking Why

| Stop When | Example |
|---|---|
| You reach a process gap | “No automated test covers this path” |
| You reach a design flaw | “The API doesn’t validate this input” |
| You reach a missing requirement | “We never specified what happens for new users” |
| You reach an infrastructure issue | “The database lacks a constraint for this” |

Don’t stop at: “Someone made a mistake” — that’s blame, not analysis. Ask “Why was it possible to make this mistake?” to find the process gap.

AI-Assisted Root Cause Analysis

AI prompt for 5 Whys:

I have a bug: [DESCRIBE]. The immediate cause is [WHAT YOU FOUND]. Help me do a 5 Whys analysis: keep asking “why” until we reach a root cause that can be fixed permanently. For each “why,” suggest what evidence I should look for to confirm the answer. At the end, recommend both: (1) a quick symptom fix for immediate relief, (2) a root cause fix for permanent resolution.

Symptom Fix vs. Root Cause Fix

|  | Symptom Fix | Root Cause Fix |
|---|---|---|
| Speed | Minutes to hours | Hours to days |
| Scope | Patches this specific failure | Prevents this category of failure |
| Risk | Low (small change) | Moderate (larger change) |
| Durability | Temporary (bug may return) | Permanent (bug class eliminated) |
| When | Production is burning | After stabilization |

The correct approach is usually both: Apply a symptom fix to stop the bleeding, then schedule the root cause fix immediately.

Common Root Cause Categories

| Category | Example | Systemic Fix |
|---|---|---|
| Missing validation | API accepts invalid data | Add input validation + schema enforcement |
| Missing default values | Null crashes when data absent | Initialize defaults at creation time |
| Race condition | Timing-dependent failures | Redesign to eliminate timing dependency |
| Missing error handling | Unhandled edge case crashes app | Add comprehensive error boundaries |
| Incomplete migration | Old data format causes crashes | Backfill + validate all records |
| Missing test coverage | Bug in untested code path | Add tests for this and similar paths |
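As one illustration of a systemic fix from the table, the "missing validation" category might be addressed by rejecting bad input at the API boundary. The sketch below assumes a hypothetical order-creation payload. The names ValidationError, CreateOrder, and parse_create_order are invented for this example, and a real project may prefer a schema library over hand-written checks.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Raised when a request payload does not match the expected shape."""


@dataclass(frozen=True)
class CreateOrder:
    sku: str
    quantity: int


def parse_create_order(payload: dict) -> CreateOrder:
    """Reject invalid data at the boundary instead of deep inside the code."""
    sku = payload.get("sku")
    quantity = payload.get("quantity")
    if not isinstance(sku, str) or not sku.strip():
        raise ValidationError("sku must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        raise ValidationError("quantity must be an integer")
    if quantity < 1:
        raise ValidationError("quantity must be at least 1")
    return CreateOrder(sku=sku.strip(), quantity=quantity)
```

Because every caller receives a CreateOrder that has already passed these checks, downstream code no longer needs its own defensive patches for the same bad input.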

Preventing Recurring Bugs

After every root cause fix, add prevention:

| Prevention | How |
|---|---|
| Regression test | Test that fails with the old bug, passes with the fix |
| Automated check | Linting rule, CI check, or database constraint |
| Documentation | Add to team knowledge base or runbook |
| Monitoring | Alert that detects this category of failure |
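A regression test in the sense of the first row might look like the sketch below, which reuses the missing-preferences bug from earlier in the lesson. It assumes a plain dictionary standing in for the database and is written in pytest style (a test_ function with a bare assert). The names sign_up_buggy, sign_up, and check_new_user_has_preferences are hypothetical.

```python
DEFAULT_PREFERENCES = {"theme": "light"}


def sign_up_buggy(db: dict, user_id: int) -> None:
    """The old behavior: creates the user but no preferences row."""
    db["users"][user_id] = {"id": user_id}


def sign_up(db: dict, user_id: int) -> None:
    """The fixed behavior: user and default preferences are created together."""
    db["users"][user_id] = {"id": user_id}
    db["preferences"][user_id] = dict(DEFAULT_PREFERENCES)


def check_new_user_has_preferences(sign_up_fn) -> None:
    """Shared assertion: a freshly created user must have a preferences row."""
    db = {"users": {}, "preferences": {}}
    sign_up_fn(db, 1)
    assert db["preferences"].get(1) == DEFAULT_PREFERENCES


def test_signup_creates_default_preferences() -> None:
    """Regression test: passes with the fix, fails if the old bug returns."""
    check_new_user_has_preferences(sign_up)
```

Running check_new_user_has_preferences against sign_up_buggy raises AssertionError, which is the property that makes this a regression test: it would have caught the original bug.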

AI prompt for prevention:

I fixed a root cause: [DESCRIBE FIX]. Now help me prevent this category of bug from ever happening again. Suggest: (1) A regression test that catches this specific bug. (2) An automated check that catches SIMILAR bugs. (3) A monitoring alert that detects this failure pattern in production.

Quick Check: A function crashes when processing a list with duplicate items. The quick fix is list(set(items)) to remove duplicates. What’s the root cause question you should ask? (Answer: “Why does the list have duplicates in the first place?” Maybe the data source is sending duplicates (fix: deduplicate at ingestion). Maybe the query joins tables incorrectly (fix: correct the join). Maybe the user submitted the form twice (fix: make the submission idempotent). Removing duplicates at processing time is a symptom fix: the duplicates will keep appearing until you fix how they’re created.)
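One of the root cause fixes named in the Quick Check, handling a form that is submitted twice, can be sketched as follows. This is a minimal illustration, assuming an in-memory dictionary in place of a real datastore and a client-supplied key that stays the same across retries. The names processed and submit_order are hypothetical.

```python
# Results of submissions already handled, keyed by idempotency key.
processed: dict[str, dict] = {}


def submit_order(idempotency_key: str, order: dict) -> dict:
    """Handle a submission once; repeats with the same key get the same result."""
    if idempotency_key in processed:
        # Duplicate submission: return the stored result, create nothing new.
        return processed[idempotency_key]
    result = {"order_id": len(processed) + 1, **order}
    processed[idempotency_key] = result
    return result


first = submit_order("form-abc", {"sku": "A1"})
second = submit_order("form-abc", {"sku": "A1"})  # user double-clicked
print(first is second, len(processed))
```

With the duplicate stopped where it is created, the processing code never sees it. Note also that list(set(items)) does not preserve the original order of items, which is one more reason to treat it as a stopgap.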

Key Takeaways

  • The 5 Whys technique drills past symptoms to root causes — each “why” moves one layer deeper, and you stop when you reach a process gap, design flaw, or missing requirement that can be fixed permanently
  • Root cause analysis targets processes, not people: “the code review didn’t catch it” leads to actionable process improvements (checklists, automated tests, linting rules), while “the developer wrote bad code” leads only to blame with no systemic fix
  • Use both symptom fixes AND root cause fixes: stabilize production immediately with a quick patch, then fix the root cause within days — the most dangerous outcome is a symptom fix that holds just long enough for everyone to forget the underlying problem

Up Next

In the final lesson, you’ll build your personal debugging playbook — combining everything from this course into a reusable system you can apply to any bug.

Knowledge Check

1. Your team fixes a bug: users couldn't log in because the authentication token expired too quickly. The fix: increase token expiration from 15 minutes to 60 minutes. Two weeks later, users report the same login failure. What went wrong with the original fix?

2. You're applying the 5 Whys to a production outage. After 'Why #3,' your team disagrees — one person says the root cause is 'the developer wrote bad code' and another says 'the code review didn't catch the bug.' Which framing is more useful?

3. You've completed root cause analysis and found the real issue. You also have 3 other bugs reported this week. Should you fix the root cause now or patch the symptom and fix it later?
