Lesson 6 · 12 min

Debugging in Production

Learn to debug production issues — structured logging, error tracking, monitoring, and AI-powered log analysis when you can't reproduce bugs locally.

🔄 Recall Bridge: In the previous lesson, you learned common bug patterns — off-by-one, null references, race conditions, and async errors. Now let’s debug in the environment where you can’t set breakpoints or add print statements: production.

Production debugging is different from local debugging. You can’t attach a debugger, add print statements, or reproduce the issue on demand. Your tools are logs, error tracking, monitoring, and the evidence left behind. This is where structured logging and AI-powered analysis become essential.

Structured Logging

Unstructured log (hard to analyze):

Error processing request for user John

Structured log (queryable and analyzable):

{"level": "error", "timestamp": "2026-02-25T08:15:23Z", "service": "api",
 "action": "process_order", "user_id": "usr_123", "order_id": "ord_456",
 "error": "InsufficientBalance", "balance": 45.50, "required": 89.99,
 "duration_ms": 234}
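One way to emit logs like this is a custom formatter on Python's standard `logging` module. The sketch below is illustrative, not the only approach; the `"api"` service name and the `context` field are assumptions, and real systems often use a library such as structlog instead.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "service": "api",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Merge structured context passed via logging's `extra` argument
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage: pass queryable fields instead of interpolating them into the message
logger.error("order failed", extra={"context": {
    "action": "process_order", "user_id": "usr_123",
    "order_id": "ord_456", "error": "InsufficientBalance",
}})
```

The key design choice: context goes in dedicated fields, never interpolated into the message string, so every field stays queryable.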

What to log at each level:

| Level | Log When | Include |
| --- | --- | --- |
| ERROR | Operation failed | Exception, stack trace, context (user, request, data) |
| WARN | Something unexpected but handled | What was unexpected, what fallback was used |
| INFO | Important business events | Request received/completed, key milestones, timing |
| DEBUG | Detailed diagnostic info | Variable values, decision branches (disable in production) |

Error Tracking

Error tracking tools (Sentry, Bugsnag, Datadog) capture exceptions with context:

| Data Captured | Why It Matters |
| --- | --- |
| Stack trace | Where in code the error occurred |
| Request context | What input triggered it |
| User info | Who was affected (anonymized) |
| Environment | Browser, OS, API version |
| Frequency | How many users affected, trending up or down? |
| First/last occurrence | When did this start? Correlates with deployments? |
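To make the table concrete, here is a toy in-process tracker that records the same kinds of data. This is a teaching sketch, not a real tool: tools like Sentry and Bugsnag capture all of this automatically, and the field names here are assumptions.

```python
import traceback
from collections import defaultdict
from datetime import datetime, timezone

class ErrorTracker:
    """Groups exceptions by type + message and records the context
    a real error tracker would capture automatically."""
    def __init__(self):
        self.groups = defaultdict(lambda: {
            "count": 0, "first_seen": None, "last_seen": None, "samples": []})

    def capture(self, exc, *, user_id=None, request=None, environment=None):
        key = f"{type(exc).__name__}: {exc}"       # grouping fingerprint
        now = datetime.now(timezone.utc)
        group = self.groups[key]
        group["count"] += 1                        # frequency / trend
        group["first_seen"] = group["first_seen"] or now  # correlate with deploys
        group["last_seen"] = now
        if len(group["samples"]) < 5:              # keep a few full samples
            group["samples"].append({
                "stack_trace": traceback.format_exc(),
                "user_id": user_id,                # anonymize in production
                "request": request,
                "environment": environment,
            })
        return key

tracker = ErrorTracker()
try:
    {}["missing"]                                  # simulate a production bug
except KeyError as exc:
    tracker.capture(exc, user_id="usr_123",
                    request={"path": "/orders"},
                    environment={"api_version": "v2"})
```

Notice that grouping by fingerprint is what turns 500 identical exceptions into one issue with a count, a first-seen timestamp, and a handful of representative samples.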

AI prompt for error investigation:

My error tracking shows this recurring error: [ERROR MESSAGE + STACK TRACE]. It affects [N] users per hour. It started [DATE]. Here’s a recent deployment changelog: [CHANGES]. Analyze: (1) Which code change likely introduced this? (2) What’s the probable root cause? (3) What should I investigate first?

AI-Powered Log Analysis

AI prompt for pattern finding:

Here are logs from 50 failed requests: [PASTE LOGS]. And here are 10 successful requests for comparison: [PASTE]. Identify: (1) What patterns do the failed requests share? (2) What’s different between failed and successful requests? (3) What’s the most likely root cause? (4) What additional logging would help diagnose this further?
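Before pasting logs into an AI, a cheap mechanical first pass can surface the obvious shared patterns. The helper below is a sketch under one assumption: the logs are structured (dicts of field/value pairs), as recommended above; the 80% threshold is arbitrary.

```python
from collections import Counter

def shared_patterns(failed_logs, ok_logs):
    """Find field=value pairs common to most failed requests
    but absent from successful ones."""
    def pair_counts(logs):
        counts = Counter()
        for entry in logs:
            for field, value in entry.items():
                counts[(field, str(value))] += 1
        return counts

    failed, ok = pair_counts(failed_logs), pair_counts(ok_logs)
    # Keep pairs present in >=80% of failures and in zero successes
    return {pair: n for pair, n in failed.items()
            if n >= 0.8 * len(failed_logs) and ok[pair] == 0}

# Hypothetical structured logs
failed = [{"region": "eu-west", "error": "Timeout", "retries": 3},
          {"region": "eu-west", "error": "Timeout", "retries": 2}]
ok = [{"region": "us-east", "retries": 0}]
print(shared_patterns(failed, ok))
```

Here the output flags `region=eu-west` as shared by all failures and no successes, which is exactly the kind of pattern you would then ask the AI to explain and extend.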

Debugging Without Reproduction

When you can’t reproduce a bug locally:

| Technique | When to Use |
| --- | --- |
| Log analysis | Compare failed vs. successful request logs |
| Error tracking | Look at stack trace, context, frequency |
| Feature flags | Gradually enable/disable features to isolate |
| Canary deployment | Deploy fix to small percentage, monitor |
| Correlation analysis | Check if failures correlate with time, user type, region |
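Correlation analysis can be as simple as bucketing errors by time. The sketch below assumes structured logs with `timestamp` and `level` fields; the sample data is invented to show a per-minute error-rate spike.

```python
from collections import Counter

def error_rate_by_minute(logs):
    """Bucket requests per minute and compute each bucket's error rate.
    A sharp spike at one minute points to a deploy or external change,
    not a code path to step through."""
    totals, errors = Counter(), Counter()
    for entry in logs:
        minute = entry["timestamp"][:16]   # "2026-02-25T14:15"
        totals[minute] += 1
        if entry["level"] == "error":
            errors[minute] += 1
    return {m: errors[m] / totals[m] for m in sorted(totals)}

# Hypothetical logs around a 14:15 incident
logs = [
    {"timestamp": "2026-02-25T14:14:10Z", "level": "info"},
    {"timestamp": "2026-02-25T14:14:40Z", "level": "info"},
    {"timestamp": "2026-02-25T14:15:05Z", "level": "error"},
    {"timestamp": "2026-02-25T14:15:30Z", "level": "error"},
]
print(error_rate_by_minute(logs))
```

Once the spike's exact minute is known, you correlate it with deployment and dependency events rather than debugging individual errors.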

Quick Check: Your monitoring shows that error rate spiked from 0.1% to 5% at exactly 2:15 PM today. What’s the fastest way to identify the cause? (Answer: Check what changed at 2:15 PM: (1) Was there a deployment? Check CI/CD pipeline. (2) Did a dependency update? Check package versions. (3) Did an external service change? Check third-party status pages. (4) Did traffic pattern change? Check load metrics. The exact timing narrows the investigation dramatically — correlate the spike with events rather than debugging the errors individually.)

Key Takeaways

  • Structured logging (JSON format with timestamp, level, context, duration) is the foundation of production debugging — unstructured text logs are nearly impossible to search and analyze at scale, while structured logs enable AI analysis and automated pattern detection
  • Error tracking tools capture what print statements can’t: stack traces with full context (user, request, environment), frequency trends, and correlation with deployments — generic “Internal Server Error” messages mean your error tracking needs more detail, not that the bugs are mysterious
  • AI excels at log analysis: paste failed and successful request logs together and AI finds patterns (shared characteristics, missing fields, timing correlations) that would take hours to spot manually in thousands of log lines

Up Next

In the next lesson, you’ll learn root cause analysis — the 5 Whys technique for finding and fixing the underlying problem instead of just patching symptoms.

Knowledge Check

1. Users report that your app is 'slow' in production. Your local environment is fast. You have basic logging that says 'request received' and 'response sent.' How do you diagnose the performance issue?

2. Your error tracking tool (Sentry, Bugsnag, etc.) shows 500 errors per hour on one endpoint. The error message is 'Internal Server Error.' That's all the information you have. What's your first step?

3. You have logs from 10,000 requests. 50 of them failed with different error messages. You need to find the common pattern. How does AI help here?
