Debugging in Production
Learn to debug production issues — structured logging, error tracking, monitoring, and AI-powered log analysis when you can't reproduce bugs locally.
🔄 Recall Bridge: In the previous lesson, you learned common bug patterns — off-by-one, null references, race conditions, and async errors. Now let’s debug in the environment where you can’t set breakpoints or add print statements: production.
Production debugging is different from local debugging. You can’t attach a debugger, add print statements, or reproduce the issue on demand. Your tools are logs, error tracking, monitoring, and the evidence left behind. This is where structured logging and AI-powered analysis become essential.
Structured Logging
Unstructured log (hard to analyze):
Error processing request for user John
Structured log (queryable and analyzable):
{"level": "error", "timestamp": "2026-02-25T08:15:23Z", "service": "api",
"action": "process_order", "user_id": "usr_123", "order_id": "ord_456",
"error": "InsufficientBalance", "balance": 45.50, "required": 89.99,
"duration_ms": 234}
What to log at each level:
| Level | Log When | Include |
|---|---|---|
| ERROR | Operation failed | Exception, stack trace, context (user, request, data) |
| WARN | Something unexpected but handled | What was unexpected, what fallback was used |
| INFO | Important business events | Request received/completed, key milestones, timing |
| DEBUG | Detailed diagnostic info | Variable values, decision branches (disable in production) |
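The structured JSON line shown above can be produced with Python's standard logging module plus a small JSON formatter. This is a minimal sketch; the field names (`service`, `action`, `duration_ms`, and so on) follow the example above rather than any fixed standard, and the `"api"` service name is assumed.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "service": "api",  # hypothetical service name from the example
            "message": record.getMessage(),
        }
        # Merge any structured context passed via logging's `extra=` mechanism
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# ERROR level: operation failed, so include full business context
logger.error(
    "order processing failed",
    extra={"context": {
        "action": "process_order",
        "user_id": "usr_123",
        "order_id": "ord_456",
        "error": "InsufficientBalance",
        "balance": 45.50,
        "required": 89.99,
        "duration_ms": 234,
    }},
)
```

Because every line is a self-contained JSON object, these logs can be filtered by field (`user_id`, `error`, `duration_ms`) instead of grepped as free text.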
Error Tracking
Error tracking tools (Sentry, Bugsnag, Datadog) capture exceptions with context:
| Data Captured | Why It Matters |
|---|---|
| Stack trace | Where in code the error occurred |
| Request context | What input triggered it |
| User info | Who was affected (anonymized) |
| Environment | Browser, OS, API version |
| Frequency | How many users affected, trending up or down? |
| First/last occurrence | When did this start? Correlates with deployments? |
AI prompt for error investigation:
My error tracking shows this recurring error: [ERROR MESSAGE + STACK TRACE]. It affects [N] users per hour. It started [DATE]. Here’s a recent deployment changelog: [CHANGES]. Analyze: (1) Which code change likely introduced this? (2) What’s the probable root cause? (3) What should I investigate first?
AI-Powered Log Analysis
AI prompt for pattern finding:
Here are logs from 50 failed requests: [PASTE LOGS]. And here are 10 successful requests for comparison: [PASTE]. Identify: (1) What patterns do the failed requests share? (2) What’s different between failed and successful requests? (3) What’s the most likely root cause? (4) What additional logging would help diagnose this further?
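Before pasting logs into a prompt like this, it can help to pre-compute the comparison yourself so the AI (and you) start from a summary rather than raw lines. A small sketch that profiles field values across failed vs. successful structured logs; the field names and sample log lines are invented for illustration:

```python
import json
from collections import Counter

def field_profile(log_lines, field):
    """Count how often each value of `field` appears in a set of JSON logs."""
    return Counter(json.loads(line).get(field) for line in log_lines)

# Hypothetical structured log lines (in practice, read from your log store)
failed = [
    '{"level": "error", "region": "eu-west", "api_version": "v2"}',
    '{"level": "error", "region": "eu-west", "api_version": "v2"}',
    '{"level": "error", "region": "eu-west", "api_version": "v1"}',
]
successful = [
    '{"level": "info", "region": "us-east", "api_version": "v1"}',
    '{"level": "info", "region": "eu-west", "api_version": "v1"}',
]

# Compare value distributions field by field to surface what failures share
for field in ("region", "api_version"):
    print(field,
          "failed:", dict(field_profile(failed, field)),
          "ok:", dict(field_profile(successful, field)))
```

In this toy data, `api_version: v2` appears only in failed requests, which is exactly the kind of shared characteristic the prompt above asks the AI to find.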
Debugging Without Reproduction
When you can’t reproduce a bug locally:
| Technique | When to Use |
|---|---|
| Log analysis | Compare failed vs. successful request logs |
| Error tracking | Look at stack trace, context, frequency |
| Feature flags | Gradually enable/disable features to isolate |
| Canary deployment | Deploy fix to small percentage, monitor |
| Correlation analysis | Check if failures correlate with time, user type, region |
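Correlation analysis can start very simply: bucket error timestamps by minute and look for the onset of a spike, then check what else happened at that moment. A sketch with made-up timestamps and a hypothetical deploy log:

```python
from collections import Counter

# Hypothetical error timestamps pulled from structured logs
error_timestamps = [
    "2026-02-25T14:13:05Z", "2026-02-25T14:15:02Z", "2026-02-25T14:15:09Z",
    "2026-02-25T14:15:31Z", "2026-02-25T14:16:12Z", "2026-02-25T14:16:44Z",
]

# Truncate each ISO timestamp to the minute and count errors per bucket
per_minute = Counter(ts[:16] for ts in error_timestamps)

# First minute where the error count crosses a spike threshold (here, 3)
spike_start = min(m for m, n in per_minute.items() if n >= 3)

# Cross-reference the spike onset against deployment events (invented data)
deploys = {"2026-02-25T14:15": "release v2.3.1"}
suspect = deploys.get(spike_start)
```

The same bucketing idea works for any dimension (user type, region, client version): group failures by the candidate field and see which bucket the spike lives in.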
✅ Quick Check: Your monitoring shows that the error rate spiked from 0.1% to 5% at exactly 2:15 PM today. What’s the fastest way to identify the cause? (Answer: Check what changed at 2:15 PM: (1) Was there a deployment? Check the CI/CD pipeline. (2) Did a dependency update? Check package versions. (3) Did an external service change? Check third-party status pages. (4) Did the traffic pattern change? Check load metrics. The exact timing narrows the investigation dramatically — correlate the spike with events rather than debugging the errors individually.)
Key Takeaways
- Structured logging (JSON format with timestamp, level, context, duration) is the foundation of production debugging — unstructured text logs are nearly impossible to search and analyze at scale, while structured logs enable AI analysis and automated pattern detection
- Error tracking tools capture what print statements can’t: stack traces with full context (user, request, environment), frequency trends, and correlation with deployments — generic “Internal Server Error” messages mean your error tracking needs more detail, not that the bugs are mysterious
- AI excels at log analysis: paste failed and successful request logs together and AI finds patterns (shared characteristics, missing fields, timing correlations) that would take hours to spot manually in thousands of log lines
Up Next
In the next lesson, you’ll learn root cause analysis — the 5 Whys technique for finding and fixing the underlying problem instead of just patching symptoms.