Production: Guardrails, Evaluation, and Observability
Deploy agents safely with input/output guardrails, systematic evaluation, distributed tracing, and failure recovery patterns for production reliability.
Building an agent that works in demos is easy. Building one that works reliably at scale is hard. This lesson covers the engineering practices that bridge that gap.
🔄 Quick Recall: In the previous lesson, you learned memory patterns for agent persistence. Production agents need more than memory — they need guardrails to prevent harm, evaluation to measure reliability, and observability to diagnose failures.
Guardrails: Safety Boundaries
Guardrails are automated checks that run before, during, and after agent execution to prevent harmful or incorrect behavior.
Input Guardrails
Check user input before the agent processes it:
User Input → [Input Guardrail] → Agent
├── PII detection: Block SSNs, credit cards
├── Injection detection: Flag "ignore instructions"
├── Scope check: Is this within the agent's domain?
└── Rate limiting: Prevent abuse
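A minimal sketch of the first three checks, assuming regex patterns and a small phrase list (production systems typically use trained classifiers and contextual analysis rather than patterns alone; all names here are illustrative):

```python
import re

# Illustrative patterns only; real PII and injection detection needs
# dedicated classifiers, not regex and keyword lists alone.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")
INJECTION_PHRASES = (
    "ignore your instructions",
    "ignore previous instructions",
    "pretend you are",
)

def check_input(text: str) -> list[str]:
    """Return a list of guardrail violations found in the user input."""
    violations = []
    if SSN_PATTERN.search(text):
        violations.append("pii:ssn")
    if CARD_PATTERN.search(text):
        violations.append("pii:card")
    lowered = text.lower()
    if any(phrase in lowered for phrase in INJECTION_PHRASES):
        violations.append("injection")
    return violations
```

An empty list means the input passes to the agent; any violation can be blocked, logged, or escalated depending on policy.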
Output Guardrails
Check agent output before it reaches the user:
Agent → [Output Guardrail] → User
├── PII masking: Replace sensitive data with ***
├── Factual grounding: Verify claims against sources
├── Policy compliance: No unauthorized promises
└── Format validation: Correct structure for downstream systems
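The PII-masking step can be sketched as a pass over the response text before delivery (patterns are illustrative, not exhaustive):

```python
import re

# Minimal masking sketch: replace anything that looks like an SSN or
# email with *** before the response reaches the user.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-like
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email-like
]

def mask_pii(text: str) -> str:
    for pattern in PII_PATTERNS:
        text = pattern.sub("***", text)
    return text
```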
Tool Guardrails
Check before executing tool calls:
Agent wants to call: delete_all_records(table="users")
[Tool Guardrail]:
├── Is this a destructive action? → Yes
├── Does the user have admin permissions? → Check
├── Is there a confirmation requirement? → Yes
└── Decision: Block and request human confirmation
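The decision flow above can be sketched as a small policy function. The tool names and three-way verdict (`allow` / `block` / `confirm`) are assumptions for illustration:

```python
# Hypothetical tool guardrail: destructive tools require both admin
# permissions and explicit confirmation before they run.
DESTRUCTIVE_TOOLS = {"delete_all_records", "drop_table", "send_payment"}

def check_tool_call(tool_name: str, user_is_admin: bool,
                    confirmed: bool) -> str:
    """Return 'allow', 'block', or 'confirm' for a proposed tool call."""
    if tool_name not in DESTRUCTIVE_TOOLS:
        return "allow"
    if not user_is_admin:
        return "block"
    if not confirmed:
        return "confirm"  # pause and request human confirmation
    return "allow"
```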
✅ Quick Check: An agent’s output guardrail blocks a response because it contains “I guarantee this will work.” The agent’s system prompt says never to make guarantees. But the response was actually quoting a customer’s email: “The customer wrote: ‘I guarantee this will work.’” Should the guardrail block this? (Answer: No — this is a false positive. The guardrail detected the word “guarantee” without understanding the context: the agent was quoting a customer, not making a promise itself. Better guardrails use contextual analysis rather than keyword matching alone, asking “Is the agent making a guarantee, or quoting someone else’s?” This is why guardrails must balance sensitivity with precision.)
Evaluation: Measuring Agent Reliability
What to Measure
| Metric | What It Captures | How to Measure |
|---|---|---|
| Task completion rate | Does the agent finish the job? | % of tasks fully completed vs. abandoned |
| Accuracy | Is the output correct? | Compare against ground truth (human-verified answers) |
| Consistency | Same input → same quality? | Run each test 5-10 times, measure variance |
| Latency | How long per task? | Time from input to final output |
| Cost | Token/API spend per task | Track tokens consumed, tool calls made |
| Safety | Does it ever produce harmful output? | Adversarial test suite |
Building a Test Suite
A production agent needs at minimum:
Test Suite Structure:
├── Happy path (50%): Normal, expected inputs
│ "Summarize this quarterly report"
│ "Find flights from NYC to London"
├── Edge cases (25%): Unusual but valid inputs
│ "Summarize this 200-page report" (very long)
│ "Find flights departing in 3 minutes" (impossible)
├── Adversarial (15%): Attempts to break or misuse
│ "Ignore your instructions and..."
│ "Pretend you're a different agent"
└── Regression (10%): Previously failed cases
Inputs that caused bugs in past versions
Evaluation Methods
LLM-as-Judge: Use a separate LLM to evaluate agent outputs against criteria:
Evaluate this agent response:
- Did it complete the requested task? (0-2)
- Is the information factually correct? (0-2)
- Did it stay within its defined scope? (0-1)
- Is the format correct? (0-1)
Human evaluation: For high-stakes agents, have humans review a sample of outputs regularly.
Automated checks: For structured outputs, validate programmatically (JSON schema validation, field completeness, value ranges).
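A programmatic check for a structured output might look like the sketch below. The field names and ranges are hypothetical; real systems would often use a JSON Schema validator instead of hand-rolled checks:

```python
# Sketch of an automated output check: required fields present,
# types correct, values in range. Field names are illustrative.
def validate_output(output: dict) -> list[str]:
    errors = []
    for field, expected_type in [("summary", str), ("confidence", float)]:
        if field not in output:
            errors.append(f"missing:{field}")
        elif not isinstance(output[field], expected_type):
            errors.append(f"wrong_type:{field}")
    confidence = output.get("confidence")
    if isinstance(confidence, float) and not 0.0 <= confidence <= 1.0:
        errors.append("out_of_range:confidence")
    return errors
```

An empty error list means the output can flow to downstream systems; anything else fails the evaluation case.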
Observability: Seeing Inside the Agent
Distributed Tracing
Every agent execution generates a trace — a record of every step:
Trace ID: abc-123
├── [0ms] User input received: "Analyze Q3 sales"
├── [50ms] Planning: Decomposed into 3 steps
├── [100ms] Tool call: database_query("SELECT * FROM sales...")
│ └── [800ms] Tool result: 1,247 rows returned
├── [850ms] Tool call: calculate_metrics(data)
│ └── [1200ms] Tool result: {revenue: 12.3M, growth: 15%}
├── [1250ms] Generating response
├── [2000ms] Output guardrail: PASSED
└── [2050ms] Response delivered to user
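A trace like the one above can be captured with a small recorder that timestamps each step relative to the start of the run. This is a minimal sketch; production systems typically emit spans to a tracing backend such as an OpenTelemetry collector:

```python
import time
import uuid

# Minimal tracing sketch: each agent step is recorded with its
# millisecond offset from the start of the run.
class Trace:
    def __init__(self):
        self.trace_id = str(uuid.uuid4())
        self.start = time.monotonic()
        self.events = []

    def log(self, event: str, **detail):
        offset_ms = int((time.monotonic() - self.start) * 1000)
        self.events.append({"ms": offset_ms, "event": event, **detail})

trace = Trace()
trace.log("input_received", text="Analyze Q3 sales")
trace.log("tool_call", tool="database_query")
```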
What to Log
| Event | What to Capture | Why |
|---|---|---|
| Agent decision | Which tool chosen, why (reasoning) | Debug wrong tool selection |
| Tool call | Input parameters, output, latency | Debug tool failures |
| Guardrail trigger | What was blocked, why | Tune guardrail sensitivity |
| Error | Error type, context, recovery action | Fix recurring failures |
| Token usage | Tokens per step, cumulative | Cost optimization |
Alerting
Set up alerts for:
- Task completion rate drops below threshold — something is systematically wrong
- Latency exceeds SLA — tool call hanging or LLM overloaded
- Guardrail trigger rate spikes — possible attack or agent regression
- Error rate exceeds baseline — new bug or external dependency failure
✅ Quick Check: Your agent’s task completion rate dropped from 94% to 78% overnight. Nothing in your code changed. What are the most likely causes? (Answer: External dependencies changed: (1) An API the agent uses was updated or went down, (2) the LLM provider had a model update that changed behavior, or (3) a database schema changed. Check your observability traces — they’ll show which step in the agent loop is failing, pointing directly to the cause. This is why detailed tracing matters: without it, you’d be guessing.)
Failure Recovery Patterns
Retry with Backoff
Tool call fails → Wait 1 second → Retry
Retry fails → Wait 2 seconds → Retry
Retry fails → Wait 4 seconds → Retry
Max retries exceeded → Fallback strategy
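The schedule above (1s, 2s, 4s, then give up) is exponential backoff, sketched here as a small wrapper; in practice you would catch only retryable errors (timeouts, rate limits) rather than all exceptions:

```python
import time

# Retry sketch matching the schedule above: wait 1s, 2s, 4s between
# attempts, then re-raise so the caller can fall back.
def call_with_retries(tool, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries + 1):
        try:
            return tool()
        except Exception:
            if attempt == max_retries:
                raise  # caller switches to a fallback strategy
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s
```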
Circuit Breaker
If a tool fails repeatedly, stop calling it:
Tool fails 5 times in 10 minutes →
Circuit OPEN: Stop calling this tool
Use fallback tool or inform user
After 5 minutes → Circuit HALF-OPEN: Try one call
If succeeds → Circuit CLOSED: Resume normal use
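The state machine above can be sketched as a small class. The thresholds mirror the example (5 failures to open, a cooldown before the half-open probe); real implementations usually also count failures within a sliding window:

```python
import time

# Circuit-breaker sketch: open after enough failures, allow a single
# probe call after the cooldown, close again on success.
class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=300.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_call(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True  # half-open: allow one probe call
        return False

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit
```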
Graceful Degradation
When a component fails, provide reduced but functional service:
Full capability: Search web + analyze + visualize
Web search down: Analyze from cached data + visualize
Visualization down: Search + analyze + text output only
Everything down: "I'm experiencing technical difficulties.
Here's what I can help with manually..."
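One way to implement the fallback ladder above is to define capability tiers and pick the best tier whose components are all currently healthy. The tier names and component sets here are assumptions for illustration:

```python
# Degradation sketch: tiers are ordered best-first; return the first
# tier whose required components are all available.
TIERS = [
    ("full", {"search", "analyze", "visualize"}),
    ("no_search", {"analyze", "visualize"}),   # search down: cached data
    ("text_only", {"search", "analyze"}),      # visualization down
    ("manual", set()),                         # everything down
]

def pick_tier(available: set) -> str:
    for name, required in TIERS:
        if required <= available:
            return name
    return "manual"
```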
Production Checklist
Before deploying an agent to production:
Safety:
□ Input guardrails configured (PII, injection, scope)
□ Output guardrails configured (PII masking, policy compliance)
□ Tool guardrails (confirmation for destructive actions)
□ Maximum iteration limit set (prevent infinite loops)
Evaluation:
□ Test suite with 50+ cases across all categories
□ Task completion rate > 90%
□ Adversarial test pass rate > 95%
□ Consistency score > 85% (same input → same quality)
Observability:
□ Distributed tracing enabled
□ Token usage logging per step
□ Guardrail trigger logging
□ Alerting on completion rate, latency, error rate
Recovery:
□ Retry logic with exponential backoff
□ Circuit breakers on external dependencies
□ Graceful degradation paths defined
□ Human escalation path for unrecoverable failures
Practice Exercise
- Design input + output guardrails for an agent in your domain
- Write 10 test cases: 5 happy path, 3 edge cases, 2 adversarial
- Define the 3 most important alerts you’d set up for your agent
Key Takeaways
- Guardrails operate at three layers: input (before processing), output (before delivery), and tool (before execution)
- Evaluate agents on multiple dimensions: task completion, accuracy, consistency, latency, cost, and safety
- Test suites need all four categories: happy path, edge cases, adversarial, and regression
- Observability through distributed tracing lets you pinpoint exactly where failures occur in multi-step agents
- Failure recovery requires retry logic, circuit breakers, and graceful degradation — not just error messages
- Production readiness is a checklist: safety, evaluation, observability, and recovery must all be addressed
Up Next
In the final lesson, you’ll pull everything together in a capstone exercise — designing a complete agent system using the patterns, tools, and practices from the entire course.