Lesson 7 · 12 min

Production Patterns: Error Handling & Deployment

Make your AI workflows production-ready — error handling, retry strategies, queue mode, credential management, and monitoring.

🔄 You’ve built an email classifier (Lesson 3), a research agent (Lesson 4), a chatbot with memory (Lesson 5), and a RAG knowledge base (Lesson 6). They all work in testing. But “works in testing” and “works in production” are very different things. This lesson bridges that gap.

The Five Production Concerns

Every AI workflow that serves real users needs to handle five things your test environment ignores:

  1. Error handling — What happens when the LLM API is down?
  2. Retry logic — How do you recover from transient failures?
  3. Queue mode — How do you handle concurrent users?
  4. Credential security — How do you keep API keys safe?
  5. Monitoring — How do you know something went wrong?

Let’s tackle each one.

Error Handling

n8n gives you three levels of error handling:

Node-Level: Retry on Fail

Every node has a “Retry On Fail” setting in its options. For AI nodes that call external APIs (OpenAI, Anthropic, SerpAPI), enable this:

  • Max Retries: 3
  • Wait Between Retries: exponential backoff (1s → 2s → 4s)
  • Retry On: specific HTTP status codes (429 rate limit, 500 server error, 503 unavailable)

This handles the most common failure: an API that’s temporarily overloaded. The workflow pauses, retries, and continues — no manual intervention needed.
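The retry schedule above can be sketched in a few lines (a minimal illustration of exponential backoff, assuming a 1-second base wait — the function name is ours, not an n8n internal):

```javascript
// Sketch of the wait schedule behind "Retry On Fail" with exponential backoff.
// Each attempt doubles the previous wait: 1s -> 2s -> 4s.
function backoffDelays(maxRetries, baseMs = 1000) {
  return Array.from({ length: maxRetries }, (_, i) => baseMs * 2 ** i);
}

console.log(backoffDelays(3)); // → [1000, 2000, 4000]
```

Three retries with this schedule cover roughly 7 seconds of API downtime before the node gives up and the error output takes over.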

Node-Level: Error Outputs

Since n8n 2.0, every node has an error output. If a node fails (even after retries), the error output sends the failed item to a different path. You can:

  • Route errors to a Slack notification: “Email classifier failed for message from {{$json.from}}”
  • Log errors to a Google Sheet for later review
  • Send the item to a fallback workflow

This is critical for AI workflows. LLMs occasionally return unexpected output or malformed JSON, or simply time out — catching these errors prevents your entire workflow from crashing.
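As a concrete guard, a Code node placed after the LLM can attempt the JSON parse itself and return a structured failure instead of throwing, reserving the error output for real API failures. This is a sketch: `parseLlmJson` and the fence-stripping regex are our own, not an n8n helper.

```javascript
// Parse LLM output defensively: models sometimes wrap JSON in markdown fences
// or return prose. Return { ok: false } instead of throwing on bad output.
function parseLlmJson(raw) {
  try {
    // Strip a leading ```json fence and a trailing ``` fence, if present
    const cleaned = raw.replace(/^```(?:json)?\s*|\s*```$/g, "").trim();
    return { ok: true, data: JSON.parse(cleaned) };
  } catch (err) {
    return { ok: false, error: err.message, raw };
  }
}

console.log(parseLlmJson('```json\n{"label": "urgent"}\n```'));
// → { ok: true, data: { label: "urgent" } }
```

Downstream, an IF node can branch on `ok` and route failures to your Slack alert or fallback path.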

Workflow-Level: Error Workflow

n8n has a global Error Workflow feature. Create a separate workflow that fires whenever any workflow in your instance fails. It receives the error details, the workflow name, and the execution ID.

A typical error workflow sends a Slack message:

🚨 Workflow "AI Email Classifier" failed
Error: OpenAI API returned 429 (rate limit)
Execution: #48293
Time: 2026-03-05 14:23:00

Set this up in Settings → Workflows → Error Workflow.

Quick Check: Your RAG workflow crashes because the Supabase vector store is temporarily unavailable. What error handling should you have in place? (Answer: Three layers. (1) Retry on fail for the Supabase node — with 3 retries and exponential backoff. (2) An error output on the Supabase node that routes to a fallback response: “I’m having trouble accessing the knowledge base right now. Please try again in a moment.” (3) The global error workflow sends a Slack alert so you know the vector store is down.)

Queue Mode: Handling Concurrent Users

By default, n8n runs workflows sequentially — one execution at a time. The second user waits until the first finishes. For AI workflows that take 5-10 seconds per execution, this creates painful bottlenecks.

Queue mode fixes this by using Redis as a message broker:

  1. Workflow triggers create “jobs” in a Redis queue
  2. Worker processes pick up jobs and execute them concurrently
  3. Multiple workers can run on the same machine or across multiple servers

To enable queue mode (self-hosted):

# In your environment variables
EXECUTIONS_MODE=queue
QUEUE_BULL_REDIS_HOST=localhost
QUEUE_BULL_REDIS_PORT=6379

n8n Cloud enables queue mode automatically — no configuration needed.

Important: Remember from Lesson 5 that Simple Memory doesn’t work in queue mode. When you switch to queue mode, any workflow using Simple Memory will lose conversation history between messages. This is why PostgreSQL or Redis Memory is required for production chatbots.

Credential Management

n8n’s credential system encrypts secrets at rest. But there are still practices to follow:

Do:

  • Use n8n’s credential nodes for every service (OpenAI, Gmail, Slack, Supabase)
  • Export workflows to Git — credential data is automatically excluded
  • Rotate API keys periodically
  • Use the External Secrets feature for enterprise environments (AWS Secrets Manager, HashiCorp Vault)

Don’t:

  • Hardcode API keys in Code nodes or expressions
  • Share workflow exports that contain secrets in Set nodes
  • Use the same API key across dev, staging, and production

Environment separation: For serious deployments, run separate n8n instances for development and production. Export workflows from dev as JSON, import into production, and configure production credentials separately. This prevents test credentials from leaking and production credentials from being used in experiments.

Quick Check: A team member wants to share a workflow that includes an OpenAI credential. They export the workflow JSON and email it. Is this safe? (Answer: Yes, for the credential itself — n8n excludes credential data from exports by design. But double-check that no one hardcoded keys in Code nodes or Set node fields. The recipient will need to configure their own OpenAI credential and connect it to the imported workflow.)

Monitoring and Logging

n8n provides execution logs by default — every workflow run is recorded with inputs, outputs, and timing. But for production AI workflows, you need more:

What to monitor:

| Metric | Why It Matters | How to Track |
| --- | --- | --- |
| Execution time | AI calls are slow — detect when they get slower | n8n execution logs (built-in) |
| Success rate | Catch when error rates spike | Error workflow + dashboard |
| Token usage | Control LLM costs | OpenAI dashboard or middleware |
| Memory usage | Large conversation histories consume RAM | Server monitoring |
| Queue depth | Detect backlog buildup | Redis monitoring |

Simple monitoring setup:

  1. Create the Error Workflow (sends alerts to Slack/email)
  2. Add a “log” node at the end of important workflows that writes execution data to Google Sheets or a database
  3. Check the n8n execution list daily for failed runs
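The "log" step from point 2 might emit one flat record per execution, something a Google Sheet or database table can accept directly. The field names here are illustrative assumptions, not an n8n schema:

```javascript
// Build one flat log row per execution for a Sheet or DB table.
function buildLogRow({ workflow, status, startedAt, finishedAt, tokens }) {
  return {
    workflow,                             // workflow name
    status,                               // "success" | "error"
    durationMs: finishedAt - startedAt,   // execution time, to spot slowdowns
    tokens: tokens ?? 0,                  // LLM token count, if available
    loggedAt: new Date().toISOString(),   // when the row was written
  };
}

console.log(
  buildLogRow({
    workflow: "AI Email Classifier",
    status: "success",
    startedAt: 0,
    finishedAt: 5400,
    tokens: 1200,
  })
);
```

A week of rows like this is enough to chart execution time, success rate, and token spend — three of the five metrics in the table above.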

For advanced monitoring, n8n supports Prometheus metrics export and Sentry integration.

Human-in-the-Loop

Some AI decisions shouldn’t be fully automated. n8n supports a “send and wait” pattern for human oversight:

  1. AI Agent classifies an email as “urgent — escalate to legal”
  2. Instead of auto-sending to legal, the workflow sends a Slack message: “AI wants to escalate this to legal. Approve or reject?”
  3. The workflow waits for a human to click “Approve” or “Reject”
  4. Based on the response, the workflow continues or aborts

Use the Send Message and Wait for Response feature (available in Slack, Email, and other nodes). This is especially important for high-stakes AI decisions — approving expenses, sending external communications, or modifying data.
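The branch in step 4 reduces to a simple check after the wait resumes. This is a sketch: the `approved` field is an assumed shape for the human's reply, not a documented n8n payload.

```javascript
// Route the workflow based on the human's reply after "send and wait".
// Anything other than an explicit approval aborts, which is the safe default
// for high-stakes actions.
function routeApproval(response) {
  return response.approved === true ? "continue" : "abort";
}
```

Defaulting to "abort" on a missing or malformed reply means a dropped webhook can never silently approve an action.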

Putting It All Together: Production Checklist

Before activating any AI workflow for real users:

Error Handling:
- [ ] Retry on fail enabled for all external API nodes (3 retries, exponential backoff)
- [ ] Error outputs configured on AI nodes with fallback responses
- [ ] Global Error Workflow set up with Slack/email alerts

Performance:
- [ ] Queue mode enabled (self-hosted) or confirmed (Cloud)
- [ ] Memory type is PostgreSQL or Redis (NOT Simple Memory)
- [ ] Window Buffer configured to limit conversation history

Security:
- [ ] All credentials stored in n8n's credential system
- [ ] No hardcoded API keys in Code nodes or expressions
- [ ] Workflow JSON reviewed before committing to Git

Monitoring:
- [ ] Execution logs retained for troubleshooting
- [ ] Token usage tracked (check LLM provider dashboard weekly)
- [ ] Human-in-the-loop for high-stakes decisions

Key Takeaways

  • Use three layers of error handling: node retry, error outputs, and a global error workflow
  • Queue mode (with Redis) enables concurrent execution — required for multi-user production deployments
  • Never hardcode credentials — use n8n’s credential system and export workflows safely to Git
  • Simple Memory breaks in queue mode — always use PostgreSQL or Redis Memory in production
  • Monitor token usage — AI agent loops can burn through tokens faster than you expect
  • Add human-in-the-loop for high-stakes decisions using the send-and-wait pattern

Up Next

Final lesson. In the Capstone, you’ll build a complete AI assistant — combining classification, agent tools, persistent memory, RAG retrieval, error handling, and MCP connectivity into a single production-ready workflow. Everything from Lessons 1-7 comes together.

Knowledge Check

1. Your AI Agent workflow fails when OpenAI's API returns a rate limit error. What's the best way to handle this?

2. You have 5 AI workflows running on a single n8n instance. Users report that responses are slow during peak hours. What should you change?

3. A developer exports a workflow JSON to commit it to Git. What security risk should they check for?
