Error Handling and Edge Cases
When automations break (and they will), graceful error handling is the difference between a minor hiccup and a disaster. Build resilient workflows.
The 3 AM Wake-Up Call
In the previous lesson, we explored data processing and multi-step workflows. Now let’s build on that foundation.
Your automation has been running smoothly for three weeks. You’ve almost forgotten about it. Then at 3 AM, your phone buzzes: “42 duplicate invoices sent to clients.”
What happened? The billing API had a temporary outage. Your automation retried, but the retry logic was wrong: no retry checked whether the invoice had already been sent, so every retry created a new invoice. And your automation dutifully sent each one.
This isn’t a hypothetical. Automation failures cause real damage: duplicate charges, missed communications, corrupted data, embarrassing emails to clients. And the worst failures are the ones nobody notices until a customer complains.
Error handling isn’t the exciting part of automation. But it’s the part that separates a reliable system from a ticking time bomb.
What You’ll Learn
By the end of this lesson, you’ll be able to identify common failure modes in automations, design retry strategies that don’t make things worse, handle edge cases before they become problems, and build monitoring that catches failures fast.
From Happy Path to Reality
In Lessons 3-5, we designed automations along the “happy path” – the scenario where everything works as expected. This lesson is about everything else. The data that’s missing. The API that’s down. The user who submits the form twice. The date that doesn’t parse. Real-world automations spend more time handling exceptions than processing normal cases.
Failure Mode Inventory
Before you can handle errors, you need to know what can go wrong. Here are the most common failure modes:
External Failures (Things break outside your control)
| Failure | Impact | Frequency |
|---|---|---|
| API timeout | Step can’t complete | Common |
| API rate limit exceeded | Too many requests, step blocked | Common |
| Service outage | Entire system unavailable | Occasional |
| Authentication expired | Credentials no longer valid | Occasional |
| Data source changed | Fields moved, renamed, or deleted | Rare |
Data Failures (The data isn’t what you expected)
| Failure | Impact | Frequency |
|---|---|---|
| Missing required field | Can’t process the record | Common |
| Wrong data format | Transformation fails | Common |
| Duplicate submission | Same trigger fires twice | Occasional |
| Null or empty values | Calculations fail, templates render wrong | Common |
| Unexpected characters | Special characters break parsing | Occasional |
Logic Failures (Your automation does the wrong thing)
| Failure | Impact | Frequency |
|---|---|---|
| Condition evaluates wrong | Data routed incorrectly | Occasional |
| Loop doesn’t terminate | Runs forever, eats resources | Rare but catastrophic |
| Race condition | Parallel steps conflict | Occasional |
| Stale data | Step uses outdated information | Occasional |
Quick Check
Think about one of your automation candidates. For each step, ask: “What happens if this step fails?” If your answer is “I don’t know” or “nothing, I guess,” that’s a vulnerability.
Retry Strategies
When a step fails due to a temporary issue (API timeout, network glitch), retrying often resolves it. But retrying incorrectly can make things worse.
The naive retry (DON’T DO THIS):
If step fails: immediately retry
If retry fails: immediately retry again
Repeat forever
This hammers the failing service, can create duplicates, and never stops.
The smart retry:
If step fails: wait 1 minute, retry (attempt 2 of 3)
If retry fails: wait 5 minutes, retry (attempt 3 of 3)
If still fails: STOP, mark as failed, alert human
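The smart retry above can be sketched in Python. This is a minimal illustration, not a definitive implementation: `run_with_retries`, `RetriesExhausted`, and the injectable `sleep` parameter are hypothetical names chosen for this example, and a real workflow would catch only errors known to be temporary.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Raised when every attempt has failed and a human should be alerted."""


def run_with_retries(
    step: Callable[[], T],
    delays_seconds: Sequence[float] = (60, 300),  # wait 1 minute, then 5 minutes
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Run `step`, retrying once per entry in `delays_seconds`, then give up."""
    total_attempts = len(delays_seconds) + 1
    last_error = None
    for attempt in range(1, total_attempts + 1):
        try:
            return step()
        except Exception as error:  # assumption: in practice, catch only temporary errors
            last_error = error
            if attempt < total_attempts:
                sleep(delays_seconds[attempt - 1])
    # All attempts failed: stop and surface the failure instead of looping forever.
    raise RetriesExhausted(f"failed after {total_attempts} attempts") from last_error
```

The caller catches `RetriesExhausted`, marks the run as failed, and alerts a human. The loop is bounded by design, so it cannot hammer a failing service indefinitely.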
Exponential backoff with jitter: The gold standard for retries. Each retry waits longer, and a random element prevents multiple automations from retrying simultaneously.
Attempt 1: Immediate
Attempt 2: Wait 1-2 minutes (random)
Attempt 3: Wait 4-8 minutes (random)
Attempt 4: Wait 16-32 minutes (random)
If all fail: Alert human
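As a rough sketch, the wait times in this schedule could be computed as follows. The function name `backoff_delays` and its parameters are assumptions made for illustration; the defaults reproduce the 1-2, 4-8, and 16-32 minute ranges listed above, expressed in seconds.

```python
import random
from typing import List, Optional


def backoff_delays(
    retries: int = 3,
    base_seconds: float = 60.0,
    factor: float = 4.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one randomized wait per retry: 1-2 min, 4-8 min, 16-32 min, ..."""
    rng = rng or random.Random()
    delays = []
    for retry_index in range(retries):
        low = base_seconds * factor ** retry_index  # 60, 240, 960 seconds by default
        delays.append(rng.uniform(low, 2 * low))  # jitter spreads retries apart
    return delays
```

Because each delay is drawn at random from its range, two automations that fail at the same moment are unlikely to retry at the same moment.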
Idempotency: The Safety Net
Before retrying, always check: “Has this step already succeeded?” If Step 3 sent an invoice but Step 4 failed, retrying from Step 3 should NOT send a second invoice.
Design each step to be idempotent – running it twice produces the same result as running it once. Techniques:
- Check for existing records before creating new ones
- Use unique IDs to prevent duplicates
- Verify state before taking action
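One way to apply these techniques is to key each action on a unique ID and check it before acting. The sketch below is hypothetical: `InvoiceSender` and `send_once` are illustrative names, and the in-memory dictionary stands in for whatever durable store (a database table, for example) a real automation would use.

```python
from typing import Callable, Dict


class InvoiceSender:
    """Sends each invoice at most once, keyed by a unique invoice ID."""

    def __init__(self, send: Callable[[str, str], None]) -> None:
        self._send = send
        # Assumption: a real system would persist this, not keep it in memory.
        self._sent: Dict[str, str] = {}  # invoice_id -> recipient

    def send_once(self, invoice_id: str, recipient: str) -> bool:
        """Return True if sent now, False if this invoice was already sent."""
        if invoice_id in self._sent:  # verify state before taking action
            return False
        self._send(invoice_id, recipient)
        self._sent[invoice_id] = recipient  # record only after a successful send
        return True
```

With this check in place, retrying the step with the same invoice ID is harmless: the second call returns `False` and sends nothing.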
Handling Edge Cases
Edge cases are the unusual-but-valid scenarios that break automations. The best time to find them is during design – not after they’ve caused a problem.
Common edge cases to test:
Empty/null data:
- What if the customer name field is blank?
- What if the amount is $0.00?
- What if the email address is missing?
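A small validation helper can answer these questions before a record enters the workflow. This is a sketch under assumptions: `find_missing_fields` is a hypothetical name, and it treats `None` and blank strings as missing while accepting a legitimate value of 0.

```python
from typing import Any, List, Mapping


def find_missing_fields(record: Mapping[str, Any], required: List[str]) -> List[str]:
    """Return required fields that are absent, None, or blank strings."""
    missing = []
    for field in required:
        value = record.get(field)
        # An amount of 0.0 is valid data, so only None and blank text count as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing
```

A non-empty result can route the record to a "skip and alert" branch instead of letting a later step fail on it.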
Boundary values:
- What if the date is January 1 (year boundary)?
- What if the order quantity is 1? What about 10,000?
- What if the text contains 50,000 characters?
Format variations:
- What if the phone number is “(555) 123-4567” vs “5551234567” vs “+1-555-123-4567”?
- What if the name has special characters: “O’Brien,” “Garcia-Lopez,” “St. John”?
- What if dates are in DD/MM/YYYY instead of MM/DD/YYYY?
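Format variations are usually handled by normalizing input to one canonical form before any other step sees it. The sketch below covers only the phone number example and assumes US numbers; `normalize_us_phone` is a hypothetical helper, not part of any library.

```python
import re
from typing import Optional


def normalize_us_phone(raw: str) -> Optional[str]:
    """Return 10 digits for a US number, or None if it can't be normalized."""
    digits = re.sub(r"\D", "", raw)  # strip everything that is not a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the +1 country code
    return digits if len(digits) == 10 else None
```

All three formats from the list above normalize to the same string, and anything that does not fit returns `None` so the workflow can flag it rather than guess.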
Timing edge cases:
- What if the trigger fires twice within 1 second (duplicate submission)?
- What happens during daylight saving time transitions?
- What about timezone differences between systems?
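For the duplicate-submission case, one possible approach is to remember when each trigger key was last seen and flag repeats inside a short window. The class name `DuplicateFilter`, the 5-second default, and the choice of key (a form submission ID, for instance) are all assumptions for this sketch. Timestamps are passed in as seconds so the logic does not depend on any system's local timezone.

```python
from typing import Dict


class DuplicateFilter:
    """Flags a trigger as a duplicate if the same key arrived within `window_seconds`."""

    def __init__(self, window_seconds: float = 5.0) -> None:
        self.window_seconds = window_seconds
        self._last_seen: Dict[str, float] = {}  # key -> timestamp of latest trigger

    def is_duplicate(self, key: str, now: float) -> bool:
        """Record this trigger and report whether it repeats a recent one."""
        previous = self._last_seen.get(key)
        self._last_seen[key] = now
        return previous is not None and (now - previous) <= self.window_seconds
```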
AI prompt for edge case discovery:
I'm building an automation that [description].
Here are the main steps:
1. [Step 1]
2. [Step 2]
3. [Step 3]
For each step, identify:
- 3-5 edge cases that could cause unexpected behavior
- What would happen if each edge case occurred
- How to handle each case gracefully
Also identify any cross-step edge cases where a
combination of conditions could cause problems.
Building Error Handling Into Your Design
For every step in your automation, define three things:
1. What success looks like:
Step 3: Create customer record in billing system
SUCCESS: Record created, billing_id returned
2. What failure looks like:
FAILURE MODES:
- API timeout: No response within 30 seconds
- Duplicate: Customer with this email already exists
- Validation error: Required fields missing
- Permission error: API key doesn't have write access
3. What to do for each failure:
HANDLING:
- API timeout: Retry with exponential backoff (3 attempts)
- Duplicate: Log warning, use existing record ID, continue
- Validation error: Log error with field details, skip record, alert admin
- Permission error: Alert admin immediately, pause automation
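The handling rules for this step could be written down as a lookup table, so that every failure mode has an explicit response. This is only a sketch: the `Failure` enum, the action names, and `plan_recovery` are hypothetical, and the actions are plain strings standing in for whatever your automation platform actually runs.

```python
from enum import Enum
from typing import Dict, List


class Failure(Enum):
    API_TIMEOUT = "api_timeout"
    DUPLICATE = "duplicate"
    VALIDATION_ERROR = "validation_error"
    PERMISSION_ERROR = "permission_error"


# Each failure mode maps to an ordered list of recovery actions.
HANDLING: Dict[Failure, List[str]] = {
    Failure.API_TIMEOUT: ["retry_with_backoff"],
    Failure.DUPLICATE: ["log_warning", "use_existing_record", "continue"],
    Failure.VALIDATION_ERROR: ["log_error", "skip_record", "alert_admin"],
    Failure.PERMISSION_ERROR: ["alert_admin", "pause_automation"],
}


def plan_recovery(failure: Failure) -> List[str]:
    """Look up the recovery actions; an unmapped failure escalates to a human."""
    return HANDLING.get(failure, ["alert_admin", "pause_automation"])
```

Keeping the table separate from the step logic makes gaps easy to spot: a failure mode with no entry falls through to the escalation default.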
Here’s a template:
For this automation step:
[Describe the step]
Define error handling:
1. Expected input: [What data should this step receive?]
2. Validation: [How do you verify the input is correct?]
3. Success output: [What does this step produce when it works?]
4. Failure modes: [List everything that could go wrong]
5. Recovery actions: [For each failure, what should happen?]
6. Alert rules: [When should a human be notified?]
7. Logging: [What should be recorded for troubleshooting?]
Monitoring and Alerting
Error handling catches problems as they happen. Monitoring catches problems that build up over time.
What to monitor:
| Metric | What it tells you | Alert threshold |
|---|---|---|
| Success rate | % of runs completing without errors | Below 95% |
| Execution time | How long the automation takes | More than 2x normal |
| Error count | Number of failures per day | Any increase from baseline |
| Records processed | Volume of items handled | Unexpected drops or spikes |
| Queue depth | Backlog of unprocessed items | Growing consistently |
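Two of these thresholds, success rate and execution time, are simple enough to check in a few lines. The sketch below assumes you can already collect the four numbers in `RunStats`; that type and `check_alerts` are hypothetical names for illustration.

```python
from typing import List, NamedTuple


class RunStats(NamedTuple):
    total_runs: int
    failed_runs: int
    avg_seconds: float       # average execution time in the current period
    baseline_seconds: float  # what "normal" execution time looks like


def check_alerts(stats: RunStats) -> List[str]:
    """Return alert messages for the success-rate and execution-time thresholds."""
    alerts = []
    if stats.total_runs > 0:
        success_rate = 1 - stats.failed_runs / stats.total_runs
        if success_rate < 0.95:
            alerts.append(f"success rate {success_rate:.1%} is below 95%")
    if stats.avg_seconds > 2 * stats.baseline_seconds:
        alerts.append("execution time is more than 2x normal")
    return alerts
```

An empty list means nothing crossed a threshold. The other metrics in the table depend on a baseline or trend, so they need history rather than a single snapshot.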
Monitoring cadence:
- Real-time alerts: For failures that need immediate attention (data corruption, duplicate sends)
- Daily digest: Summary of all runs, errors, and warnings
- Weekly review: Trends, success rates, performance patterns
The Error Handling Checklist
Before deploying any automation, verify:
- Every step has defined success and failure states
- Retry logic includes maximum attempts and backoff
- Retried steps are idempotent (safe to repeat)
- Null/empty data is handled at every step
- Duplicate triggers are detected and managed
- Unrecoverable errors alert a human with context
- All errors are logged with enough detail to troubleshoot
- There’s a manual override to pause or stop the automation
Exercise: Add Error Handling to Your Workflow
Take the multi-step workflow you designed in Lesson 5’s exercise. For each step:
- List 2-3 things that could go wrong
- Define how the automation should respond to each
- Design the retry strategy (if applicable)
- Specify when a human should be alerted
- Identify one edge case the step should handle
Use the AI prompt above to help discover edge cases you might miss.
Key Takeaways
- Silent failures are worse than loud failures – always alert someone when something breaks
- Retry strategies need maximum attempts, increasing delays, and idempotency checks
- Edge cases cause many automation failures – test for empty data, boundary values, format variations, and timing issues
- For every step: define success, list failure modes, and specify recovery actions
- Monitor success rate, execution time, and error count to catch gradual degradation
- The error handling checklist prevents the most common deployment mistakes
Next lesson: you’ve designed and error-proofed your automations. Now let’s test them properly and optimize for long-term reliability.