Error Handling and Edge Cases

The 3 AM Wake-Up Call

In the previous lesson, we explored data processing and multi-step workflows. Now let’s build on that foundation. Your automation has been running smoothly for three weeks. You’ve almost forgotten about it. Then at 3 AM, your phone buzzes: “42 duplicate invoices sent to clients.”

What happened? The billing API had a temporary outage. Your automation retried, but the retry logic was flawed: no retry checked whether the invoice had already been sent. So every retry created a new invoice. And your automation dutifully sent each one.

This isn’t a hypothetical. Automation failures cause real damage: duplicate charges, missed communications, corrupted data, embarrassing emails to clients. And the worst failures are the ones nobody notices until a customer complains.

Error handling isn’t the exciting part of automation. But it’s the part that separates a reliable system from a ticking time bomb.

What You’ll Learn

By the end of this lesson, you’ll identify common failure modes in automations, design retry strategies that don’t make things worse, handle edge cases before they become problems, and build monitoring that catches failures fast.

From Happy Path to Reality

In Lessons 3-5, we designed automations along the “happy path” – the scenario where everything works as expected. This lesson is about everything else. The data that’s missing. The API that’s down. The user who submits the form twice. The date that doesn’t parse. In real-world automations, a large share of the design effort goes into handling these exceptions rather than the normal case.

Failure Mode Inventory

Before you can handle errors, you need to know what can go wrong. Here are the most common failure modes:

External Failures (Things that break outside your control)

Failure	Impact	Frequency
API timeout	Step can’t complete	Common
API rate limit exceeded	Too many requests, step blocked	Common
Service outage	Entire system unavailable	Occasional
Authentication expired	Credentials no longer valid	Occasional
Data source changed	Fields moved, renamed, or deleted	Rare

Data Failures (The data isn’t what you expected)

Failure	Impact	Frequency
Missing required field	Can’t process the record	Common
Wrong data format	Transformation fails	Common
Duplicate submission	Same trigger fires twice	Occasional
Null or empty values	Calculations fail, templates render wrong	Common
Unexpected characters	Special characters break parsing	Occasional

Logic Failures (Your automation does the wrong thing)

Failure	Impact	Frequency
Condition evaluates wrong	Data routed incorrectly	Occasional
Loop doesn’t terminate	Runs forever, eats resources	Rare but catastrophic
Race condition	Parallel steps conflict	Occasional
Stale data	Step uses outdated information	Occasional

Quick Check

Think about one of your automation candidates. For each step, ask: “What happens if this step fails?” If your answer is “I don’t know” or “nothing, I guess,” that’s a vulnerability.

Retry Strategies

When a step fails due to a temporary issue (API timeout, network glitch), retrying often resolves it. But retrying incorrectly can make things worse.

The naive retry (DON’T DO THIS):

If step fails: immediately retry
If retry fails: immediately retry again
Repeat forever

This hammers the failing service, can create duplicates, and never stops.

The smart retry:

If step fails: wait 1 minute, retry (attempt 2 of 3)
If retry fails: wait 5 minutes, retry (attempt 3 of 3)
If still fails: STOP, mark as failed, alert human
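The smart retry can be sketched as a small Python wrapper. This is a minimal sketch, not any automation platform’s API: `run_with_retries`, `TransientError`, `RetriesExhausted`, and the `alert` callback are hypothetical names, and `sleep` is injectable so the logic can be tested without actually waiting.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error type for temporary failures (timeouts, network glitches)."""


class RetriesExhausted(Exception):
    """Raised when the last attempt fails, so the run can be marked as failed."""


def run_with_retries(
    step: Callable[[], T],
    delays: Sequence[float] = (60, 300),  # wait 1 minute, then 5 minutes
    alert: Callable[[str], None] = print,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `step`, retrying only transient failures, at most len(delays) + 1 times."""
    for delay in delays:
        try:
            return step()
        except TransientError:
            sleep(delay)  # back off before the next attempt
    try:
        return step()  # final attempt: no more retries after this one
    except TransientError as exc:
        total = len(delays) + 1
        alert(f"Step failed after {total} attempts: {exc}")  # STOP and alert a human
        raise RetriesExhausted(str(exc)) from exc
```

Errors other than `TransientError` (a validation error, for example) are not retried at all, because repeating them cannot help. Retrying is also only safe when the step is idempotent, as discussed under Idempotency below.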

Exponential backoff with jitter: The gold standard for retries. Each retry waits longer than the last, and a random element (the jitter) spreads retries out so that multiple automations hitting the same failing service don’t all retry at the same moment.

Attempt 1: Immediate
Attempt 2: Wait 1-2 minutes (random)
Attempt 3: Wait 4-8 minutes (random)
Attempt 4: Wait 16-32 minutes (random)
If all fail: Alert human
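The schedule above (1-2, 4-8, then 16-32 minutes) corresponds to a base delay of 1 minute, multiplied by 4 for each further attempt, with a random wait between that value and twice it. Here is a minimal Python sketch of that calculation; `backoff_delay` and its parameter names are illustrative, not taken from any particular tool.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 60.0, factor: float = 4.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before `attempt` (1-based), following the schedule above.

    Attempt 1 runs immediately. Attempt n >= 2 waits a random time between
    base * factor**(n - 2) and twice that value.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if attempt == 1:
        return 0.0
    if rng is None:
        rng = random.Random()
    low = base * factor ** (attempt - 2)
    return rng.uniform(low, 2 * low)  # the random spread is the jitter


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only so repeated runs print the same schedule
    for attempt in range(1, 5):
        print(attempt, round(backoff_delay(attempt, rng=rng) / 60, 1), "minutes")
```

This bounded variant matches the schedule in the lesson; other jitter schemes (for example, picking a wait anywhere between zero and the current delay) are also common. In practice you would usually cap the maximum delay as well.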

Idempotency: The Safety Net

Before retrying, always check: “Has this step already succeeded?” If Step 3 sent an invoice but Step 4 failed, retrying from Step 3 should NOT send a second invoice.

Design each step to be idempotent – running it twice produces the same result as running it once. Techniques:

Check for existing records before creating new ones
Use unique IDs to prevent duplicates
Verify state before taking action
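All three techniques can be combined in one place: derive a unique key for the action, check whether that key has already succeeded, and act only if it hasn’t. The sketch below assumes a hypothetical `send` callable for the billing call and uses an in-memory dict as a stand-in for real storage; both are illustrative assumptions.

```python
from typing import Callable, Dict


class InvoiceSender:
    """Hypothetical invoice step made idempotent with a unique key per invoice.

    `sent` stands in for durable storage, such as a database table with a
    unique constraint on the key; an in-memory dict is lost on restart.
    """

    def __init__(self, send: Callable[[str, float], str]) -> None:
        self._send = send  # the real side effect, e.g. a billing API call
        self.sent: Dict[str, str] = {}  # idempotency key -> invoice ID from the first send

    @staticmethod
    def idempotency_key(customer_id: str, billing_period: str) -> str:
        # Same customer + same period always yields the same key, however many retries.
        return f"invoice:{customer_id}:{billing_period}"

    def send_invoice(self, customer_id: str, billing_period: str, amount: float) -> str:
        key = self.idempotency_key(customer_id, billing_period)
        if key in self.sent:
            # Verify state before acting: this invoice already went out.
            return self.sent[key]
        invoice_id = self._send(customer_id, amount)
        self.sent[key] = invoice_id  # record success only after the send completes
        return invoice_id
```

The same check also absorbs a duplicate trigger firing twice. One gap remains: if the process crashes between sending and recording, a retry could still send twice. Real systems close that gap by making the check and the write atomic, or by passing the key to the downstream service; some billing APIs, such as Stripe’s, accept an idempotency key on requests for exactly this reason.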

Handling Edge Cases

Edge cases are the unusual-but-valid scenarios that break automations. The best time to find them is during design – not after they’ve caused a problem.

Common edge cases to test:

Empty/null data:

What if the customer name field is blank?
What if the amount is $0.00?
What if the email address is missing?

Boundary values:

What if the date is January 1 (year boundary)?
What if the order quantity is 1? What about 10,000?
What if the text contains 50,000 characters?

Format variations:

What if the phone number is “(555) 123-4567” vs “5551234567” vs “+1-555-123-4567”?
What if the name has special characters: “O’Brien,” “Garcia-Lopez,” “St. John”?
What if dates are in DD/MM/YYYY instead of MM/DD/YYYY?

Timing edge cases:

What if the trigger fires twice within 1 second (duplicate submission)?
What happens during daylight saving time transitions?
What about timezone differences between systems?
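Many of the empty-data and format-variation cases above can be caught by a validation step that runs before any other work. This is a minimal Python sketch under stated assumptions: the field names (`customer_name`, `email`, `amount`, `phone`) are hypothetical, phone handling assumes North American numbers only, and the date format is assumed to be agreed with the data source rather than guessed.

```python
import re
from datetime import date, datetime
from typing import Dict, List, Optional


def normalize_us_phone(raw: Optional[str]) -> Optional[str]:
    """Return "+1XXXXXXXXXX" for a North American number, or None if unparseable.

    International numbers need a dedicated library (e.g. phonenumbers).
    """
    if not raw:
        return None
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, parentheses, plus sign
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip the country code
    return f"+1{digits}" if len(digits) == 10 else None


def parse_date(raw: str, fmt: str) -> date:
    """Parse with an explicit, agreed format instead of guessing DD/MM vs MM/DD."""
    return datetime.strptime(raw.strip(), fmt).date()


def validate_record(record: Dict[str, object]) -> List[str]:
    """Collect every problem instead of stopping at the first, so the log is useful."""
    problems = []
    if not str(record.get("customer_name") or "").strip():
        problems.append("customer_name is blank")
    if not record.get("email"):
        problems.append("email is missing")
    amount = record.get("amount")  # assumed to be numeric when present
    if amount is None:
        problems.append("amount is missing")
    elif amount == 0:
        problems.append("amount is 0.00: confirm this is intentional")
    phone = record.get("phone")
    if phone and normalize_us_phone(str(phone)) is None:
        problems.append(f"phone not recognized: {phone!r}")
    return problems
```

Note that names are only checked for being non-blank; apostrophes and hyphens (“O’Brien,” “Garcia-Lopez”) pass through untouched, which is usually what you want.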

AI prompt for edge case discovery:

I'm building an automation that [description].

Here are the main steps:
1. [Step 1]
2. [Step 2]
3. [Step 3]

For each step, identify:
- 3-5 edge cases that could cause unexpected behavior
- What would happen if each edge case occurred
- How to handle each case gracefully

Also identify any cross-step edge cases where a
combination of conditions could cause problems.

Building Error Handling Into Your Design

For every step in your automation, define three things:

1. What success looks like:

Step 3: Create customer record in billing system
SUCCESS: Record created, billing_id returned

2. What failure looks like:

FAILURE MODES:
- API timeout: No response within 30 seconds
- Duplicate: Customer with this email already exists
- Validation error: Required fields missing
- Permission error: API key doesn't have write access

3. What to do for each failure:

HANDLING:
- API timeout: Retry with exponential backoff (3 attempts)
- Duplicate: Log warning, use existing record ID, continue
- Validation error: Log error with field details, skip record,
  alert admin
- Permission error: Alert admin immediately, pause automation
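The HANDLING rules for this step translate fairly directly into code: one handler per failure mode. In this Python sketch, the exception classes, the `create` callable, and the `log`, `alert`, and `pause_automation` hooks are all hypothetical stand-ins for whatever your billing client and platform actually provide.

```python
from dataclasses import dataclass
from typing import Callable, Dict


# Hypothetical exceptions a billing client might raise; real APIs will differ.
class ApiTimeout(Exception):
    """No response within 30 seconds; left to the retry logic, not handled here."""


class DuplicateCustomer(Exception):
    def __init__(self, existing_id: str) -> None:
        super().__init__(f"customer already exists: {existing_id}")
        self.existing_id = existing_id


class ValidationFailed(Exception):
    """Required fields missing or malformed."""


class PermissionDenied(Exception):
    """API key doesn't have write access."""


@dataclass
class StepOutcome:
    status: str  # "created", "reused", or "skipped"
    billing_id: str = ""


def create_customer_step(
    create: Callable[[Dict[str, str]], str],
    record: Dict[str, str],
    log: Callable[[str], None],
    alert: Callable[[str], None],
    pause_automation: Callable[[], None],
) -> StepOutcome:
    """Step 3 (create customer record), with one handler per failure mode.

    ApiTimeout is deliberately not caught, so it propagates to the retry logic.
    """
    try:
        return StepOutcome("created", create(record))  # SUCCESS: billing_id returned
    except DuplicateCustomer as exc:
        log(f"WARNING: {exc}; reusing existing record")
        return StepOutcome("reused", exc.existing_id)
    except ValidationFailed as exc:
        log(f"ERROR: validation failed for {record.get('email', '<no email>')}: {exc}")
        alert(f"Skipped record, validation failed: {exc}")
        return StepOutcome("skipped")
    except PermissionDenied as exc:
        alert(f"Permission error, pausing automation: {exc}")
        pause_automation()
        raise  # nothing else should run until a human fixes the credentials
```

Returning a small outcome object, rather than just an ID, lets later steps and the daily digest distinguish records that were created, reused, or skipped.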

Here’s a template:

For this automation step:
[Describe the step]

Define error handling:

1. Expected input: [What data should this step receive?]
2. Validation: [How do you verify the input is correct?]
3. Success output: [What does this step produce when it works?]
4. Failure modes: [List everything that could go wrong]
5. Recovery actions: [For each failure, what should happen?]
6. Alert rules: [When should a human be notified?]
7. Logging: [What should be recorded for troubleshooting?]

Monitoring and Alerting

Error handling catches problems as they happen. Monitoring catches problems that build up over time, such as a success rate that slowly declines.

What to monitor:

Metric	What it tells you	Alert threshold
Success rate	% of runs completing without errors	Below 95%
Execution time	How long the automation takes	More than 2x normal
Error count	Number of failures per day	Any increase from baseline
Records processed	Volume of items handled	Unexpected drops or spikes
Queue depth	Backlog of unprocessed items	Growing consistently
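The thresholds in this table can be checked by a short daily job. This Python sketch assumes you can export each run’s outcome, duration, and record count; the `Run` structure and `check_health` function are hypothetical, and the definitions of “unexpected” volume and “growing consistently” are assumptions called out in the comments.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Run:
    succeeded: bool
    duration_s: float
    records: int


def check_health(
    runs: Sequence[Run],
    normal_duration_s: float,
    baseline_errors: int,
    baseline_records: int,
    queue_depths: Sequence[int] = (),
) -> List[str]:
    """Return alert messages for one day of runs, using the thresholds above."""
    if not runs:
        return ["No runs recorded: the automation may have stopped silently"]
    alerts = []

    success_rate = sum(r.succeeded for r in runs) / len(runs)
    if success_rate < 0.95:
        alerts.append(f"Success rate {success_rate:.0%} is below 95%")

    slow = [r for r in runs if r.duration_s > 2 * normal_duration_s]
    if slow:
        alerts.append(f"{len(slow)} run(s) took more than 2x the normal time")

    errors = sum(not r.succeeded for r in runs)
    if errors > baseline_errors:
        alerts.append(f"{errors} errors today vs. a baseline of {baseline_errors}")

    # Assumption: "unexpected" means below half or above double the usual volume.
    processed = sum(r.records for r in runs)
    if not 0.5 * baseline_records <= processed <= 2 * baseline_records:
        alerts.append(f"Processed {processed} records vs. a usual {baseline_records}")

    # Assumption: "growing consistently" means each sample exceeds the previous one.
    depths = list(queue_depths)
    if len(depths) >= 3 and all(b > a for a, b in zip(depths, depths[1:])):
        alerts.append(f"Queue depth growing: {depths}")

    return alerts
```

An empty day is treated as an alert on purpose: an automation that stops running produces no errors at all, which is exactly the kind of silent failure monitoring is meant to catch.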

Monitoring cadence:

Real-time alerts: For failures that need immediate attention (data corruption, duplicate sends)
Daily digest: Summary of all runs, errors, and warnings
Weekly review: Trends, success rates, performance patterns

The Error Handling Checklist

Before deploying any automation, verify:

Every step has defined success and failure states
Retry logic includes maximum attempts and backoff
Retries are idempotent (safe to repeat)
Null/empty data is handled at every step
Duplicate triggers are detected and managed
Unrecoverable errors alert a human with context
All errors are logged with enough detail to troubleshoot
There’s a manual override to pause or stop the automation

Exercise: Add Error Handling to Your Workflow

Take the multi-step workflow you designed in Lesson 5’s exercise. For each step:

List 2-3 things that could go wrong
Define how the automation should respond to each
Design the retry strategy (if applicable)
Specify when a human should be alerted
Identify one edge case the step should handle

Use the AI prompt above to help discover edge cases you might miss.

Key Takeaways

Silent failures are worse than loud failures – always alert someone when something breaks
Retry strategies need maximum attempts, increasing delays, and idempotency checks
Edge cases cause most automation failures – test for empty data, boundary values, format variations, and timing issues
For every step: define success, list failure modes, and specify recovery actions
Monitor success rate, execution time, and error count to catch gradual degradation
The error handling checklist prevents the most common deployment mistakes

Next lesson: you’ve designed and error-proofed your automations. Now let’s test them properly and optimize for long-term reliability.