Claude Outcomes: The Rubric That Boosted Task Success 10 Points

Anthropic's new Outcomes feature adds a separate grader agent that checks work against your rubric. +10pt task success, +10.1% better decks. Here's how it actually works.

Dreaming got all the press from Anthropic’s May 6 announcements. But the feature engineers are actually wiring into production right now is Outcomes — a rubric-based grading system that adds a second agent whose only job is to check whether work meets the bar you set.

Anthropic’s internal tests showed +10 percentage points on task success across the board, with +8.4% better Word documents and +10.1% better PowerPoint decks. Same model, same prompts — only difference was Outcomes wrapping the agent in a grade-and-revise loop.

If you’re building production agents, this is the feature most likely to move your reliability numbers in the next 30 days. Here’s what it actually does, what a real rubric looks like, and where it falls down.

What Is Outcomes? (The Simple Version)

Picture a junior employee writing a customer support reply. They draft it, hand it to their manager, manager reads it, manager says “no, the tone is off and you didn’t link to the help article.” Junior employee revises. Manager reads again. After 2-3 rounds, the reply ships.

Outcomes is that loop, but both employees are Claude.

You define a rubric — a structured list of criteria the output has to satisfy. Claude does the work (writes the reply, generates the deck, classifies the leads). A separate Claude instance — the grader — reads the output, checks each criterion, and either passes it or says exactly which criteria failed and why. If anything failed, the worker revises. Loop until pass or you hit your max-iteration budget.

The trick that makes this different from “ask Claude to self-critique” is the separation of context. The grader doesn’t see the worker’s reasoning, scratchpad, or tools — only the rubric and the final artifact. That prevents the worker from rubber-stamping its own work, which is the failure mode of every self-reflection prompt anyone has ever shipped.

How It Works Technically

The architecture has three parts:

  1. Worker agent — does the actual work (writes the doc, generates the slides, classifies leads).
  2. Grader agent — runs in its own context window. Sees only the worker’s latest output plus the rubric.
  3. Managed Agents harness — orchestrates the loop. Calls worker → calls grader → reads result → either calls worker again with the gap analysis, or returns success.

The grader returns three things every round:

  • A per-criterion pass/fail
  • An overall pass/fail (computed from required criteria + weights)
  • A natural-language “gap analysis” — what failed and why

When the grader fails an output, the harness feeds the gap analysis to the worker as a new user message: “Here’s what didn’t meet the bar: [list]. Revise.” The worker doesn’t see the rubric directly — it sees the targeted feedback. That keeps the worker focused on fixing what’s broken instead of gaming the rubric structure.

The loop is bounded:

  • max_iterations defaults to 3, hard-capped at 20.
  • Each iteration is at least 2 model calls (grader + worker revision).
  • Webhook events (session.outcome_evaluation_ended) fire when grading or the overall outcome completes — so you can integrate with your existing job queue and dashboards.
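
If you want to wire that event into an existing job queue, the handler side is ordinary webhook plumbing. Here's a minimal sketch in Python with Flask; the payload field names (type, session_id, passed) are assumptions for illustration, not the documented event schema:

# Minimal webhook receiver for Outcomes events. The event name comes from the
# docs; the payload fields used below are assumptions -- verify the real schema.
from flask import Flask, request, jsonify

app = Flask(__name__)

def mark_job_done(session_id: str) -> None:
    # Replace with your own queue / dashboard integration.
    print(f"outcome passed for session {session_id}")

def flag_for_review(session_id: str) -> None:
    print(f"outcome failed for session {session_id}; routing to human review")

@app.post("/webhooks/outcomes")
def handle_outcome_event():
    event = request.get_json(force=True)
    if event.get("type") == "session.outcome_evaluation_ended":
        session_id = event.get("session_id")   # assumed field name
        if event.get("passed"):                # assumed field name
            mark_job_done(session_id)
        else:
            flag_for_review(session_id)
    return jsonify({"ok": True}), 200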

That’s it. The whole feature is “second agent, isolated context, rubric, iteration budget.”
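
You don't write this loop yourself on the Managed Agents harness, but the pattern is easy to picture in code. Here's a minimal sketch using the Anthropic Python SDK; the grader prompt, the JSON verdict format, and the model ID are illustrative assumptions, not the harness's actual internals:

# Sketch of the grade-and-revise loop: worker drafts, an isolated grader checks
# the draft against the rubric, and failures feed back as revision notes.
# This approximates the Outcomes pattern; it is not the Managed Agents harness.
import json
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"  # any current Claude model works here

def run_with_rubric(task_prompt: str, rubric: dict, max_iterations: int = 3) -> dict:
    worker_messages = [{"role": "user", "content": task_prompt}]

    for _ in range(max_iterations):
        # 1. Worker produces (or revises) the artifact.
        draft = client.messages.create(
            model=MODEL, max_tokens=2000, messages=worker_messages
        ).content[0].text

        # 2. Grader runs in a fresh context: it sees only the rubric and the
        #    draft, never the worker's conversation or reasoning.
        grader_prompt = (
            "Grade the artifact against this rubric. Return JSON with keys "
            "'passed' (bool) and 'gaps' (list of strings).\n\n"
            f"Rubric:\n{json.dumps(rubric, indent=2)}\n\nArtifact:\n{draft}"
        )
        verdict = json.loads(
            client.messages.create(
                model=MODEL, max_tokens=1000,
                messages=[{"role": "user", "content": grader_prompt}],
            ).content[0].text
        )

        if verdict["passed"]:
            return {"output": draft, "passed": True}

        # 3. Feed only the gap analysis back to the worker, not the rubric.
        worker_messages += [
            {"role": "assistant", "content": draft},
            {"role": "user",
             "content": "Here's what didn't meet the bar: "
                        + "; ".join(verdict["gaps"]) + ". Revise."},
        ]

    return {"output": draft, "passed": False}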

The +10 Point Number — What Anthropic Actually Tested

The headline number floating around is “+10 percentage points task success.” Here’s what we know about the methodology:

  • What it measured: Task success rate on real-world-style agent jobs — document generation, multi-step reasoning, Office formatting work. Scored against rubric-like criteria (structure, coverage, correctness).
  • The comparison: Same base model, same prompts. Baseline = standard prompting loop without rubric grading. Test = Outcomes wrapping the agent.
  • The specific deltas: +8.4% on docx (Word) generation, +10.1% on pptx (PowerPoint) generation.
  • The honest gap: Anthropic hasn’t published exact dataset composition, sample sizes, or a formal benchmark name. The numbers come from internal evals communicated via docs and blog posts, not a peer-reviewed benchmark.

Translation: this is a real, measurable lift on real-shaped workloads. It’s not benchmark theater. But it also isn’t a peer-reviewed paper.

The takeaway for builders: don’t expect a flat +10 points everywhere. Expect a similar shape of gain — biggest on tasks where “quality” can be cleanly specified in a rubric (Office docs, structured outputs, code that has to pass tests), smaller or zero on tasks where quality is fuzzy (creative writing, exploratory analysis).

What a Real Rubric Looks Like

Anthropic’s docs describe rubrics in prose plus high-level structure. They don’t publish a single canonical JSON schema, but the shape is consistent across examples. Here’s a representative pattern:

{
  "name": "support_reply_quality_v1",
  "max_iterations": 4,
  "criteria": [
    {
      "id": "factual_correctness",
      "description": "All statements about product behavior, pricing, and policies are supported by the knowledge base. No invented features, prices, or guarantees.",
      "scale": { "type": "binary", "labels": ["fail", "pass"] },
      "required": true,
      "weight": 0.4
    },
    {
      "id": "tone_and_brand",
      "description": "Polite, empathetic, matches brand voice. No slang, no blame, concise sentences.",
      "scale": {
        "type": "ordinal",
        "labels": ["poor", "ok", "great"]
      },
      "min_label": "ok",
      "weight": 0.2
    },
    {
      "id": "actionability",
      "description": "If issue is resolvable, gives concrete steps or links. If not, explains what's missing.",
      "scale": { "type": "binary", "labels": ["fail", "pass"] },
      "required": true,
      "weight": 0.4
    }
  ]
}

The pattern that works:

  • Binary or small ordinal scales. Don’t try 1-10 scoring — the grader can’t reliably distinguish 7 from 8. Use pass/fail or 3-level ordinals.
  • required flag for non-negotiables. Tone might be “ok” (acceptable). Factual correctness has to be “pass” — no exceptions.
  • Weights for soft criteria. Sum to 1.0. The grader uses these to compute the overall pass/fail; a plausible aggregation is sketched after this list.
  • Descriptions written like job specs. “Empathetic” is fuzzy. “Acknowledges the customer’s frustration, uses ‘I understand’ or equivalent in the opening, avoids ‘unfortunately’” is specific.
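
Anthropic doesn't publish the exact aggregation, but a plausible reading of "required criteria + weights" is a hard gate on required criteria plus a weighted score over everything else. A sketch, where the 0.7 threshold and the label-to-number mapping are assumptions:

# Plausible overall pass/fail aggregation: every required criterion must clear
# its bar, and the weighted score across criteria must clear a threshold.
# The threshold and label scoring below are assumptions, not documented behavior.

def label_score(criterion: dict, label: str) -> float:
    labels = criterion["scale"]["labels"]           # e.g. ["poor", "ok", "great"]
    return labels.index(label) / (len(labels) - 1)  # maps to 0.0 .. 1.0

def criterion_passed(criterion: dict, label: str) -> bool:
    labels = criterion["scale"]["labels"]
    min_label = criterion.get("min_label", labels[-1])
    return labels.index(label) >= labels.index(min_label)

def overall_pass(rubric: dict, grades: dict[str, str], threshold: float = 0.7) -> bool:
    # Hard gate: any failed required criterion fails the whole output.
    for c in rubric["criteria"]:
        if c.get("required") and not criterion_passed(c, grades[c["id"]]):
            return False
    # Soft gate: weighted score across all criteria must clear the threshold.
    score = sum(c["weight"] * label_score(c, grades[c["id"]]) for c in rubric["criteria"])
    return score >= threshold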

Three Rubrics That Work in Production

Below are three patterns we’ve seen work for real teams. Plug them in and adjust the criteria descriptions to your context.

Rubric 1: Code Review PR

For agents that review PRs and leave actionable comments.

{
  "name": "code_review_pr_v1",
  "max_iterations": 3,
  "criteria": [
    {
      "id": "compiles_and_tests",
      "description": "Suggested changes are syntactically valid in the target language and do not break existing tests or APIs described in context.",
      "scale": { "type": "binary", "labels": ["fail", "pass"] },
      "required": true,
      "weight": 0.35
    },
    {
      "id": "correctness_and_safety",
      "description": "Review identifies logical bugs, edge cases, and security issues evident from the diff (injection, auth, data races).",
      "scale": {
        "type": "ordinal",
        "labels": ["missed_major", "adequate", "thorough"]
      },
      "min_label": "adequate",
      "weight": 0.3
    },
    {
      "id": "specific_feedback",
      "description": "Comments reference concrete lines or blocks, not just general advice.",
      "scale": { "type": "binary", "labels": ["fail", "pass"] },
      "required": true,
      "weight": 0.2
    },
    {
      "id": "style_and_scope",
      "description": "Feedback respects repo style guides and keeps scope focused on this PR.",
      "scale": {
        "type": "ordinal",
        "labels": ["poor", "ok", "great"]
      },
      "min_label": "ok",
      "weight": 0.15
    }
  ]
}

Rubric 2: Lead Scoring

For agents that score sales leads and assign next-step playbooks.

{
  "name": "lead_scoring_v1",
  "max_iterations": 3,
  "criteria": [
    {
      "id": "score_range",
      "description": "Lead score is an integer 0-100, matches tier ranges (0-29 cold, 30-69 warm, 70-100 hot).",
      "scale": { "type": "binary", "labels": ["fail", "pass"] },
      "required": true,
      "weight": 0.25
    },
    {
      "id": "justification_alignment",
      "description": "Textual justification cites real firmographic and behavioral data from the CRM record. No invented fields.",
      "scale": {
        "type": "ordinal",
        "labels": ["incorrect", "partial", "full"]
      },
      "min_label": "partial",
      "weight": 0.3
    },
    {
      "id": "playbook_mapping",
      "description": "Output includes the correct next-step playbook ID (P1/P2/P3) given score and ICP rules.",
      "scale": { "type": "binary", "labels": ["fail", "pass"] },
      "required": true,
      "weight": 0.25
    },
    {
      "id": "formatting",
      "description": "Result matches schema {lead_id, score, tier, playbook_id, rationale}.",
      "scale": { "type": "binary", "labels": ["fail", "pass"] },
      "required": true,
      "weight": 0.2
    }
  ]
}

For structured outputs like this, pair the grader rubric with code-based schema validation. Belt and suspenders — both catch different things.
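
As a concrete example, a few lines with the jsonschema library catch structural problems deterministically before the output ever reaches the grader (a sketch; the schema mirrors the formatting and score_range criteria above):

# Code-based validation alongside the rubric: the schema check catches
# structural problems deterministically; the grader catches semantic ones.
from jsonschema import validate, ValidationError

LEAD_SCORE_SCHEMA = {
    "type": "object",
    "required": ["lead_id", "score", "tier", "playbook_id", "rationale"],
    "properties": {
        "lead_id": {"type": "string"},
        "score": {"type": "integer", "minimum": 0, "maximum": 100},
        "tier": {"enum": ["cold", "warm", "hot"]},
        "playbook_id": {"enum": ["P1", "P2", "P3"]},
        "rationale": {"type": "string", "minLength": 1},
    },
    "additionalProperties": False,
}

def validate_lead_output(output: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the shape is valid."""
    try:
        validate(instance=output, schema=LEAD_SCORE_SCHEMA)
    except ValidationError as err:
        return [err.message]

    # Cross-field rule from the rubric: score must land in the right tier range.
    tiers = {"cold": range(0, 30), "warm": range(30, 70), "hot": range(70, 101)}
    if output["score"] not in tiers[output["tier"]]:
        return [f"score {output['score']} does not match tier '{output['tier']}'"]
    return []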

Rubric 3: Slide Deck Generation

For the docx/pptx use case where the +10.1% lift was measured.

{
  "name": "deck_generation_v1",
  "max_iterations": 4,
  "criteria": [
    {
      "id": "section_coverage",
      "description": "Deck includes Problem, Solution, How It Works, Results, Next Steps as separate slides.",
      "scale": { "type": "binary", "labels": ["fail", "pass"] },
      "required": true,
      "weight": 0.3
    },
    {
      "id": "visual_hierarchy",
      "description": "Each slide has a clear title separated from body content. Body has <5 bullets. No wall-of-text slides.",
      "scale": {
        "type": "ordinal",
        "labels": ["poor", "ok", "great"]
      },
      "min_label": "ok",
      "weight": 0.25
    },
    {
      "id": "data_accuracy",
      "description": "Any numbers, metrics, or quotes match the source brief. No invented stats.",
      "scale": { "type": "binary", "labels": ["fail", "pass"] },
      "required": true,
      "weight": 0.3
    },
    {
      "id": "brand_consistency",
      "description": "Uses the brand color palette and typography described in the system prompt.",
      "scale": { "type": "binary", "labels": ["fail", "pass"] },
      "required": true,
      "weight": 0.15
    }
  ]
}

What Outcomes Can’t Do (Five Honest Limits)

This is the part where most blog posts about new AI tools get vague. The limits matter.

1. Vague rubrics produce “rubric-good, product-bad” outputs. If your criterion says “tone is empathetic,” the worker will learn to add empathy padding to every response, even ones that should be terse. Specific rubric descriptions are not optional.

2. Token cost scales with iterations. Each round is at least 2 calls (worker revision + grader). Setting max_iterations to 10 because “more is better” can 5x your cost while the rubric never passes. Budget conservatively (3-5) and audit failure patterns.

3. Grader blind spots. The grader only sees what you pass it. If a piece of context (a brand voice doc, a legal constraint, a customer history) isn’t in the grader’s input, the grader can’t enforce it. False passes happen here.

4. The latency hit kills interactive UX. Two to four extra model calls are fine for async work (overnight document generation, nightly classifications) but terrible for live chat. Anthropic positions Outcomes for “long-running tasks and asynchronous work” — that’s not marketing, it’s a fitness constraint.

5. LLM-as-judge gaming. Even with separate context, you’re still using a model to judge model output. Underspecified rubrics let the worker learn patterns that satisfy the grader but not actual users (e.g., always adding caveats to look “safer”). Anthropic’s own guidance is “read transcripts, adjust rubrics when grader makes systematic mistakes” — that’s a real ongoing maintenance task, not a one-time setup.

How Outcomes Compares to Other Patterns You Might Be Using

| Approach | Where it runs | How it improves quality | Best for | Weak spots |
| --- | --- | --- | --- | --- |
| Claude Outcomes | Managed Agents harness | Separate grader agent + rubric + bounded loop | Production agents on async tasks, Office docs, structured outputs | Latency, rubric design overhead |
| LangChain self-reflection | Your app | Same agent critiques its own output | Quick adoption, any provider | Critic shares context — rubber-stamping risk |
| AutoGen debate | Your infra | Multiple agents argue and refine | Reasoning diversity, exploratory tasks | Hard to bound cost; not tied to product metrics |
| Manual eval scripts | Offline | Human or model graders score offline | Model selection, regression testing | Not live — can’t self-correct individual runs |

The honest comparison:

  • vs LangChain self-reflection: The same-context-window problem is real. Outcomes’ isolated grader context is the architectural advantage.
  • vs AutoGen debate: Debate is for exploring reasoning paths. Outcomes is for hitting concrete quality bars. Different problems.
  • vs Manual eval scripts: Don’t replace your offline eval suite. Use the same rubric objects in both places — offline for stress-testing and calibration, live as Outcomes for in-flight correction.

What This Means for You

If you’re shipping production agents on Anthropic’s stack: Adopt Outcomes for any task where quality can be specified in a rubric. Start with one workflow (document generation is the sweet spot per Anthropic’s own data). Define a tight rubric. Run a 30-day A/B against the same agent without Outcomes. If you see lift, expand to a second workflow.

If you’re on LangChain or AutoGen and not on Anthropic’s Managed Agents: Don’t migrate just for Outcomes. But do borrow the architecture — a separate grader call with its own context window, against an explicit rubric, with a bounded iteration loop. That pattern transfers to any framework. The reason Outcomes wins isn’t the platform; it’s the architecture.

If you’re early in your agent journey: Skip Outcomes for now. You need to ship something first. Once you have a working agent in production and you’re trying to push reliability from 70% to 90%, that’s when Outcomes pays off. Adding it at the start is premature optimization.

If you’re a product manager evaluating agent products: When a vendor says “our agent has 95% accuracy,” ask them: against what rubric, judged by whom, on what dataset? The Outcomes architecture is becoming the new industry baseline. Vendors that can’t articulate a rubric are selling vibes.

The Bottom Line

The interesting thing about Outcomes isn’t the +10 points — it’s that those points come from architecture, not from a bigger model. Same Claude, same prompts, same base capabilities. The lift comes from putting structure around the work: explicit rubric, isolated grader, bounded loop.

That has implications beyond Anthropic’s stack. The teams that figure out how to specify quality in rubrics will outship the teams that keep tweaking prompts. The teams that treat rubric design as a real engineering discipline — versioned, reviewed, tested offline before going live — will widen the gap.

If you want to actually build production agents (not just experiment with them), our AI Agents Deep Dive course covers the architecture decisions that matter — when to use Outcomes-style grading, when to use Dreaming, when to fall back to simpler patterns, and the deployment gotchas that hit every team in week three.
