Claude's 'Dreaming' Made Harvey's Agents 6x Better

Anthropic shipped 'Dreaming' for Claude Managed Agents in research preview. Harvey hit 6x completion rates. The 4-question Q3 routing gate.

The headline number is unusual: 6x. Not 60% better, not double — six times the task completion rate. That’s what legal-AI company Harvey reported after turning on Dreaming, a feature Anthropic shipped to Claude Managed Agents on May 6.

A 6x number this clean usually means one of three things. Either the baseline was very low and almost any improvement looked dramatic. Or the metric is being measured generously. Or the new feature is doing something legitimately different from what existed before.

In Harvey’s case, it’s the third. Dreaming is a between-session memory-curation process. While the agent isn’t actively running a task, the system reads back through its past sessions, looks at what got done, what failed, what patterns repeated, what shortcuts worked, which file-handling hack got rediscovered for the seventh time — and writes those patterns back into the agent’s memory store as compact, retrievable insights. Anthropic calls it “governed self-improvement,” which is a polite way of saying: the model is allowed to update its own working memory, but only the parts you let it touch.

If you run a platform team, an AI engineering team, or any team that has agents in production, the question your CTO is going to ask you on Monday is: should we request research-preview access this week, or hold for general availability? And if we turn it on, what does Harvey’s 6x actually translate to for our shape of work?

Anthropic’s May 6 blog post, “New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration,” is the four-feature drop that positions dreaming as the headline research preview alongside three public-beta companions. (Source: Claude Blog)

Here’s the read.

What “dreaming” actually is, in production terms

The name is poetic; the implementation is mechanical. Three things happen during a dreaming pass:

It reviews past agent sessions. Whatever the agent did over the last N sessions — tool calls, intermediate reasoning, final outputs, error states — gets read by a separate process whose only job is to look for patterns. It is not the agent itself examining its own thoughts in real time. It’s a curator pass that happens between active runs.

It writes consolidated insights back to the agent’s persistent memory store. Recurring mistakes get tagged (“don’t try to read PDFs with the .doc handler again”). Workflows that multiple agent sessions converged on independently get promoted (“when summarizing legal briefs, lead with the holding, then procedural posture, then facts”). Style preferences get codified (“the team formats date references as YYYY-MM-DD, not ‘May 9, 2026’”). Stale or contradicted memory entries get pruned.

It surfaces a diff for inspection. Anthropic’s framing is explicit: developers can review what dreaming proposes to add, modify, or remove from memory before it commits. You can also choose to let it auto-apply. The control plane is yours.

What it is not: continuous self-modification, autonomous re-architecting of the agent’s instructions, or anything that touches model weights. It’s strictly about the agent’s working memory — the persistent notes the agent reads at the start of each new session.
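To make the shape concrete, here is a minimal sketch of what a dreaming-style pass could look like if you built one yourself. Everything in it is an assumption: Anthropic hasn’t published the session schema, the promotion threshold, or the diff format, so Session, MemoryDiff, and consolidate below are invented stand-ins for whatever your stack actually records.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical shapes; the real Claude Managed Agents schema is not public.
@dataclass
class Session:
    task: str
    errors: list[str]       # root-cause tags, e.g. "pdf-read-with-doc-handler"
    workarounds: list[str]  # things that worked, e.g. "extract-pdf-via-ocr"

@dataclass
class MemoryDiff:
    add: list[str] = field(default_factory=list)
    remove: list[str] = field(default_factory=list)

def consolidate(sessions: list[Session], memory: set[str],
                min_repeats: int = 3) -> MemoryDiff:
    """Between-session pass: find patterns that recur across sessions
    and propose them as memory entries, without committing anything."""
    diff = MemoryDiff()
    errors = Counter(e for s in sessions for e in s.errors)
    fixes = Counter(w for s in sessions for w in s.workarounds)

    # Promote recurring mistakes to "don't do this again" entries.
    for err, n in errors.items():
        if n >= min_repeats and f"avoid: {err}" not in memory:
            diff.add.append(f"avoid: {err}")

    # Promote workarounds that multiple sessions converged on independently.
    for fix, n in fixes.items():
        if n >= min_repeats and f"prefer: {fix}" not in memory:
            diff.add.append(f"prefer: {fix}")

    # A real pass would also prune stale or contradicted entries into
    # diff.remove; omitted here to keep the sketch short.
    return diff
```

The property worth copying is that the pass proposes a diff, and nothing commits until you (or your auto-apply policy) say so. That separation is the “governed” part of governed self-improvement.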

Three sibling features shipped the same day, and you should keep them straight. Outcomes is a goal-conditioned execution mode where the agent works toward a rubric and a separate grader (running in its own context) decides when the work is acceptable. Multi-agent orchestration lets a lead agent break a task into sub-tasks and dispatch them to sub-agents with isolated contexts. Memory is now in public beta. Of the four, dreaming is the only one in research preview — meaning you have to request access.
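Outcomes is the sibling most teams will pair with dreaming, and its loop shape is worth seeing once. Another sketch under the same caveat, every name invented: the worker drafts, and a grader running in its own context decides when the rubric is satisfied.

```python
def run_with_outcome(agent, grader, task: str, rubric: str,
                     max_rounds: int = 5) -> str:
    """Goal-conditioned loop: a separate grader, not the worker itself,
    decides when the output is acceptable."""
    draft = agent.attempt(task)                # hypothetical worker interface
    for _ in range(max_rounds):
        verdict = grader.judge(draft, rubric)  # grader runs in its own context
        if verdict.acceptable:
            return draft
        draft = agent.attempt(task, feedback=verdict.notes)
    return draft  # best effort after max_rounds
```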

Simon Willison’s live blog of Code w/ Claude 2026 on May 6 is the in-person dev-press account of the dreaming launch, including a talk-by-talk capture of how Anthropic positioned the four-feature drop. (Source: Simon Willison’s live blog)

What Harvey’s 6x is, and isn’t

Harvey is a legal-AI platform whose agents handle long-form drafting (briefs, motions, M&A documents), document creation across formats, and research workflows. Anthropic’s case-study language is specific: agents started “remembering filetype workarounds and tool-specific patterns” between sessions. That’s the consolidation working: the agent that had to re-discover the right way to extract text from a scanned PDF on Monday now has that knowledge written down on Tuesday.

The 6x number is almost certainly an upper bound for one specific reason. Harvey’s workload has the three preconditions that make dreaming pay off most:

Repeated workflow patterns. Agents work on similar matter types repeatedly. The same kinds of corrections and shortcuts surface across hundreds of sessions. That’s the dreaming consolidator’s favorite environment.

High repeated-mistake rate at baseline. A frontier-model agent operating in a niche domain (legal) without memory will re-make the same domain-specific mistakes over and over — wrong file format, wrong citation style, wrong document structure. Each one is a completion failure. When dreaming kills 80% of the repeats, completion goes up sharply.

Long-running session structures. Harvey’s agents work on matters that span days or weeks. Memory persistence across that time horizon is high-value because the agent gets multiple “wake up” cycles to apply what dreaming consolidated.

If your team’s agent stack doesn’t share all three, expect a smaller multiplier. The honest distribution most platform teams should plan for:

  • 1.5x to 3x completion-rate improvement on typical eng-team agent stacks (coding loops, eval runners, ticket triage, support deflection) where workflows repeat enough that memory consolidation has signal but the agent isn’t operating in a niche-domain bubble like legal.
  • 30% to 60% cost-per-completion reduction. This is sometimes more interesting than completion rate — the same number of completions but with fewer retries and shorter chains because the agent stops re-discovering what it should already know. The quick arithmetic after this list makes the point concrete.
  • Less than 1.2x improvement on stateless or near-stateless workloads — code review on isolated PRs, one-shot generation tasks, anything where the agent’s “memory” was never doing meaningful work to begin with.
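To see why the cost line can sometimes matter more than the completion line, run the arithmetic on an invented but representative workload:

```python
# Hypothetical numbers for illustration only.
runs = 100
baseline_completions = 30   # 30% completion rate
baseline_cost = 50.0        # dollars, inflated by retries and re-discovery chains
print(baseline_cost / baseline_completions)  # $1.67 per completion

# After dreaming: 2x completions, and each run is cheaper because the
# agent skips the re-discovery steps it now has written down.
dream_completions = 60
dream_cost = 45.0
print(dream_cost / dream_completions)        # $0.75 per completion, ~55% lower
```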

If your team’s number lands in the 1.5x-3x band after a 5-day pilot, you’re getting Harvey’s headline benefit at your team’s actual scale. That’s the right expectation to set with leadership.

The 4-question Q3 routing gate

Before requesting research-preview access — and definitely before re-architecting your agent stack — run these four questions in this order.

1. Does your agent stack actually have a memory layer today?

A surprising number of “agents” in production today are stateless prompt chains with a router and some tool calls. Dreaming consolidates persistent memory; if you don’t have persistent memory, dreaming has nothing to consolidate.

Specifically: does each agent session start by reading from a memory store (a vector DB, a key-value store, a structured JSON sidecar) that was written to by a previous session? If yes, dreaming has a job. If no, your Q3 task is to build the memory layer first; dreaming is the next thing after that.

The fastest test: run one of your agents twice on the same task family with a 24-hour gap. Does it learn anything from run 1 that affects run 2? If “no, it does the task fresh both times,” you don’t have a memory layer to consolidate.
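If you want that test as a script, here is a minimal sketch, assuming the memory store is a JSON sidecar file and the agent exposes a run() call; both are stand-ins for your own stack.

```python
import pathlib

MEMORY = pathlib.Path("agent_memory.json")  # hypothetical sidecar store

def leaves_memory_footprint(agent, task: str) -> bool:
    """Cheap version of the two-run test: does a single run write anything
    a later session could read? (The full test also re-runs the task a day
    later and checks whether run 2 behaves differently.)"""
    before = MEMORY.read_text() if MEMORY.exists() else ""
    agent.run(task)                           # hypothetical agent interface
    after = MEMORY.read_text() if MEMORY.exists() else ""
    return after != before

# False means there is nothing for a dreaming pass to consolidate:
# build the memory layer before requesting research-preview access.
```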

2. What’s your agent’s repeated-mistake rate?

Pull eval-suite traces from the last 30 days. Group failures by root cause. What percentage are repeats — same error type, same trigger pattern, same workaround?

  • Under 20% repeat rate: Dreaming gives marginal gains. You’ll move from 80% completion to maybe 84%. Worth doing, but not the right Q3 priority.
  • 20-40% repeat rate: Dreaming gives meaningful gains. Plan for 1.5x-2x completion-rate improvement. Worth a serious pilot.
  • Over 40% repeat rate: Dreaming compounds with the May 6 Tier-1 Opus rate-limit boost into a 2-3x effective cost-per-run improvement. This is the band where the multiplier is large enough to justify a partial re-architecture if needed.

If you don’t have eval-suite traces with failure-cause tagging today, building that capture is the prerequisite to using dreaming at all. Without the trace data, you can’t measure whether dreaming helped.
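Once the tagging exists, the repeat rate is a one-pass computation. A sketch assuming a JSONL trace file where each row carries status and root_cause fields; swap in your own schema.

```python
import json
from collections import Counter

def repeat_rate(trace_path: str) -> float:
    """Share of failures whose root-cause tag recurs in the window.
    Over 40% puts you in the strongest dreaming band."""
    causes = []
    with open(trace_path) as f:
        for line in f:
            row = json.loads(line)
            if row.get("status") == "failure" and row.get("root_cause"):
                causes.append(row["root_cause"])
    if not causes:
        return 0.0
    counts = Counter(causes)
    repeats = sum(n for n in counts.values() if n > 1)
    return repeats / len(causes)

# print(f"{repeat_rate('eval_traces_30d.jsonl'):.0%}")  # file name is yours
```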

3. Are you Anthropic-locked or model-portable?

Dreaming is Claude-Managed-Agents-specific. There is no equivalent feature in Claude Code today (though the community-built grandamenium/dream-skill GitHub repo is a manual approximation), and no equivalent in OpenAI’s Agent Builder or Google’s Gemini Agent Framework as of this writing.

If your team’s stack runs through a model-routing layer that switches between Claude, GPT-5.5, and Gemini based on cost or task fit, dreaming locks you into the Claude path for the agents that benefit from it. That’s not necessarily wrong — but it’s a strategic decision, not a tactical one.

If your team is single-vendor on Claude already (Claude Code shop, Claude API for production agents, Claude Managed Agents for orchestration), dreaming has no portability cost to you.

If your team is deliberately multi-vendor for resilience or cost reasons, plan for two scenarios: (a) the Claude-routed agents get dreaming; (b) the others wait for an open-source equivalent (likely 6-12 months away based on the rate at which OSS frameworks have replicated Anthropic’s patterns historically — MCP took about 8 months from launch to mature ecosystem coverage).

4. Can you accept governance-by-developer-review at scale?

Dreaming surfaces a diff. Someone has to review the diffs that matter — particularly for production-critical agents where a wrongly-promoted memory entry could change behavior in ways that cost real money or break compliance.

If you have 1-3 agents in production, developer review of every dreaming diff is fine. If you have 20+ agents in production, the diff review itself becomes a Q3 ops job. Who owns it? What’s the SLA?

The auto-apply mode exists for a reason — most teams will end up using it for the bulk of their agents and reserving manual review for the high-stakes ones. Have that policy in writing before you ship dreaming to production.
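The written policy can be as small as a routing table. A sketch with the tiers and rules invented for illustration, reusing the MemoryDiff shape from the earlier sketch.

```python
# Hypothetical policy: which dreaming diffs auto-apply and which queue
# for human review. The tiers and rules are yours, not Anthropic's.
REVIEW_POLICY = {
    "prod-critical": "manual",             # compliance- or revenue-touching
    "prod-standard": "manual-if-removals",
    "internal":      "auto",               # eval runners, triage bots, etc.
}

def route_diff(agent_tier: str, diff) -> str:
    rule = REVIEW_POLICY.get(agent_tier, "manual")  # default to the safe path
    if rule == "auto":
        return "auto-apply"
    if rule == "manual-if-removals" and not diff.remove:
        return "auto-apply"  # additions are lower-risk than prunes
    return "queue-for-review"  # the owner and SLA live in your ops runbook
```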

The 3 “request access this week” patterns

Three workload shapes where the answer is “request access today and pilot in a 5-day window.”

Long-running coding-agent loops. Cursor, Cline, Aider, or your in-house equivalent running on Claude Managed Agents — agents that revisit the same codebase across days, learn the codebase’s conventions, and accumulate “this file has weird import structure” or “the test suite hangs if you run more than 4 in parallel” knowledge. The codebase-specific consolidation is exactly what dreaming was built for.

Eval-suite runners that re-test the same scenario set across model versions. Your eval harness re-runs the same 200 scenarios every time you bump a model version or change a system prompt. Dreaming can consolidate “scenario 47 is flaky for non-feature reasons” and “scenario 113’s expected output drifted in the last 30 days” — pattern knowledge that takes a human-in-the-loop hours to maintain.

Customer-support deflection agents handling the same FAQ patterns repeatedly. The agent answers the same 50-question shape thousands of times. Dreaming consolidates “the right answer to category X has shifted in the last 30 days” or “the new product launch made FAQ #14 obsolete” without your team needing to manually retrain.

The 2 “hold for GA” patterns

Two workload shapes where the answer is “wait for general availability” — even if the rest of your stack would benefit.

Multi-tenant SaaS agents serving distinct customer cohorts. If your agents serve Customer A and Customer B and the memory consolidation could cross-contaminate (an insight learned from Customer A’s data informing Customer B’s outputs), the governance burden is too high for a research-preview product. Wait for Anthropic to publish the multi-tenant isolation guarantees explicitly.

Real-time / voice-front-end agents. Dreaming is a between-session process. Voice agents and real-time interactive agents don’t have meaningful idle windows — the next session starts seconds after the last one ends. The consolidation pass either never gets a window to run in or gets one too short to do meaningful work. Build voice agents on outcomes + multi-agent orchestration; come back to dreaming when the session pattern is more batch-like.

What this can’t fix

Dreaming is a memory consolidation pass. It is not a solution for:

Bad initial agent design. If your agent’s tool routing is wrong, or its system prompt is contradictory, or its task decomposition makes no sense, dreaming will just consolidate the dysfunction. Garbage memory in, garbage consolidation out. Fix the agent first.

Hallucination in the underlying model. Memory consolidation does not change the base model’s tendency to make things up when uncertain. If your agent fabricates citations 5% of the time, dreaming might consolidate the pattern “this user wants confident answers” and the hallucination rate will go up.

Agents without persistent memory stores. Already noted in question 1, but worth repeating: this is the most common reason teams will be disappointed by dreaming in pilots. Build memory first.

Cost optimization for stateless workloads. If your agents are answering one-shot questions with no continuity, the May 6 Tier-1 Opus rate-limit boost matters more for your stack than dreaming does. The two compound for memory-heavy workloads; for stateless ones, only the rate-limit change applies.

The 4 signals to watch for the next 30 days

Anthropic’s first dreaming-GA timeline disclosure. The research preview is the canary. Anthropic typically moves features to public beta within 60-90 days when reception is positive. Watch the Code with Claude London event on May 19 for the first hint.

Reddit r/ClaudeAI and r/Anthropic 7-day production deployment reports. The community will publish deployment retrospectives starting Day 7 of access. Look for the workload-specific multipliers — those are your honest expected-value reference, not Anthropic’s case study.

The OSS LangGraph / CrewAI / AutoGen dreaming-equivalent emergence. Whichever framework ships first will define the “model-portable” version of this pattern. Whoever ships second will likely be best — the first version will get the architecture wrong.

OpenAI’s likely “Memory v2” counter-launch. OpenAI has telegraphed a memory roadmap throughout Q1-Q2. A Q3 OpenAI counter-launch is high probability. The shape of that launch will tell you whether the memory layer is becoming a category or whether it’s a Claude-specific differentiator.

The bottom line

Dreaming is a real production pattern, not a marketing flourish. The Harvey 6x is an upper bound — your team’s number will likely land between 1.5x and 3x on completion rate, with cost-per-completion reductions in the 30-60% band, on workloads that have repeated patterns and persistent memory.

If your team passes the 4-question gate, request research-preview access this week and run a 5-day pilot with one production-tier agent and a clean A/B (dreaming-on / dreaming-off) on a 5-scenario eval suite. Decide on Day 6 whether to expand or hold.
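The pilot harness itself is small. A sketch of the A/B loop, assuming you can construct the same agent with the feature on or off; the factory function and the flag are stand-ins.

```python
def pilot(make_agent, scenarios: list[str]) -> dict[str, float]:
    """Clean A/B: run the same scenario suite through a dreaming-on agent
    and a dreaming-off agent, then compare completion rates on Day 6."""
    results = {}
    for arm, flag in (("dreaming-on", True), ("dreaming-off", False)):
        agent = make_agent(dreaming=flag)   # hypothetical factory and flag
        done = sum(1 for s in scenarios if agent.run(s).completed)
        results[arm] = done / len(scenarios)
    return results

# Usage, with make_agent and the 5-scenario suite supplied by your stack:
#   ratio = pilot(make_agent, scenarios)
# Expand if dreaming-on divided by dreaming-off lands in the 1.5x-3x band;
# hold if it comes in under ~1.2x.
```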

If your team fails any of the four questions, your Q3 has a more important task than dreaming: build the persistent memory layer, capture eval-suite traces with failure-cause tagging, decide your model-portability stance, or write the diff-review SLA. Dreaming is downstream of those decisions.

The next 30 days will tell us whether the 6x was a one-off Harvey-shaped artifact or whether the median platform team can repeat the result. If you want a deeper play-by-play on shipping production-grade agent loops with Claude — including the memory layer, the eval harness, and the routing decisions — our agents deep-dive course walks through the full stack.
