On May 6, 2026, Anthropic used its Code with Claude developer conference in San Francisco to unveil "dreaming," a new capability for Claude Managed Agents that lets agents review their past sessions in the background, extract patterns, and improve over time without retraining the underlying model. For businesses running AI in production, this changes what an agent deployment can be expected to do six months after launch.
What Anthropic Actually Shipped at Code with Claude
Anthropic announced three additions to Claude Managed Agents at the conference, per the official Anthropic announcement and same-day reporting from SiliconANGLE on May 6, 2026.
Dreaming. A scheduled background process that reviews an agent's prior sessions and memory stores, extracts recurring patterns and mistakes, and writes them out as plain-text notes and structured playbooks that future sessions can read. The model weights are untouched. Dreaming is in research preview and gated to developers who request access.
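Anthropic has not published the playbook format. As a purely hypothetical sketch of why plain-text, structured notes matter for audit, a curated entry might be modeled like this (every field name here is our invention, not Anthropic's schema):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class PlaybookEntry:
    """One curated learning a dreaming pass might write out.

    Hypothetical schema -- Anthropic has not published the real format.
    """
    pattern: str                  # what the agent observed recurring
    action: str                   # what future sessions should do about it
    source_sessions: list[str] = field(default_factory=list)
    written_on: date = field(default_factory=date.today)

    def to_note(self) -> str:
        """Render as the kind of human-readable note an auditor can review."""
        return f"PATTERN: {self.pattern}\nACTION: {self.action}"

entry = PlaybookEntry(
    pattern="PDF exports from the billing tool lose table borders",
    action="Re-request the export as CSV before parsing",
    source_sessions=["sess-0412", "sess-0415"],
)
print(entry.to_note())
```

The point of the sketch is the `to_note` step: because the output is plain text rather than opaque weights, a compliance reviewer can read exactly what the agent carried forward.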
Outcomes. A specification format that lets developers describe what "good" looks like for a task, including concrete examples. A separate grader agent then evaluates each run against the rubric. Outcomes moved from research preview to public beta on May 6, 2026.
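The outcomes specification format is not publicly documented. As a hedged sketch of the shape of the idea, a rubric paired with a grader, something like the following captures it (the structure, criteria names, and toy grader are all assumptions; in the real product a separate grader agent does the evaluating):

```python
# Hypothetical sketch of an outcomes-style rubric. Not Anthropic's format.
RUBRIC = {
    "task": "Summarize a medical record for claim review",
    "criteria": [
        ("cites_page_numbers", "Every finding cites a source page"),
        ("flags_gaps", "Missing records are flagged, not guessed"),
        ("under_500_words", "Summary stays under 500 words"),
    ],
    "good_example": "Patient seen 2024-03-02 (p. 14): ...",
}

def grade(checks: dict[str, bool]) -> float:
    """Toy grader: averages pre-computed boolean checks against the rubric.
    In the product, a grader *agent* would judge each criterion itself."""
    passed = [checks.get(name, False) for name, _ in RUBRIC["criteria"]]
    return sum(passed) / len(passed)

score = grade({"cites_page_numbers": True, "flags_gaps": True,
               "under_500_words": False})
print(f"score: {score:.2f}")
```

Even this toy version shows why a rubric matters: without a defined pass/fail surface, there is nothing for a self-improvement loop to optimize against.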
Multi-agent orchestration. A primitive that lets a lead agent split a complex task across sub-agents with independent context windows. Also now in public beta, per the Anthropic post.
The three features were paired with a doubling of Claude Code's rate limits for Pro, Max, Team, and Enterprise users, and with a 300-megawatt compute partnership with SpaceX covering the Colossus 1 data center and over 220,000 NVIDIA GPUs. Anthropic also reported API volume up roughly 17x year-on-year on its platform.
No new model was announced. That is the point.
Why Self-Improving Agents Are a Strategic Shift
Most AI deployments today plateau at launch. The agent works, the team celebrates, and six months later the same agent is making the same mistakes because the model has no memory of what worked last quarter. Better prompts and fine-tuning help, but both require human-in-the-loop work, and neither captures the institutional knowledge that accumulates inside a real workflow.
Dreaming addresses that plateau directly. Dianne Penn, who leads product for Anthropic's research team, told VentureBeat that the company's measure of progress is "task horizon," meaning how long an agent can work autonomously before quality degrades. "This time last year, models could work for minutes," Penn said. "Now, most of us have agents running for hours on end."
Our take: The interesting thing about dreaming is not that the agent gets smarter. It is that the smarts compound on a per-deployment basis. Two companies running the same Anthropic model with the same prompts can end up with materially different agent performance after a few months of operation, because dreaming locks in workflow-specific learnings. That makes the agent layer behave more like a system of record than a stateless API call. It also makes it more expensive to migrate off, which is the lock-in story buyers should be paying attention to.
What the Customer Results Actually Tell Us
Anthropic disclosed three customer outcomes alongside the launch, all reported on May 6, 2026.
- Harvey, the legal AI company, reported that task completion rates rose roughly 6x in internal tests of dreaming. The gain came not from a model change but from agents carrying filetype workarounds and tool-specific patterns across sessions.
- Wisedocs, which handles medical document review, cut its document review time by 50% using outcomes.
- Netflix's platform team is using multi-agent orchestration to process build logs across hundreds of pipelines in parallel, surfacing only the patterns worth acting on.
These numbers are vendor-disclosed, so treat them as upper-bound claims for now. The signal that matters is the shape of the results, not the magnitude. A legal workflow improved from cross-session memory. A medical review workflow improved from a graded success criterion. A platform engineering workflow improved from parallel decomposition. Three different unlocks across three industries suggest the underlying primitives are general rather than tailored to a single use case.
We covered the broader pattern, the way most agent deployments stall after a successful pilot, in our piece on why most AI projects stall between pilot and production. Dreaming, outcomes, and multi-agent orchestration are each direct attacks on a specific stall point.
The Governance Question Most Buyers Are Missing
Self-improving agents create a governance surface that does not exist with stateless model calls. When an agent writes a "playbook" during a dreaming run, that playbook is now part of the agent's behavior. It is also a piece of intellectual property, a potential source of bias, and a place where a bad pattern can quietly propagate across future sessions.
Here are three questions buyers should be asking their AI vendors right now:
- Who can read the playbooks? Are the curated memories visible to your team, or only to the model? If they are not human-readable, you have a black-box compounding effect. Anthropic's design uses plain text, which is the right answer for audit, but verify the implementation before signing.
- Who controls the curation? If a customer-support agent dreams a "shortcut" that violates your refund policy, who catches it? The governance posture has to include a review step for new playbook entries in regulated workflows.
- What is the egress story? If you decide to migrate to a different provider in eighteen months, can you export the accumulated playbooks in a usable format, or are they trapped on the platform? Memory portability will become the lock-in fight of 2027.
Our practical AI governance framework outlines how to add these review steps without slowing down deployment.
How to Pilot Dreaming Without Breaking Production
For most teams, the right way to evaluate dreaming and outcomes is not a greenfield rebuild. It is a targeted pilot on a single workflow where the current agent already runs and is known to plateau.
- Pick a workflow with volume. Dreaming compounds across sessions, so it needs sessions. A handful of high-stakes requests a week is the wrong test bed. Hundreds of similar requests is the right one.
- Define outcomes before turning on dreaming. Use the outcomes feature to write a success rubric. Without one, you cannot tell whether dreaming is improving quality or simply locking in a pattern that looks confident.
- Sandbox the playbooks. Run dreaming in a non-production branch first. Read every playbook entry the agent writes for the first two weeks. You are learning what the agent thinks it learned, which is often surprising.
- Instrument the workflow automation layer around the agent. The agent is one node in a larger process. Logging, retries, and human approval gates around it determine whether dreaming compounds your wins or your failures.
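The instrumentation point above can be sketched generically. This is our own wrapper pattern, not an Anthropic API: log every call, retry transient failures with backoff, and gate risky outputs behind human approval, so that when dreaming starts compounding behavior you have a record of what it compounded.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-pilot")

def run_with_guardrails(agent_call, payload, needs_approval, approve,
                        retries: int = 2):
    """Wrap any agent invocation (agent_call is a plain callable) with
    logging, simple retries, and a human approval gate. Illustrative only."""
    for attempt in range(retries + 1):
        try:
            result = agent_call(payload)
            log.info("agent ok on attempt %d", attempt + 1)
            break
        except Exception as exc:            # retry transient failures
            log.warning("attempt %d failed: %s", attempt + 1, exc)
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)        # exponential backoff
    if needs_approval(result) and not approve(result):
        raise PermissionError("human reviewer rejected agent output")
    return result

# Toy usage: flag any refund over $100 for human review.
result = run_with_guardrails(
    agent_call=lambda p: {"refund": p["amount"]},
    payload={"amount": 40},
    needs_approval=lambda r: r["refund"] > 100,
    approve=lambda r: False,   # reviewer stub: would be a real review queue
)
print(result)
```

The approval gate is the piece that matters for dreaming specifically: if the agent "learns" a shortcut that violates policy, the gate is where a human catches it before it propagates.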
We wrote separately about the architecture of agent teams that outperform single AI tools, which is directly relevant given that multi-agent orchestration moved into public beta this week.
What Not to Do
Do not assume dreaming replaces fine-tuning. It does not. Dreaming is a memory and curation layer. If your task requires the model to internalize a vocabulary, style, or reasoning pattern, fine-tuning is still the right tool. The two are complementary.
Do not turn it on across every agent at once. A research preview is a research preview. Start in one workflow, measure honestly, and expand only after you trust the playbooks.
Do not skip the governance conversation. Self-improvement is the feature most likely to surprise your security and compliance teams in 2026. Brief them before deployment, not after.
Key Takeaways
- Anthropic announced dreaming, outcomes, and multi-agent orchestration for Claude Managed Agents at Code with Claude in San Francisco on May 6, 2026.
- Dreaming is in research preview; outcomes and multi-agent orchestration moved to public beta and are available to all developers on the Claude platform.
- Customer results disclosed by Anthropic include Harvey at roughly 6x task completion, Wisedocs at 50% faster document review, and Netflix using orchestration for parallel build log analysis.
- Self-improving agents introduce a new governance surface around memory curation, playbook review, and memory portability that buyers should address in vendor contracts.
- The right pilot is a single high-volume workflow with a defined outcomes rubric, sandboxed playbooks, and observability around the broader automation layer.
The businesses that move early on self-improving AI agents will have a meaningful advantage. If you want to be one of them, let's start with a conversation.