Stanford’s DeLM cuts multi-agent task costs 50% by removing the central orchestrator

A new decentralized framework coordinates agents via shared verified “gists,” improving accuracy and lowering inference cost on benchmarks.

By Hessa Al-Faleh, Business Desk, The Executives Brief

Executive summary

Stanford researchers Yuzhen Mao and Azalia Mirhoseini introduce DeLM, a decentralized language model that coordinates multi-agent work without a central controller. The result: 10.5% better SWE-bench Verified performance and roughly 50% lower cost per task, plus top accuracy on LongBench-v2 Multi-Doc QA.

Stanford researchers Yuzhen Mao and Azalia Mirhoseini are poking a hole in a deeply held assumption about how today’s AI agents should be built: that they need a “boss” at the center. In their new framework, a decentralized language model (DeLM), agents coordinate directly through a shared knowledge base and a task queue, not via a main orchestrator that merges, filters, and rebroadcasts every intermediate result.

And the payoff is not just architectural purity. On SWE-bench Verified, DeLM performed 10.5% better than the strongest baseline and reduced cost per task by roughly 50%. The headline number matters because “agent orchestration” is not free. In centralized systems, every sub-agent’s progress (and every failure) must funnel through one controller, which adds coordination latency and forces repeated communication. DeLM’s bet is that when you let agents write to a shared state instead, you can save inference dollars while also improving outcomes.

So what exactly is the centralized assumption DeLM is attacking? In a typical centralized multi-agent setup, a main agent breaks a task into subtasks, assigns them to parallel sub-agents, waits for responses, merges and summarizes intermediate progress, and then launches the next wave of orders using the controller’s collected context. That pattern scales reasoning in a straightforward way, but Mao and Mirhoseini argue it scales poorly for two reasons.
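For contrast, the centralized pattern described above can be sketched in a few lines of Python. This is an illustrative toy, not code from the paper: the function name `centralized_run` and the callables `decompose`, `work`, and `merge` are hypothetical placeholders for whatever a real controller and its sub-agents would do.

```python
from concurrent.futures import ThreadPoolExecutor


def centralized_run(task, decompose, work, merge, max_waves=3):
    """Toy controller loop: decompose, fan out, wait, merge, repeat.

    All four callables are hypothetical stand-ins for LLM-backed steps.
    """
    context = []  # everything the controller has collected so far
    for _ in range(max_waves):
        subtasks = decompose(task, context)
        if not subtasks:
            break  # the controller decides nothing is left to do
        with ThreadPoolExecutor() as pool:
            # Sub-agents run in parallel, but the controller blocks here
            # until every one of them has reported back.
            results = list(pool.map(lambda s: work(s, context), subtasks))
        # Every result passes through the controller before the next wave.
        context = merge(context, results)
    return context
```

Note that sub-agents never see each other's results directly; they only see what the controller chose to keep in `context` for the following wave.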

First, as the number of subtasks grows, the controller becomes a communication and integration bottleneck. Every useful finding, partial finding, and failure has to be reported back, then interpreted and redistributed. Second, the orchestrator can “dilute, omit, or distort” useful information, leading to lost progress. That distortion risk grows when systems operate in long-context reasoning scenarios, because the controller is trying to compress, group, and cluster evidence before the system fully understands what’s actually relevant or correctly combined.

DeLM is designed to address that bottleneck by replacing centralized merging with shared progress. The framework is built around parallel agents, a shared context, and a task queue. Shared context is described as a curated store of “gists,” which are information summaries that other agents can use. Those gists can include verified and evidence-based findings, partial findings, documented failures, and pointers to detailed evidence agents can pull from when working on specific tasks. Instead of forcing every update through a main agent, DeLM uses a task queue where subsequent pending subtasks can be claimed independently by agents.
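One way to picture the two shared structures is the rough Python sketch below. The class names, field names, and the three `kind` labels are assumptions made for illustration; the article does not describe DeLM's actual schema.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Gist:
    """A short, reusable summary of one result (hypothetical schema)."""
    summary: str
    kind: str                           # assumed labels: "finding", "partial", "failure"
    evidence_ref: Optional[str] = None  # pointer to detailed evidence, if any


@dataclass
class SharedContext:
    """The store of gists that every agent can read."""
    gists: list = field(default_factory=list)

    def add(self, gist: Gist) -> None:
        self.gists.append(gist)

    def failures(self) -> list:
        # Lets an agent check known dead ends before starting work.
        return [g for g in self.gists if g.kind == "failure"]


class TaskQueue:
    """Pending subtasks that agents claim on their own."""

    def __init__(self, tasks):
        self._pending = deque(tasks)

    def claim(self):
        # Returns the next pending subtask, or None when the queue is empty.
        return self._pending.popleft() if self._pending else None
```

This version is single-threaded; a real system with concurrent agents would need locking or an atomic claim operation.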

The workflow is intentionally simple: initialization breaks inputs into work units and adds them to a queue; agents execute in parallel while reading shared context; results are compressed into reusable gists and only fully verified gists are shared with the group. When the queue is emptied, the last agent to return inspects all shared context to determine whether more work is needed, then delivers the final answer. The core mechanism is that agents exchange progress through shared state asynchronously, and claim ready tasks without waiting on a single overloaded controller.
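The workflow above can be approximated with threads and a queue. The sketch below is a simplification under stated assumptions: `run_delm_sketch`, `solve`, and `verify` are hypothetical names, gists are plain strings, and the final step just returns the sorted shared context instead of deciding whether more work is needed.

```python
import queue
import threading


def run_delm_sketch(work_units, solve, verify, num_agents=3):
    """Toy version of the queue-and-shared-context workflow.

    `solve(unit, context)` and `verify(gist)` are hypothetical callables
    standing in for agent reasoning and gist verification.
    """
    # Initialization: put every work unit on the queue.
    tasks = queue.Queue()
    for unit in work_units:
        tasks.put(unit)

    shared = []                 # verified gists visible to all agents
    lock = threading.Lock()
    active = [num_agents]       # agents that have not yet returned
    final = {}

    def agent():
        while True:
            try:
                unit = tasks.get_nowait()   # claim a task independently
            except queue.Empty:
                break
            with lock:
                snapshot = list(shared)     # read the shared context
            gist = solve(unit, snapshot)
            if verify(gist):                # only verified gists are shared
                with lock:
                    shared.append(gist)
        with lock:
            active[0] -= 1
            if active[0] == 0:
                # The last agent to return inspects the whole shared context.
                final["answer"] = sorted(shared)

    threads = [threading.Thread(target=agent) for _ in range(num_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return final["answer"]
```

No agent waits on a controller here: each one pulls the next task as soon as it is free, and the only coordination point is the short lock around the shared list.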

This changes how cost and accuracy behave as problems get bigger. The researchers argue DeLM can avoid redundant exploration and reuse discoveries and failures, so agents can focus on unresolved issues rather than repeatedly walking into the same dead ends. They also frame DeLM as especially relevant for software engineering test-time scaling, where models are given time to “think” to improve reasoning and problem-solving. In one example highlighted in the source, concurrent debugging becomes feasible: multiple agents explore hypotheses and reasoning paths in parallel while still sharing intermediate progress.

DeLM also targets long-context reasoning and multi-document question-answering. The point is that agents can simultaneously examine their own evidence clusters, while still maintaining a “global compact view” of accumulated evidence. But the framework’s shared context does not blindly dump everything into the shared space. It supports unfoldable gists, where agents see short summaries by default but can choose to “unfold” into more detailed summaries and raw evidence. That matters because there is a trade-off between sharing enough detail to stay accurate and flooding context windows so aggressively that coordination itself becomes another long-context bottleneck. As Mao said in the paper, if agents shared full traces, each worker would need to read long command histories, file dumps, failed edits, and intermediate reasoning. If they share only compact summaries, important details could be lost. Unfolding is presented as a coarse-to-fine, opt-in access strategy.
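The coarse-to-fine idea can be illustrated with a small Python sketch. The three-level layout, the class name `UnfoldableGist`, and the helper `build_context` are assumptions for illustration only; the article does not specify how DeLM stores or unfolds gists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnfoldableGist:
    """One gist at three levels of detail (hypothetical layout)."""
    summary: str        # level 0: short summary, shown by default
    detail: str         # level 1: more detailed summary
    raw_evidence: str   # level 2: raw evidence

    def view(self, level: int = 0) -> str:
        levels = (self.summary, self.detail, self.raw_evidence)
        # Clamp out-of-range requests to the nearest valid level.
        return levels[min(max(level, 0), 2)]


def build_context(gists, unfold=None):
    """Assemble an agent's view of the shared context.

    `unfold` maps a gist's index to the level that agent asked for;
    any gist not listed stays folded at its short summary.
    """
    unfold = unfold or {}
    return "\n".join(g.view(unfold.get(i, 0)) for i, g in enumerate(gists))
```

An agent working on a related subtask might unfold only the one gist it cares about, so its context grows by that gist's evidence and nothing else.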

Now let’s cash out the benchmark claims. DeLM’s performance on SWE-bench Verified, which evaluates how well AI models and agents solve real-world software engineering problems, is described as 10.5% better than the strongest baseline. It also reduced cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, which assesses long-context LLM ability on real-world multi-document tasks, DeLM had the highest accuracy across four model families: GPT-5.4, Claude Sonnet, Gemini Flash, and DeepSeek-V4-Pro. The implication for builders is direct: DeLM is not only decentralized in theory, it is also cheaper and more accurate in practice across task types and model families.

For enterprise builders, DeLM challenges a core procurement and design assumption: that every multi-agent workflow needs a central controller. If performance holds while costs drop and accuracy rises, the architecture you choose will show up in your unit economics. That becomes a board-level conversation too, because agentic systems affect inference spend, latency, and operational risk. There is also a second-order implication: if decentralized coordination reduces lost progress and avoids repeated reading of the same documents or re-running failed analysis, then your systems may become more robust under pressure, not just more impressive in demos.

In a world where inference budgets are tightening and teams are being asked to do more with less, DeLM is a reminder that the fastest way to scale agentic work might not be “add more agents,” it might be “stop routing everything through one choke point.”
