AWS warns AI agents can go “flying blind” without a sandbox
Amazon says the real failure is an intent-execution gap between models and the software that runs them in production.

Anoop Deoras, director of applied science for agentic AI at Amazon Web Services, warns that deploying AI agents without guardrails leaves teams “flying blind.” AWS researchers Gaurav Gupta and Vatshank Chaturvedi argue the fix requires rethinking the harness between models and tools, plus sandbox testing before production.
Anoop Deoras, director of applied science for agentic AI at Amazon Web Services, used a phrase that will make any production owner sit up: “we may be flying blind.” He was answering a question about what happens when AI agents are deployed without proper guardrails. Research that AWS released Monday tries to explain why that risk persists, and why the common instinct, “just monitor the model,” may miss the real problem.
Deoras’ core claim is straightforward but unsettling: when agents are left to reason too long without checking the actual environment, they compound errors. The research identifies what the authors call the intent-execution gap, a breakdown at the interface between an AI model and the “software harness” that executes the model’s instructions. In plain English, the AI can form internal assumptions about system state that quietly diverge from reality, then issue commands based on those wrong assumptions, and the longer the chain runs, the further it drifts.
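The drift described above can be illustrated with a toy sketch. It is not taken from the AWS research: the `Environment` class, the rule that names starting with `tmp` silently fail, and the `run_blind` and `run_grounded` functions are all hypothetical, chosen only to show how an agent's belief about system state can diverge from the real state when it never checks.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Hypothetical stand-in for the real system an agent acts on."""
    files: set = field(default_factory=set)

    def create(self, name: str) -> bool:
        # Made-up rule: names starting with "tmp" are rejected silently.
        if name.startswith("tmp"):
            return False
        self.files.add(name)
        return True


def run_blind(env, plan):
    """Agent assumes every step succeeded and never reads real state."""
    believed = set()
    for name in plan:
        env.create(name)
        believed.add(name)        # assumption, not observation
    return believed


def run_grounded(env, plan):
    """Agent checks the environment after each step before updating belief."""
    believed = set()
    for name in plan:
        env.create(name)
        if name in env.files:     # observation from the environment
            believed.add(name)
    return believed


plan = ["a.txt", "tmp1", "b.txt", "tmp2"]

env1 = Environment()
blind_gap = run_blind(env1, plan) - env1.files

env2 = Environment()
grounded_gap = run_grounded(env2, plan) - env2.files

print(sorted(blind_gap))      # ['tmp1', 'tmp2']
print(sorted(grounded_gap))   # []
```

In this sketch the blind agent's gap grows with every silently failed step, while the grounded agent's belief never outruns what the environment reports.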
This focus on “the harness” lands at a notable moment for Amazon, because the company has spent the past year pushing hard for AI adoption, including internal use cases that reportedly ran into trouble. The Financial Times reported that employees were caught running AI agents on hollow, meaningless tasks to climb an employee-built productivity leaderboard called KiroRank. Amazon shut KiroRank down on May 29, and told Fortune it was only in beta mode and used by some employees before it was shut down. Generally, Amazon said it measures token utilization to understand cost and efficiency patterns, but discourages using token utilization to measure developer productivity.
For executives, the KiroRank episode matters less because of any single leaderboard, and more because it illustrates a familiar incentive trap: metrics drift away from the thing they were meant to measure. The AWS researchers, who say their work was undertaken before KiroRank was shut down, argue the issue of gaming metrics is deeper than one internal ranking system. Their paper connects this incentive drift with benchmark gaming across the industry, a practice sometimes called benchmaxing. Benchmaxing, as Deoras distinguishes it, is not burning extra tokens to climb a leaderboard. It is manipulating the structural conditions under which evaluations happen, through factors like inference backend reliability, network bandwidth during software installation, and timeout policy settings. The researchers found these changes can swing results by 5 to 10 percentage points, entirely independent of what the underlying model can actually do.
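To see how a timeout policy alone can move a score, consider a small worked example. The task durations and the two timeout values below are invented for illustration and do not come from the AWS paper; `pass_rate` is a hypothetical helper. The agent and its per-task times are held fixed, and only the evaluation setting changes.

```python
# Hypothetical per-task completion times (seconds) for ONE fixed agent.
# These numbers are invented for illustration, not measured results.
task_seconds = [12, 35, 48, 55, 61, 70, 88, 95, 110, 140,
                150, 165, 170, 175, 200, 240, 310, 330, 400, 620]


def pass_rate(durations, timeout_s):
    """Fraction of tasks counted as solved under a given timeout policy."""
    solved = sum(1 for d in durations if d <= timeout_s)
    return solved / len(durations)


strict = pass_rate(task_seconds, timeout_s=180)
lenient = pass_rate(task_seconds, timeout_s=300)
swing_points = round((lenient - strict) * 100)

print(f"{strict:.0%} {lenient:.0%} {swing_points}")   # 70% 80% 10
```

Nothing about the agent differs between the two runs; the 10-point swing in this toy case comes entirely from the timeout setting, which is why two published scores are only comparable when the evaluation conditions are reported alongside them.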
The message is not subtle: if you evaluate and run systems without stable constraints, the system optimizes for the easiest measurable output, not the real-world goal. Deoras points out that tuning infrastructure conditions to flatter a score will not produce true gains, because real production constraints are different. The broader framing is essentially Goodhart’s Law applied twice in modern AI stacks: first at the organizational level with tokens and benchmarks, and then inside agents themselves with assumptions and execution. And unlike a leaderboard, an agent running blind in production may be harder to shut down the moment it starts drifting.
So what does AWS propose as the fix? Deoras and the researchers argue that the remedy is a sandbox, meaning a controlled environment where agents can test hypotheses, fail safely, and course-correct before they touch production systems. Deoras described this as analogous to responsible software engineering: dev environments and pre-production testing pipelines that exist to catch errors before they reach users. The point is not to replace humans, but to structure the workflow so humans are not drowned in constant corrections. When asked whether the harness is where the human enters the loop, Deoras said “yes and no.” Scientists building agents should understand what goes wrong when agents are deployed. But for consumers, AWS does not want to overwhelm them. In that model, humans set direction, agents execute, and sandboxes catch the errors in between.
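A minimal sketch of the sandbox idea, under stated assumptions: the state is a plain dictionary, the two action types and the checks in `check_invariants` are hypothetical, and `run_in_sandbox` is an illustrative name rather than anything from AWS's framework. The agent's proposed actions run against a deep copy, and the result replaces production state only if every check passes.

```python
import copy


def check_invariants(state):
    """Hypothetical safety checks; a real system would define its own."""
    return state["balance"] >= 0 and len(state["orders"]) <= 3


def apply_action(state, action):
    """Apply one agent-proposed action to a state dict in place."""
    kind, value = action
    if kind == "charge":
        state["balance"] -= value
    elif kind == "order":
        state["orders"].append(value)
    else:
        raise ValueError(f"unknown action: {kind}")


def run_in_sandbox(production, actions):
    """Try actions on a deep copy; promote the copy only if checks pass."""
    sandbox = copy.deepcopy(production)
    try:
        for action in actions:
            apply_action(sandbox, action)
    except ValueError:
        return production, False      # unknown action: production untouched
    if not check_invariants(sandbox):
        return production, False      # failed check: production untouched
    return sandbox, True


prod = {"balance": 100, "orders": []}
prod, ok_good = run_in_sandbox(prod, [("charge", 30), ("order", "widget")])
prod, ok_bad = run_in_sandbox(prod, [("charge", 500)])
print(ok_good, ok_bad, prod)
# True False {'balance': 70, 'orders': ['widget']}
```

The second batch would have driven the balance negative, so it is rejected inside the copy and production keeps the state left by the first batch. A human only needs to look at the rejected batch, not at every step.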
This also serves as a competitive challenge to major model providers. Those companies publish benchmark scores using harnesses optimized for their own models. AWS’ research argues that a model-agnostic harness, built on design principles that work across Claude, GPT, Gemini, and Grok without model-specific tuning, can match or exceed those scores. Deoras said, “Agent performance is really not locked into any single model provider,” opening the door to building applications without being constrained to a particular model. To support the claim, AWS is open-sourcing its framework called Simple Strands Agent, which the researchers say outperformed popular open-source alternatives across three major industry benchmarks.
There is one more strategic implication hiding under the technical details. The research argues that much of the industry’s progress has been brittle because optimizations can overfit to quirks of a specific model version and then evaporate when the model improves. As models get better, those behaviors change, so the gains do not compound. AWS says the industry needs invariant principles that survive model upgrades, engineered into the harness instead of the model itself. Deoras even highlights a discovery that surprised him: despite differences in modeling philosophy, a common invariant property, one that emerges from observability traces, connects these models.
For boards, CIOs, and anyone accountable for deploying AI agents in the real world, this becomes a governance question as much as a technical one. The AWS research points to a new bottleneck in agent rollout: teams are overwhelmed by model switching and re-architecting whenever models upgrade. If your organization is constantly rebuilding agent “harnesses” for each new model, you may be spending effort on the wrong problem, while agents still lack the sandbox and guardrails that prevent blind optimization and unsafe drift. The future Deoras describes is not unchecked autonomy. It is “humans to be in the driver’s seat to direct the work and then take the hands off,” and sandboxes that make that hands-off moment safer than it is today.