Amazon’s Bryan Silverthorn targets “trustworthy” AI agents by ditching static eval scores
At VB Transform 2026, Amazon’s framework reframes reliability as consistency, robustness, predictability, and safety, with sandboxed review gates.

Bryan Silverthorn, director of the AGI Autonomy research lab at Amazon, will present Amazon’s framework for engineering trustworthy AI agents at VB Transform 2026. The approach pushes beyond single-agent performance benchmarks toward decoupled, human-reviewed interactions and multi-tool architectures that can self-correct.
AI agents are getting better at doing real work on their own. But permissioning them to touch enterprise systems is where projects stall. At VB Transform 2026, Amazon’s Bryan Silverthorn, director of the AGI Autonomy research lab, is putting a stake in the ground: the industry has been measuring “performance” in ways that do not actually answer the question that CIOs and security teams care about, namely whether an agent will behave reliably across the messiness of the real world.
Silverthorn told VentureBeat that common industry standards lean on eval scores, which are a static snapshot of performance. Those scores do not capture overall reliability, meaning how predictably an agent behaves across prompts, environments, and input types. In other words, an agent can look good in an evaluation suite and still act unpredictably when the prompt style changes, the system conditions differ, or edge-case inputs show up. That mismatch is the “trust gap” Amazon is trying to close with a structured framework that focuses on consistency, robustness, predictability, and safety.
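To make the snapshot-versus-reliability distinction concrete, here is a minimal Python sketch. It is an illustration only, not Amazon's framework: the `toy_agent`, the invoice prompts, and both metric functions are hypothetical, and "consistency" is assumed here to mean the share of paraphrased prompts that produce the agent's most common answer.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

Agent = Callable[[str], str]


def eval_score(agent: Agent, cases: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) pairs answered correctly: a static snapshot."""
    correct = sum(1 for prompt, expected in cases if agent(prompt) == expected)
    return correct / len(cases)


def consistency(agent: Agent, variants: Sequence[str]) -> float:
    """Share of paraphrased prompts that yield the agent's most common answer."""
    answers = [agent(v) for v in variants]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


# Hypothetical agent that is sensitive to prompt phrasing (assumption for illustration).
def toy_agent(prompt: str) -> str:
    return "approve" if prompt.startswith("Should we") else "reject"


# The benchmark happens to use only one phrasing of the task.
benchmark = [("Should we pay invoice 17?", "approve")]

# The same question, asked four ways; the expected answer is always "approve".
variants = [
    "Should we pay invoice 17?",
    "Is invoice 17 OK to pay?",
    "invoice 17: pay or not?",
    "Pay invoice 17, yes or no?",
]

snapshot = eval_score(toy_agent, benchmark)
agreement = consistency(toy_agent, variants)
variant_accuracy = eval_score(toy_agent, [(v, "approve") for v in variants])

print(snapshot, agreement, variant_accuracy)
```

In this toy setup the agent scores perfectly on the benchmark, yet it gives the expected answer for only one of the four phrasings. That gap between the snapshot and behavior across prompt styles is the kind of mismatch the article describes.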
So what is Amazon actually moving toward? Rather than assuming that models can be harnessed into safety, Silverthorn described an emphasis on decoupled systems. The key idea is to separate the agent’s proposal from the moment it gains authority. Instead of granting direct access and hoping guardrails will be enough, Amazon’s approach uses sandboxed environments where agents propose changes that are reviewed by humans before implementation. That review gate is especially important in sensitive domains like finance, where the cost of an agent going wrong is not theoretical. The goal is verifiable interactions, meaning you can inspect what the agent wants to do and not just trust that it “probably” should do it.
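The propose-then-review pattern described above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not Amazon's implementation: `ReviewGate`, `Proposal`, the `payment_limit` setting, and the reviewer name are all hypothetical, and an in-memory dictionary stands in for the real system the agent wants to change.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    REJECTED = "rejected"
    APPLIED = "applied"


@dataclass
class Proposal:
    agent_id: str
    change: dict
    status: Status = Status.PROPOSED
    reviewer: Optional[str] = None


@dataclass
class ReviewGate:
    """Hypothetical gate: agents can only propose; applying requires human approval."""
    ledger: dict = field(default_factory=dict)      # stand-in for the real system
    audit_log: list = field(default_factory=list)   # what was proposed, reviewed, applied

    def propose(self, agent_id: str, change: dict) -> Proposal:
        proposal = Proposal(agent_id, dict(change))
        self.audit_log.append(("proposed", agent_id, proposal.change))
        return proposal

    def review(self, proposal: Proposal, reviewer: str, approve: bool) -> None:
        if proposal.status is not Status.PROPOSED:
            raise ValueError("proposal has already been reviewed")
        proposal.status = Status.APPROVED if approve else Status.REJECTED
        proposal.reviewer = reviewer
        self.audit_log.append((proposal.status.value, reviewer, proposal.change))

    def apply(self, proposal: Proposal) -> None:
        # Authority is granted here, and only after a human approved the change.
        if proposal.status is not Status.APPROVED:
            raise PermissionError("change has not been approved by a human reviewer")
        self.ledger.update(proposal.change)
        proposal.status = Status.APPLIED
        self.audit_log.append(("applied", proposal.reviewer, proposal.change))


gate = ReviewGate(ledger={"payment_limit": 100})
proposal = gate.propose("agent-1", {"payment_limit": 500})

try:
    gate.apply(proposal)        # the agent cannot apply its own proposal
    blocked = False
except PermissionError:
    blocked = True

gate.review(proposal, reviewer="alice", approve=True)
gate.apply(proposal)            # succeeds only after human approval
```

The design choice worth noting is that the proposal is plain data that can be inspected before anything changes, and every step leaves a record in `audit_log`.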
This matters because the boardroom question is not just “Can it do the task?” It is “Can we govern it?” VentureBeat’s Q2 Pulse Research survey of over 100 senior technology leaders and buyers found that just 4% said they are comfortable relying on model guardrails alone. When respondents were asked what worries them most about model guardrails, 40% cited unauthorized access to tools or data, while 27% pointed to prompt manipulation or injection. Those concerns map directly to why a static eval score can be a false comfort. Guardrails that look fine under testing conditions may not address how an agent behaves when it is confronted with adversarial prompts or when it attempts to operate across tools under imperfect real-world constraints.
At VB Transform, Silverthorn’s session is titled “Closing the capability-reliability gap: Inside Amazon’s framework for engineering trustworthy agents.” He will share details of Amazon’s approach to trustworthy agentic AI and how companies can move from single-agent wrappers to multi-tool architectures. The difference is more than architectural plumbing. Multi-tool architectures are designed to let agents self-correct mid-execution, which is an operational path to reliability. Instead of treating the model as a single shot, you build systems where the agent can revise course as it gathers information from multiple tools and steps, and where safety controls can stay in the loop before irreversible actions happen.
There is also a regulatory undertone to everything here, even though this specific interview is focused on engineering. Across industries, regulators tend to push for more than “best effort.” They want traceability, accountability, and controls that can be explained. A decoupled design with sandboxed review can make audits less hand-wavy, because it creates artifacts for oversight: what was proposed, what was reviewed, and what was implemented. That is the kind of structure that tends to survive procurement, security reviews, and policy scrutiny better than “trust the model” narratives.
As a second-order effect, the shift from eval-centric metrics to reliability frameworks can change how leadership teams prioritize investment. If reliability is measured by consistency, robustness, predictability, and safety, then teams may need to fund evaluation redesign, new sandbox workflows, human-in-the-loop operations, and multi-tool orchestration. That is not free. But it may be cheaper than the alternative: pilots that stall during permissioning, where the business delivers a demo and IT then blocks production.
Meanwhile, the conference is also signaling that this is not only an Amazon problem. Another agentic ops and evals-focused session at VentureBeat’s flagship conference, happening July 14 and 15 in Menlo Park, will feature Manasi Joshi, director of systems intelligence and machine learning at Waymo. Her session is titled “Intelligence at scale: How Waymo builds safe, efficient AI for the physical world.” The throughline across these topics is safety at scale, not just capability at scale.
For decision-makers building or buying agentic systems, Silverthorn’s core message is a pressure test: “trustworthy” is not a vibe, and it is not a single score. It is a system property built through measurement that reflects real conditions, architecture that limits authority, and workflows that make risky actions reviewable. If you are trying to get from sandbox to enterprise permissions, the difference between a wrapper and a governed multi-tool architecture could be the difference between a model that performs and an agent the organization can safely deploy.