Amazon’s Bryan Silverthorn targets “trustworthy” AI agents by ditching static eval scores

At VB Transform 2026, Amazon’s framework reframes reliability as consistency, robustness, predictability, and safety, with sandboxed review gates.

By Omar Al-Balawi, Technology Correspondent, The Executives Brief

Executive summary

Bryan Silverthorn, director of the AGI Autonomy research lab at Amazon, will present Amazon’s framework for engineering trustworthy AI agents at VB Transform 2026. The approach pushes beyond single-agent performance benchmarks toward decoupled, human-reviewed interactions and multi-tool architectures that can self-correct.

AI agents are getting better at doing real work on their own. But permissioning them to touch enterprise systems is where projects stall. At VB Transform 2026, Amazon’s Bryan Silverthorn, director of the AGI Autonomy research lab, is putting a stake in the ground: the industry has been measuring “performance” in ways that do not actually answer the question that CIOs and security teams care about, namely whether an agent will behave reliably across the messiness of the real world.

Silverthorn told VentureBeat that common industry standards lean on eval scores, which are a static snapshot of performance. Those scores do not capture overall reliability, meaning predictability across prompts, environments, and input types. In other words, an agent can look good in an evaluation suite and still act unpredictably when the prompt style changes, the system conditions differ, or edge-case inputs show up. That mismatch is the “trust gap” Amazon is trying to close with a structured framework that focuses on consistency, robustness, predictability, and safety.
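To make the distinction concrete, here is a minimal Python sketch contrasting a single static score with a consistency check across paraphrased prompts. It is an illustration only, not Amazon's framework: `toy_agent`, `static_score`, and `consistency` are hypothetical names, and the toy agent stands in for a real model call.

```python
from collections import Counter
from typing import Callable, Sequence


def static_score(agent: Callable[[str], str],
                 cases: Sequence[tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) cases answered correctly: one snapshot."""
    return sum(agent(prompt) == expected for prompt, expected in cases) / len(cases)


def consistency(agent: Callable[[str], str], variants: Sequence[str]) -> float:
    """Share of paraphrased prompts that produce the most common answer."""
    answers = Counter(agent(v) for v in variants)
    return answers.most_common(1)[0][1] / len(variants)


# Hypothetical toy agent that is brittle to phrasing (assumption, not a real model).
def toy_agent(prompt: str) -> str:
    return "approve" if prompt.startswith("Please") else "deny"


cases = [("Please refund order 17", "approve")]
variants = [
    "Please refund order 17",
    "Refund order 17",
    "refund order 17 now",
    "Please issue a refund for order 17",
]

snapshot = static_score(toy_agent, cases)      # looks perfect on the eval suite
agreement = consistency(toy_agent, variants)   # exposes the phrasing sensitivity
print(snapshot, agreement)
```

The same agent scores perfectly on the one-case suite yet agrees with itself on only half of the paraphrases, which is the kind of gap a static snapshot hides.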

So what is Amazon actually moving toward? Rather than assuming that models can simply be constrained into behaving safely, Silverthorn described an emphasis on decoupled systems. The key idea is to separate the agent’s proposal from the moment it gains authority. Instead of granting direct access and hoping guardrails will be enough, Amazon’s approach uses sandboxed environments where agents propose changes that humans review before implementation. That review gate is especially important in sensitive domains like finance, where the cost of an agent going wrong is not theoretical. The goal is verifiable interactions: you can inspect what the agent wants to do rather than trusting that it will “probably” do the right thing.
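One way to picture that separation is a small propose, review, apply gate. The Python sketch below is a simplified assumption about what such a gate could look like, not a description of Amazon's implementation; `ReviewGate`, `Proposal`, and the in-memory `balances` dictionary are all hypothetical stand-ins for a real system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    APPLIED = "applied"


@dataclass
class Proposal:
    account: str
    amount: int
    status: Status = Status.PENDING
    reviewer: Optional[str] = None


@dataclass
class ReviewGate:
    balances: dict = field(default_factory=dict)   # stand-in for the real system
    audit_log: list = field(default_factory=list)  # what was proposed, reviewed, applied

    def propose(self, account: str, amount: int) -> Proposal:
        """The agent may only describe a change; nothing is modified here."""
        proposal = Proposal(account, amount)
        self.audit_log.append(("proposed", account, amount))
        return proposal

    def review(self, proposal: Proposal, reviewer: str, approve: bool) -> None:
        """A human records an explicit decision on the proposal."""
        proposal.reviewer = reviewer
        proposal.status = Status.APPROVED if approve else Status.REJECTED
        self.audit_log.append((proposal.status.value, reviewer))

    def apply(self, proposal: Proposal) -> None:
        """Authority is granted only after approval."""
        if proposal.status is not Status.APPROVED:
            raise PermissionError("proposal lacks human approval")
        current = self.balances.get(proposal.account, 0)
        self.balances[proposal.account] = current + proposal.amount
        proposal.status = Status.APPLIED
        self.audit_log.append(("applied", proposal.account, proposal.amount))


gate = ReviewGate()
p = gate.propose("ops-budget", 500)

blocked = False
try:
    gate.apply(p)              # attempted before any review
except PermissionError:
    blocked = True

gate.review(p, reviewer="alice", approve=True)
gate.apply(p)                  # succeeds only after approval
```

The audit log doubles as a record of what was proposed, who reviewed it, and what was finally applied.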

This matters because the boardroom question is not just “Can it do the task?” It is “Can we govern it?” VentureBeat’s Q2 Pulse Research survey of over 100 senior technology leaders and buyers found that just 4% said they are comfortable relying on model guardrails alone. When respondents were asked what worries them most about model guardrails, 40% cited unauthorized access to tools or data, while 27% pointed to prompt manipulation or injection. Those concerns map directly to why a static eval score can be a false comfort. Guardrails that look fine under testing conditions may not address how an agent behaves when it is confronted with adversarial prompts or when it attempts to operate across tools under imperfect real-world constraints.

At VB Transform, Silverthorn’s session is titled “Closing the capability-reliability gap: Inside Amazon’s framework for engineering trustworthy agents.” He will share details of Amazon’s approach to trustworthy agentic AI and how companies can move from single-agent wrappers to multi-tool architectures. The difference is more than architectural plumbing. Multi-tool architectures are designed to let agents self-correct mid-execution, which is an operational path to reliability. Instead of treating the model as a single shot, you build systems where the agent can revise course as it gathers information from multiple tools and steps, and where safety controls can stay in the loop before irreversible actions happen.
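As a rough illustration of revising course mid-execution, the Python sketch below tries several tools in turn, validates each result, and moves on when a check fails. It is a deliberately simplified assumption about what a self-correcting loop could look like; `run_with_self_correction`, `cache_lookup`, and `db_lookup` are hypothetical names, not part of any Amazon system.

```python
from typing import Callable

Tool = Callable[[str], str]


def run_with_self_correction(
    query: str,
    tools: list[Tool],
    check: Callable[[str], bool],
) -> tuple[str, list[str]]:
    """Try each tool in order, keeping a trace; return the first validated result."""
    trace: list[str] = []
    for tool in tools:
        try:
            result = tool(query)
        except Exception as exc:  # a failing tool is recorded, not fatal
            trace.append(f"{tool.__name__}: error {exc}")
            continue
        if check(result):
            trace.append(f"{tool.__name__}: ok")
            return result, trace
        trace.append(f"{tool.__name__}: rejected")
    raise RuntimeError(f"no tool produced a valid result: {trace}")


# Hypothetical tools for illustration only.
def cache_lookup(query: str) -> str:
    return ""        # simulates a stale or empty cache entry


def db_lookup(query: str) -> str:
    return "42"      # simulates an authoritative source


result, trace = run_with_self_correction(
    "invoice total",
    tools=[cache_lookup, db_lookup],
    check=lambda r: r.isdigit(),
)
print(result, trace)
```

A single-shot wrapper would have returned the empty cache result; here the validation step lets the run recover and leaves a trace of what was tried.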

There is also a regulatory undertone to everything here, even though this specific interview is focused on engineering. Across industries, regulators tend to push for more than “best effort.” They want traceability, accountability, and controls that can be explained. A decoupled design with sandboxed review can make audits less hand-wavy, because it creates artifacts for oversight: what was proposed, what was reviewed, and what was implemented. That is the kind of structure that tends to survive procurement, security reviews, and policy scrutiny better than “trust the model” narratives.

As a second-order effect, the shift from eval-centric metrics to reliability frameworks can change how leadership teams prioritize investment. If reliability is measured by consistency, robustness, predictability, and safety, then teams may need to fund evaluation redesign, new sandbox workflows, human-in-the-loop operations, and multi-tool orchestration. That is not free. But it may be cheaper than the alternative: pilots that stall during permissioning, where the business delivers a demo while IT blocks production.

Meanwhile, the conference is also signaling that this is not only an Amazon problem. Another agentic ops and evals-focused session at VentureBeat’s flagship conference, happening July 14 and 15 in Menlo Park, will feature Manasi Joshi, director of systems intelligence and machine learning at Waymo. Her session is titled “Intelligence at scale: How Waymo builds safe, efficient AI for the physical world.” The throughline across these topics is safety at scale, not just capability at scale.

For decision-makers building or buying agentic systems, Silverthorn’s core message is a pressure test: “trustworthy” is not a vibe, and it is not a single score. It is a system property built through measurement that reflects real conditions, architecture that limits authority, and workflows that make risky actions reviewable. If you are trying to get from sandbox to enterprise permissions, the difference between a wrapper and a governed multi-tool architecture could be the difference between a model that performs and an agent the organization can safely deploy.
