GPT-5.5 tops Agents' Last Exam with 24.0% while Claude Fable 5 lands third at 22.0%

A new UC Berkeley ALE benchmark tries to measure real agent work, and even the best models still struggle.

By Omar Al-Balawi, Technology Correspondent, The Executives Brief

about 4 hours ago · 5 min read

Executive summary

University of California, Berkeley’s RDI and 300+ domain experts have launched Agents’ Last Exam (ALE), a new benchmark for long-horizon, economically valuable AI agent workflows. The leaderboard shows OpenAI’s GPT-5.5 leading with a 24.0% pass rate, ahead of Anthropic’s Claude Fable 5 at 22.0%, but on the hardest tasks many models still score 0.0%.

Agents’ Last Exam (ALE) has just launched, and it is a benchmark that is less “cool demo” and more “can you actually do the job.” On the new ALE Leaderboard, OpenAI’s GPT-5.5 (from April), running through the Codex harness, took the top spot with a 24.0% pass rate.

That matters because the real story is the rest of the leaderboard. Anthropic’s brand-new Mythos-class Claude Fable 5, released just yesterday, scored 22.0% and landed in third place. Instead of proving that today’s most hyped models are ready for messy, real-world professional workflows, ALE is doing something more unsettling: it is forcing AI agents into longer, tool-using tasks and showing that performance ceilings remain painfully low.

So what is ALE, exactly, and why is everyone treating it like a stress test for the entire agent stack? Researchers from UC Berkeley’s Center for Responsible, Decentralized Intelligence (RDI), supported by an advisory committee of over 300 domain experts, designed ALE to close the gap between academic benchmark “wins” and labor impact that would actually show up in the economy. Historically, AI benchmarks have often been built around static question-answering or narrow, text-only environments. When agentic benchmarks started adding multi-step interaction, grading got sketchy. Automated verifiers can reject correct solutions, and some model families have been caught exploiting loopholes.

ALE’s big design choice is that it targets those loopholes directly. It forces models into a strict Generalist Computer-Use Agent (GCUA) framework. To pass, the agent cannot just fire off terminal commands and call it a day. Instead, ALE maps the workflow across five functional layers: Brain (reasoning), Eyes (visual perception), Body (orchestration), Hands (tool invocation), and Feet (runtime substrate). In practice, that means the agent has to navigate Linux or Windows virtual machines, interleave shell scripting with point-and-click actions inside heavy desktop software, and produce outputs that are evaluated against ground truth.

Even the grading philosophy is built to resist “LLM-as-a-judge” shortcuts. ALE uses that paradigm for only 6.8% of workflows. For tasks that require deterministic outputs, like generating a 3D mesh or parsing SEC filings, ALE relies on deterministic, code-based evaluation that compares the agent’s artifact to an expert reference. In other words, the benchmark tries hard not to reward fluent storytelling that looks right.
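To make the contrast with “LLM-as-a-judge” concrete, here is a minimal Python sketch of deterministic, code-based grading. It is not ALE’s actual grader. The function name `grade_artifact`, the dictionary record format, and the tolerance are all hypothetical, chosen only to show an agent’s artifact being compared field by field against an expert reference with no language model in the loop.

```python
import math


def _is_number(value) -> bool:
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def grade_artifact(artifact: dict, reference: dict, rel_tol: float = 1e-6) -> float:
    """Return the fraction of reference fields that the artifact reproduces.

    Hypothetical grader: numeric fields must match within a relative
    tolerance, every other field must match exactly.
    """
    if not reference:
        raise ValueError("reference must contain at least one field")
    matched = 0
    for key, expected in reference.items():
        actual = artifact.get(key)
        if _is_number(expected) and _is_number(actual):
            ok = math.isclose(actual, expected, rel_tol=rel_tol)
        else:
            ok = actual == expected
        matched += int(ok)
    return matched / len(reference)


# Made-up example data: a mesh summary with one wrong field ("units").
reference = {"vertex_count": 8, "volume": 1.0, "units": "mm"}
artifact = {"vertex_count": 8, "volume": 1.0000000001, "units": "cm"}
score = grade_artifact(artifact, reference)  # two of three fields match
```

The point of this design is that the same artifact always gets the same score, however persuasive the agent’s accompanying explanation may be.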

And ALE is not small. It launches with 1,490 task instances and is scaling toward a 5,000-task target. The tasks are grounded in the U.S. occupational taxonomy (O*NET / SOC 2018), spanning 55 non-physical industry sub-domains. The workflows are sourced from professional histories of industry practitioners, so you are not just seeing toy problems. The benchmark asks agents to do things like 3D model creation in Siemens NX, scene setup in Unreal Engine, neuroimaging analysis in FSLeyes, and visual effects compositing in Adobe After Effects. ALE also splits tasks into three difficulty tiers: Near-Term, Full-Spectrum, and Last-Exam.

Here is the leaderboard picture people will be talking about. The top five agent harnesses on ALE show that pass rates are still not where executives want them to be:

1) Codex (gpt-5-5) - 24.0% pass rate, 42.8% mean score
2) Ale Claw (gpt-5-5) - 23.0% pass rate, 45.8% mean score
3) Claude Code (claude-fable-5) - 22.0% pass rate, 40.5% mean score
4) OpenClaw (gpt-5-5) - 21.1% pass rate, 41.0% mean score
5) Cursor CLI (composer-2-5) - 20.4% pass rate, 38.5% mean score
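The leaderboard reports both a pass rate and a mean score, and the two can rank harnesses differently: Ale Claw has the higher mean score but the lower pass rate. The Python sketch below shows one way that can happen. It rests on an assumption the article does not confirm, namely that each task yields a score between 0 and 1 and counts as passed only at or above a threshold. The function `summarize`, the threshold value, and the sample scores are all hypothetical.

```python
def summarize(scores, pass_threshold=1.0):
    """Return (pass_rate, mean_score) for a list of per-task scores in [0, 1].

    Assumption: a task passes only if its score reaches pass_threshold.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    passed = sum(1 for s in scores if s >= pass_threshold)
    return passed / len(scores), sum(scores) / len(scores)


# Made-up per-task scores for two imaginary harnesses.
harness_a = [1.0, 1.0, 0.0, 0.0]  # finishes two tasks outright, fails two
harness_b = [1.0, 0.9, 0.8, 0.0]  # finishes one task, gets close on two

rate_a, mean_a = summarize(harness_a)
rate_b, mean_b = summarize(harness_b)
# harness_a has the higher pass rate, harness_b the higher mean score
```

Under that assumption, pass rate rewards finishing the job, while mean score also credits partial progress.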

That looks like a “race,” but the story gets darker at the hardest level. On the Last-Exam tier, which represents the frontier of professional difficulty, most configurations, including Anthropic’s older Claude Opus 4.8 and Google’s Gemini CLI, record a devastating 0.0% pass rate. So yes, GPT-5.5 is on top, but “top” still means “frequently failing” when you demand genuine long-horizon task execution.

There is another reason ALE is pulling real attention from buyers and boards: contamination and trust. One persistent failure mode in AI evaluation is benchmark contamination, where test items leak into training data lakes. If a model memorizes the benchmark, the evaluation stops measuring problem-solving and starts measuring recall. ALE addresses this with a dual-use deployment strategy: it operates as an open-source research initiative but closely guards its evaluation data. Only about 10% of the dataset (roughly 150 tasks) is released publicly on platforms like GitHub and Hugging Face, while more than 1,300 tasks stay private.
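The split figures can be checked against the 1,490-task launch size with a few lines of Python. The only assumption is that “about 10%” is treated as exactly 10%.

```python
TOTAL_TASKS = 1490   # launch size reported for ALE
PUBLIC_PERCENT = 10  # "about 10%", treated here as exactly 10

public_tasks = TOTAL_TASKS * PUBLIC_PERCENT // 100
private_tasks = TOTAL_TASKS - public_tasks

print(public_tasks, private_tasks)  # prints: 149 1341
```

That matches the “roughly 150” public tasks and the “more than 1,300” private ones.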

The project is designed as a “living benchmark,” with private tasks rotated into the public pool over time and retired public tasks swapped out. This rolling release aims to keep the evaluation surface fresh across successive model generations. ALE also adds transparency by tracking both “Full” and “Unlicensed” scores. The “Full” leaderboard includes tasks that rely on commercial CAD tools, paid APIs, or licensed datasets. The “Unlicensed” tier drops those license-gated tasks to provide a like-for-like comparison using only freely available tools, so models are not rewarded simply because they can access proprietary software environments.
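The relationship between the “Full” and “Unlicensed” scores can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ALE’s scoring code. The `TaskResult` record, its `needs_license` flag, and the sample results are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    passed: bool
    # Hypothetical flag: the task depends on commercial CAD tools,
    # paid APIs, or licensed datasets.
    needs_license: bool


def pass_rate(results) -> float:
    """Fraction of results that passed."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    return sum(1 for r in results if r.passed) / len(results)


def full_and_unlicensed(results):
    """Return (full, unlicensed) pass rates.

    "Full" uses every task; "Unlicensed" drops the license-gated ones.
    """
    results = list(results)
    full = pass_rate(results)
    unlicensed = pass_rate(r for r in results if not r.needs_license)
    return full, unlicensed


# Made-up results: two license-gated tasks, three freely runnable ones.
results = [
    TaskResult("t1", passed=True, needs_license=True),
    TaskResult("t2", passed=False, needs_license=True),
    TaskResult("t3", passed=True, needs_license=False),
    TaskResult("t4", passed=False, needs_license=False),
    TaskResult("t5", passed=False, needs_license=False),
]
full, unlicensed = full_and_unlicensed(results)
```

In this made-up data the two rates differ (2 of 5 versus 1 of 3), which is why reporting both gives a fairer comparison across harnesses with different software access.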

Finally, the benchmark is already being framed by its own contributors as a reality check for the current ecosystem. Zengyi Qin, an MIT PhD researcher and data contributor to the project, announced the launch on X, writing: “Introducing Agents’ Last Exam (ALE),” and noting “Built by 300+ domain experts from 100+ institutions. Covering 55 industry domains. Claude Opus 4.8 has 0.0% pass rate on the hardest subset.” He also shared a follow-up post linking an arXiv paper and crediting leads @YiyouSun, @Xinyang_Han_, @dawnsongtweets, and @BerkeleyRDI.

For executives and investors, the second-order implication is simple: ALE is not just another leaderboard. It is a tool for interrogating claims that agents are nearly production-ready. When even the best-performing configurations struggle on Last-Exam tasks, the question shifts from “who demos best” to “who can reliably execute multi-step work under real constraints,” with evaluation data that is harder to game and harder to memorize.

The strategic stakes are high because companies are already betting serious capital on AI agents. ALE is pushing the industry toward measurements that resemble workplace reality. Until pass rates move meaningfully on the hardest tier, the sobering leaderboard is a receipt, not a rumor.
