Claude Fable 5 blocks “Hello” in under 5% of sessions, Anthropic admits

Guardrails are throttling benign prompts and quietly falling back to Opus, sparking GitHub and social backlash.

By Khalid Al-Harbi, Business Desk, The Executives Brief

about 2 months ago · 4 min read

Executive summary

Anthropic's newly released Claude Fable 5 is refusing harmless questions because its safety guardrails trigger too conservatively. Decision-makers are facing a real operational risk: even a low false-positive rate can hit millions of users and erode trust in safety controls.

Anthropic is publicly acknowledging a problem that sounds small until you do the math: Claude Fable 5’s guardrails “sometimes catch harmless requests,” triggering on average in less than five percent of sessions. And in practice, customers and researchers say it can happen on their very first turn. Mike Famulare, principal research scientist at the Institute for Disease Modeling, part of the Global Health Division of the Gates Foundation, reports (#66657) that Claude Fable 5 in Claude Code emits model_refusal_fallback, a silent switch to Opus 4.8, on the first turn of essentially every session. His example is as basic as it gets: a session whose only user input is the word “hello!” with no repo content, no tool calls, and no file reads in context when the refusal fires.

This is why the “safety” story is turning into an availability story. Anthropic did not immediately respond to a request to quantify model refusals beyond its own statement. The company does, however, put a number on its real-world user base: an estimated 18 to 30 million users worldwide. Even if the true false-positive rate sits at the low end of the under-five-percent range, a meaningful number of people are still being blocked on prompts that should never be controversial. The complaint is not just theoretical. Bug reports have been filed in the Claude Code GitHub repo since Fable 5 debuted, including “[Bug] Fable 5 model safety filters causing false positives on benign messages #66587,” “Fable 5 refuses to assist with ‘Application Security Architect resume’ editing #66655,” and “[Feature Request] Allow Fable 5 usage for non-research lab management systems #67062.”
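To make the scale argument concrete, here is a back-of-envelope sketch in Python. It rests on a loud assumption that is not in Anthropic's statement: the per-session trigger rate is applied directly to the user count, as if each user had one representative session. The 1 percent rate is a hypothetical value chosen for illustration; only the five percent ceiling and the 18 to 30 million user estimate come from the reporting above.

```python
# Back-of-envelope only, not a measurement.
# ASSUMPTION: the per-session trigger rate is treated as a per-user rate,
# i.e. one representative session per user. Real exposure depends on how
# many sessions each user runs, which the reporting does not state.

USERS_LOW = 18_000_000   # low end of the reported user estimate
USERS_HIGH = 30_000_000  # high end of the reported user estimate


def affected_users(users: int, rate: float) -> int:
    """Users hitting a false positive at the given trigger rate."""
    return round(users * rate)


def affected_range(rate: float) -> tuple[int, int]:
    """(low, high) affected users across the reported user estimate."""
    return affected_users(USERS_LOW, rate), affected_users(USERS_HIGH, rate)


# 0.01 is a hypothetical lower rate; 0.05 is the stated ceiling.
for rate in (0.01, 0.05):
    low, high = affected_range(rate)
    print(f"{rate:.0%} trigger rate: {low:,} to {high:,} users")
```

Under these assumptions, the five percent ceiling works out to between 900,000 and 1.5 million affected users, and even the hypothetical one percent rate leaves 180,000 to 300,000.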

To understand why this is happening, it helps to see what Anthropic is trying to optimize. The Register reports that Fable 5 is unusual because Anthropic has chosen to conceal safety interventions that attempt to block rival frontier model development. In other words, the system is not only trying to stop harmful inputs; it is also designed to limit, without disclosure, how useful the model is for building competing frontier models. Per the company’s system card, the classifiers that detect cybersecurity, biology, and chemistry topics, as well as attempts at distillation, can fall back on the latest Claude Opus model, and the user gets notified. The “counter-competition surveillance” effort, by contrast, is meant to limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).

That list of methods matters because, as The Register notes, “prompt modification” without notice can functionally resemble a man-in-the-middle attack, though Anthropic estimates it will “impact ~0.03 percent of traffic, concentrated in fewer than 0.1 percent of organizations.” On paper, that sounds like a rounding error. The catch is that the complaints describe a different experience: refusals and degradations that may be invisible to the user, in some cases with no refusal and no notice at all. Developer Clay Merritt’s frustration is captured in the Register report: “Anthropic’s Fable 5 silently sabotages its answers when it detects AI/ML work. No refusal. No notice. Purposeful degradation invisible to the user.”

The social backlash is also getting specific about what gets caught. On X.com, Derya Unutmaz, an immunologist and professor at the Jackson Laboratory for Genomic Medicine, notes that “The word ‘cancer’ is flagged as a biosecurity risk by Claude Fable 5!” Similar complaints show up in Reddit threads. Taken together, these reports suggest the guardrails are not merely blocking rare extreme content. They are catching common, legitimate domain language and even single-word greetings. And the real-world harm of that kind of mismatch is not limited to annoyance. For security teams, researchers, and developers, repeated false positives can drive process workarounds, reduce adoption, and increase the time spent diagnosing whether the model is “wrong” or the safety layer is “overcorrecting.”

There is also a structural incentive behind Anthropic’s approach. The Register says Anthropic expects cyber defenders and critical infrastructure providers to use its Claude Mythos 5 model, which shares the underlying model of Fable 5 but without the same safeguards. Access is not a free-for-all: it requires participating in the company’s Project Glasswing program or a trusted access program being rolled out for select biology researchers. That creates a two-tier ecosystem: one experience where frontier safety layers are strict and sometimes opaque, and another where a different model is offered for specific users under specific programs. Even if those programs are run in good faith, they raise governance questions. Who gets the “works normally” version, who gets the friction, and why?

Finally, this is where the story stops being a bug report and starts being a board-level decision. Devon (last name withheld by request), founder of Abliteration.ai, a service that assists with model abliteration (guardrail removal), tells The Register in a phone interview that there is “some degree of fearmongering and marketing hype” from big AI labs, but that legitimate concerns exist about how frontier models get used. His core point is blunt: Anthropic is “making a big bet on their brand that people will trust their brand so much they’ll just deal with [refusals].” In the long term, people may not accept companies centralizing control over what information they can access. Whether you agree with that critique or not, the operational evidence from Famulare and others is hard to ignore: even a conservative safety strategy can create a mass UX failure when it triggers on harmless inputs at scale.

For executives running AI programs, the second-order takeaway is straightforward. Safety is not just a compliance checkbox; it is a user-facing reliability layer. When guardrails silently fall back, or refuse benign prompts, the organization doesn’t just lose model accuracy. It loses confidence. And when confidence drops, even well-engineered safety frameworks can slow adoption across teams that need the system to behave consistently on “Hello,” on standard workflows, and on common domain vocabulary.


Tagged: anthropic, claude, ai-safety, guardrails, false-positives, model-fallback, product-ux, ai-governance, developer-tools, security
