Claude 4.5 folded post_body into description, breaking JSON contracts and reports
A production LLM upgrade turned a “minor bump” into silent misfilters and downstream system failures, then demanded an eval-first reset.

A team running a natural-language-to-API reporting system on Claude Sonnet (starting with 3.5, then upgrading to 3.7 and 4.0 without incident) saw Sonnet 4.5 break a structured JSON contract. The consequence for decision-makers: model upgrades can have an effectively unbounded, “infinite” blast radius unless you shift from prompt discipline to eval-gated behavioral guarantees.
The failure was specific, and the timing was deceptively ordinary: after the team upgraded from Claude Sonnet 4.0 to 4.5, a “meaningful percentage of requests” started behaving in ways their production stack was not built to handle.
In their system, users wrote plain-English requests and the model returned a structured JSON object with fields like description, api_call, and post_body. That contract let the rest of the pipeline do its job. But with 4.5, the model began folding the contents of post_body into the description field, and it also started asking clarifying questions in its response. In the first failure mode, filter parameters never reached the API because the system treated post_body as the source of truth and that field came back empty; the backend then returned data for all time, returned data for all regions, or threw a 500 error. In the second failure mode, the model asked questions, and downstream systems broke because the system had no path for partial completion or human-in-the-loop handling.
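To make the first failure mode concrete, here is a minimal Python sketch of a boundary check that fails closed when post_body comes back empty, rather than letting an unfiltered query through. It is not the team's code. The field names description, api_call, and post_body come from the account above; the function name, the exception class, and the rule that every request must carry at least one filter are assumptions made for illustration.

```python
import json

# Field names taken from the article; everything else here is hypothetical.
REQUIRED_FIELDS = ("description", "api_call", "post_body")


class ContractViolation(Exception):
    """Raised when a model response cannot safely become an API call."""


def parse_model_response(raw: str) -> dict:
    """Parse the model's JSON and refuse to proceed on a broken contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # A plain-text clarifying question lands here.
        raise ContractViolation(f"response is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractViolation("response is not a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ContractViolation(f"missing fields: {missing}")
    # Assumption: every valid request carries at least one filter, so an
    # empty post_body means the filters were lost, not that none were needed.
    if not isinstance(data["post_body"], dict) or not data["post_body"]:
        raise ContractViolation("post_body is empty; refusing an unfiltered query")
    return data
```

A check like this would not have prevented the regression, but it would have turned silent all-time or all-region results into a visible error. If some requests legitimately have no filters, the last rule would need to be relaxed.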
To understand why this landed like a wrecking ball, remember what “worked” before. Their workflow was engineered like a deterministic tool: natural-language questions became API calls, the call went to the right backend (with integrations to internal reporting portals, Salesforce, and homegrown services), the model-generated JSON query filtered and shaped the response, and the output was delivered via email, a Drive document, or a chart in the browser. By mid-2025, it was generating several hundred reports a month and had become the default ad-hoc data-pull method for analysts, account managers, and operations leads, and it was used by leadership and external stakeholders.
Then the upgrades stopped being “library bumps” and started being functional replacements. Sonnet 3.5 was the baseline in early 2025; they upgraded to 3.7 without incident, and to 4.0 without incident. By the time Sonnet 4.5 shipped, they had grown complacent about stability because the previous bumps behaved like expected changes. The 4.5 regression was not that the model was “wrong” in a conversational sense. It was that the system’s engineering assumptions were under-specified.
The post-mortem surfaced a key point: their prompt had always been under-specified. They told the model to return a JSON object with three fields and described what each field was for. But they did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields. Earlier model versions inferred this constraint from context. Sonnet 4.5 evidently became more “helpful” by providing the request body inside the description and, in ambiguous cases, by asking questions. From the model’s perspective, that behavior improves usefulness. From the system’s perspective, it violates the contract assumptions the pipeline was built on.
This is why the article keeps returning to an idea that should scare board members and engineering leaders alike: the “infinite blast radius” of LLM-backed systems. Traditional software engineering bounds change effects by construction. Unit tests and release-note reading assume the changed component is deterministic enough to sample or predict, so the blast radius can be constrained by design. With LLMs, the component producing the output is not under full control. You cannot diff a model version bump from 4.0 to 4.5 like you would diff a library update. Natural language inputs and model failure modes are both unbounded, so downstream behavior can spread in ways nobody can enumerate in advance.
They also call out a common trap for executives: assuming schemas solve the risk. The system used a structured JSON object, and it could have used tool-use or structured output modes that constrain the response format at the schema level. But the article makes the nuance clear: a schema constrains syntax, not semantics. Even with schemas, you still cannot specify that a clarifying question should never appear in a system with no clarification pathway, or that a date range should never silently default to all-time. In other words, format compliance does not guarantee behavioral compliance.
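The gap between syntax and semantics can be shown with a small Python sketch. The response below passes a type-level check on the three fields yet still contains a clarifying question and no date filter. The endpoint path, the date_range key, and both function names are hypothetical, and the question-mark heuristic is deliberately crude.

```python
def matches_schema(resp: dict) -> bool:
    """Syntactic check only: the right keys with the right types."""
    return (
        isinstance(resp, dict)
        and set(resp) == {"description", "api_call", "post_body"}
        and isinstance(resp["description"], str)
        and isinstance(resp["api_call"], str)
        and isinstance(resp["post_body"], dict)
    )


def behavioral_problems(resp: dict) -> list[str]:
    """Semantic checks that a format-level schema does not express."""
    problems = []
    # Crude heuristic: a description ending in "?" is treated as a question.
    if resp["description"].rstrip().endswith("?"):
        problems.append("clarifying question, but no clarification pathway")
    # Hypothetical filter key; a missing range would mean all-time data.
    if "date_range" not in resp["post_body"]:
        problems.append("no date_range; query would default to all-time")
    return problems


# Schema-valid response that would still break the pipeline.
response = {
    "description": "Which region did you mean, EMEA or APAC?",
    "api_call": "/reports/sales",
    "post_body": {},
}

print(matches_schema(response))
for problem in behavioral_problems(response):
    print(problem)
```

The first function is roughly what a structured-output mode enforces; the second is the kind of rule that has to live in application checks or evals.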
So what did they do after rolling back to 4.0? The operational pain was real: reverting a model required requalifying every newly added API integration against 4.0 under time pressure, because those integrations had been qualified against 4.5. That is the hidden cost of “model choice” becoming part of the product surface. Every time you swap the model, you are changing the behavior of the interpreter that turns user intent into system actions.
Their proposed discipline is evals-first architecture. Instead of treating the prompt as the formal specification, they treat the evaluation suite as the specification and the prompt as an implementation. An eval is an input, a property the output must satisfy, and a scoring function. They even sketch the kind of assertion that would have caught the 4.5 regression: check that the description field does not contain serialized payload content such as curl, post_body, “{”, or URL strings. In practice, they argue you build a gate with hundreds of such properties, combining hand-written invariants, real-traffic-derived regression tests, and even LLM-as-judge scoring for fuzzier qualities like tone.
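A minimal Python sketch of that structure, under stated assumptions: each eval pairs an input with a property, a scoring function computes the pass rate, and a gate blocks the upgrade below a threshold. The marker list mirrors the assertion described above (curl, post_body, “{”, URL strings). The Eval class, the run_model callable, and the two stub models are hypothetical stand-ins, not the team's implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Markers from the assertion described in the article; the regex is a rough cut.
PAYLOAD_MARKERS = re.compile(r"\bcurl\b|post_body|\{|https?://", re.IGNORECASE)


@dataclass
class Eval:
    name: str
    user_input: str                # the input
    prop: Callable[[dict], bool]   # the property the output must satisfy


def description_is_plain_language(output: dict) -> bool:
    """True if description is a string with no serialized payload content."""
    desc = output.get("description")
    return isinstance(desc, str) and PAYLOAD_MARKERS.search(desc) is None


def score(evals: list[Eval], run_model: Callable[[str], dict]) -> float:
    """Scoring function: fraction of evals whose property holds."""
    if not evals:
        raise ValueError("no evals to run")
    passed = sum(1 for e in evals if e.prop(run_model(e.user_input)))
    return passed / len(evals)


def gate(evals: list[Eval], run_model: Callable[[str], dict],
         threshold: float = 1.0) -> bool:
    """Allow the upgrade only if the pass rate meets the threshold."""
    return score(evals, run_model) >= threshold


# Hypothetical stubs standing in for real model calls.
def old_model(user_input: str) -> dict:
    return {"description": "Sales by region for Q1",
            "api_call": "/reports/sales",
            "post_body": {"quarter": "Q1"}}


def new_model(user_input: str) -> dict:
    return {"description": 'Sales by region. post_body: {"quarter": "Q1"}',
            "api_call": "/reports/sales",
            "post_body": {}}


EVALS = [Eval("description has no payload", "Show Q1 sales by region",
              description_is_plain_language)]

print(gate(EVALS, old_model))
print(gate(EVALS, new_model))
```

In a real suite the stubs would be replaced by calls to each candidate model version, and the single eval by the hundreds of properties the team describes.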
But they do not oversell it. Evals are expensive to build and maintain, they can drift as the product changes, LLM-as-judge scoring introduces variance, and you can only catch failure modes you have thought to specify. They learned that explicitly: nobody had written an assertion like “the description field should not contain a curl command” because nobody had thought the model would output one. So the takeaway is not “evals will save you.” The takeaway is “evals are how you bound blast radius when the core function is a black box.”
For peers, the strategic stakes are straightforward. If your company uses LLMs to execute workflows, generate operational decisions, or assemble reports that others rely on, then a model upgrade is not a cosmetic change. It is a behavioral release train, and without eval gates, you can get silent misfilters, backend errors, or broken downstream logic that turns one “helpful” formatting choice into a production incident. The question for decision-makers is simple: are you managing the upgrade like a software bump, or like a new dependency with a contract you can prove at the behavioral level before it ships?