Sonnet 4.5 broke a JSON contract in production, and rolling back meant requalifying every new integration
One model upgrade silently changed how the model filled in its structured output, leading to dropped filters, unexpected clarifying questions, and cascading downstream failures.

A team built an LLM system on Claude Sonnet 3.5 and upgraded through 3.7 and 4.0 without incident, then rolled out Sonnet 4.5 and saw production regressions where the model leaked structured payloads into the description field. For decision-makers, the consequence is clear: model upgrades can create an effectively unbounded “blast radius,” unless evaluation becomes the real specification.
The system did one thing well: it turned natural-language questions into API calls. Users (analysts, account managers, and operations leads) would type requests like “Compile a report on sales volume for January through March 2026 for the Northeast region, broken down by city,” and the system would output a structured JSON object with fields like description, api_call, and post_body. The rest of the pipeline was conventional engineering. The system dispatched the call to the right backend, used an LLM-generated JSON query to filter and shape the response, and delivered the result via email, a Drive document, or a chart in the browser. By mid-2025, the system was generating several hundred reports a month, and it had become the default way most teams pulled ad-hoc data.
Then came the Sonnet 4.5 rollout. For a meaningful percentage of requests, the model began folding the contents of post_body into the description field, breaking the contract the rest of the system assumed. Two failure modes followed immediately. First, filter parameters never reached the API. The system treated post_body as the source of truth, but that field now came back empty, so the API call ran without the date range or region filter. Depending on the backend endpoint, that meant returning sales volume for all time or all regions, or returning a 500 error. Second, Sonnet 4.5 sometimes asked clarifying questions in its response. That was new behavior, and it was fatal because the system had no path for it. It had been built on the assumption that every model invocation would result in an API call, with no human-in-the-loop component and no state to hold a partially completed request. Downstream systems then broke in multiple ways.
The operational headline is “roll back,” but the real lesson is that rollback was harder than it should have been. The team rolled back from 4.5 to 4.0, but between the 4.0 and 4.5 deployments they had added new API integrations, and those integrations were qualified against 4.5. Reverting the model meant requalifying every one of them against 4.0 under time pressure. That’s the hidden bill you pay when the part you depend on is a black box. In traditional software, you can often bound the effect of a change by reading release notes, using unit tests, and sampling a predictable behavioral surface. With LLM-backed systems, the component producing your output is not under your control. A model version bump becomes a wholesale replacement of the functionality your system depends on. The authors call this an infinite blast radius: downstream effects cannot be enumerated in advance because both the input space (natural language) and the failure modes (what the model might do differently) are unbounded.
So what actually went wrong? The post-mortem revealed that the prompt had always been under-specified. The system asked the model to return a JSON object with three fields, and it described what each field was for. But it did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields. Earlier model versions inferred the constraint from context. Sonnet 4.5, evidently more “helpful,” chose behavior that made the response more useful to a human reader: it asked for clarification or included the request body in description. From the model’s perspective, this was reasonable. From the system’s perspective, it violated the assumptions under which the system was built.
The bug was not in the model. The bug was the assumption that the model would keep filling in specification gaps safely. This matters because many teams lean on schemas and “structured output” modes as if they guarantee safety. But schemas constrain syntax, not semantics. A schema can’t stop a clarifying question from appearing if the system has no path to handle it. It also can’t prevent silent defaults like a date range turning into all-time. Structured output modes and tool-use APIs might have caught the specific failure at the schema level, but the article notes they were not used for engineering reasons outside the scope of that piece. Either way, schemas would only solve the easier half.
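The gap between syntax and semantics can be made concrete with a small sketch. It is an illustration only: the field types, the endpoint path, and the filter names below are assumptions, not details from the original system. A purely structural check accepts the intended output, the leaked-payload output, and the clarifying-question output alike.

```python
import json

# Hypothetical structural schema for the three-field contract.
# The types (str, str, dict) are assumptions made for illustration.
SCHEMA = {"description": str, "api_call": str, "post_body": dict}


def matches_schema(raw: str) -> bool:
    """Return True if raw is a JSON object with exactly the expected fields and types."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(obj, dict)
        and set(obj) == set(SCHEMA)
        and all(isinstance(obj[name], kind) for name, kind in SCHEMA.items())
    )


# What the pipeline expects (endpoint and filter names are made up).
intended = json.dumps({
    "description": "Sales volume by city for the Northeast region",
    "api_call": "/reports/sales",
    "post_body": {"region": "Northeast", "group_by": "city"},
})

# Payload folded into description, post_body left empty.
leaked = json.dumps({
    "description": 'Sales volume by city. post_body: {"region": "Northeast"}',
    "api_call": "/reports/sales",
    "post_body": {},
})

# A clarifying question where an API call was expected.
question = json.dumps({
    "description": "Which fiscal calendar should the report use?",
    "api_call": "",
    "post_body": {},
})

for label, raw in [("intended", intended), ("leaked", leaked), ("question", question)]:
    print(label, matches_schema(raw))
```

All three outputs pass the structural check, which is the point: a check of this kind cannot tell the intended response from the two failure modes.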
That is why the authors shift to an evals-first architecture. In this framing, the evaluation suite is the formal specification, not the prompt. The prompt is an implementation of the spec, and the model is an interpreter. Evals are the spec itself, and any model or prompt change is valid only if it passes them. An eval is described as a triple: an input, a property the output must satisfy, and a scoring function. For their regression, they give an example test in Python: ensure the description field does not contain serialized payload artifacts like curl, post_body, braces, or URLs. They also emphasize that you want “a few hundred” such properties, combining hand-written invariants, regression tests derived from real production traffic, and LLM-as-judge scoring for fuzzier qualities like tone.
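The authors' own test is not reproduced here. As a minimal sketch of the idea, assuming the model returns a JSON string and using hypothetical names throughout (Eval, description_is_clean, run_eval), an eval for this regression might pair an input with a property on the description field and a pass/fail score:

```python
import json
import re
from dataclasses import dataclass
from typing import Callable

# Artifacts named in the text: curl, post_body, braces, URLs.
# The exact pattern is an assumption and would need tuning on real traffic.
ARTIFACTS = re.compile(r"\bcurl\b|post_body|[{}]|https?://", re.IGNORECASE)


def description_is_clean(output: dict) -> bool:
    """Property: description is a non-empty string with no payload artifacts."""
    description = output.get("description")
    if not isinstance(description, str) or not description.strip():
        return False
    return ARTIFACTS.search(description) is None


@dataclass(frozen=True)
class Eval:
    """One eval: an input, a property the output must satisfy, and a scorer."""
    prompt: str
    prop: Callable[[dict], bool]
    score: Callable[[bool], float] = float  # True -> 1.0, False -> 0.0


def run_eval(ev: Eval, raw_model_output: str) -> float:
    """Score one raw model output; unparseable or non-object output fails."""
    try:
        output = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return ev.score(False)
    if not isinstance(output, dict):
        return ev.score(False)
    return ev.score(ev.prop(output))


ev = Eval(
    prompt="Compile a report on sales volume for the Northeast region by city",
    prop=description_is_clean,
)
```

In this sketch the model call itself is left out: run_eval takes the raw output so the same property can be replayed against recorded production responses as well as fresh ones.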
There are three unglamorous constraints here that executives should care about. First, evals are expensive to build and maintain, and they drift as the product changes. Second, LLM-as-judge introduces variance: it can be inconsistent about what it considers acceptable. Third, no eval suite can protect you against categories you never thought to specify. The team’s specific blind spot was that nobody wrote an assertion like “the description field should not contain a curl command,” because nobody expected the model to put one there. Evals are not a silver bullet. They are a way to bound the blast radius using the only lever you have with a black-box system: densely sample the input-output behavior you actually care about, and refuse to deploy when it moves.
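"Refuse to deploy when it moves" can be sketched as a simple gate. This is an assumption-laden illustration, not the team's tooling: the candidate is any callable standing in for a model-plus-prompt combination, each case pairs an input with a pass/fail property, and the function name gate is hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# A case is (input text, property the candidate's output must satisfy).
Case = Tuple[str, Callable[[str], bool]]


def gate(candidate: Callable[[str], str], cases: Iterable[Case]) -> List[str]:
    """Run every case against the candidate and return the failing inputs.

    An empty list means the candidate may be deployed. An exception raised
    by the candidate or by a property counts as a failure, not a skip.
    """
    failures = []
    for prompt, prop in cases:
        try:
            passed = prop(candidate(prompt))
        except Exception:
            passed = False
        if not passed:
            failures.append(prompt)
    return failures


# Stub candidates standing in for two model versions (purely illustrative).
def stable_candidate(prompt: str) -> str:
    return "report for " + prompt


def drifted_candidate(prompt: str) -> str:
    if "region" in prompt:
        return "Which region do you mean?"
    return "report for " + prompt


cases = [
    ("sales by region", lambda out: out.startswith("report for")),
    ("sales by month", lambda out: out.startswith("report for")),
]

print(gate(stable_candidate, cases))
print(gate(drifted_candidate, cases))
```

Returning the failing inputs rather than a single boolean is a design choice in this sketch: it shows which sampled behavior moved, which is what a rollback or prompt fix would need to know.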
For boards, CIOs, and product leadership, the strategic stake is simple and uncomfortable: AI features are not just “new UI.” They introduce a dependency whose behavior can shift with model updates, formatting choices, and how the system decides to be “helpful.” Your reliability posture cannot be limited to prompt wording and traditional unit tests. In practice, you need evaluation to function like a gate before a model upgrade becomes an all-system behavior change. Otherwise, the next time an update “should be routine,” it might instead turn your default reporting path into a silent filter-wipe, a clarifying-question dead-end, and a rollback that forces you to requalify integrations you thought were stable.