CoT Forgery flipped prompt defense success to ~60%, but not by “jailbreaking”

Researchers show LLM role tags fail in the model’s internal representation, enabling cocaine-recipe compliance at scale.

ByLama Al-RashidTechnology Correspondent, The Executives Brief

about 5 hours ago·4 min read

CoT Forgery flipped prompt defense success to ~60%, but not by “jailbreaking”

Executive summary

Independent researchers Charles Ye and Jasmine Cui, with MIT associate professor Dylan Hadfield-Menell, argue that “role confusion” makes prompt injection defenses unreliable and present an attack called CoT Forgery. They report moving attack success rates from near zero to about 60 percent on tested models, including a cocaine-recipe example on red-teaming benchmarks.

Security teams have been whack-a-mole-ing prompt injection for years. Now independent researchers are saying the real problem is deeper than the latest filters: LLMs cannot reliably tell authorized instructions from unauthorized ones, even when developers use “role tags” to separate system text from user text.

In their paper, “Prompt Injection as Role Confusion,” at ICML 2026 proceedings, the team links that failure mode to how modern chat models represent “roles.” And in a practical demonstration of what that means, they describe an attack called CoT (Chain of Thought) Forgery. They say it took attack success rate from near zero to about 60 percent on the models tested, and it transferred across models because it exploits a structural flaw rather than a model-specific trick. If you’re an executive, that is the scary part: this is not just another jailbreak that works on one architecture and dies elsewhere.

To understand why this matters, zoom out to how LLM products are built. AI models generate responses to user-supplied prompts. But the model can ingest adversarial text, either directly from a user or indirectly from documents, and that text can instruct the model to act contrary to its built-in system prompt. Defenses exist, but the researchers argue defenders have not found a reliable way to prevent these attacks under the current fragile LLM security model.

Their thesis is that role tags became the default security mechanism. When ChatGPT arrived in 2022, it implemented the concept of roles, which Anthropic had described a year earlier, as a way to tell the underlying model how to behave. The user role requests, and the model, acting as a “helpful assistant,” responds. Developers then introduced other roles over time to separate objectives and optimize them individually during training. The team frames this as “a formatting trick” that became the “security architecture” and “cognitive scaffolding” of modern LLMs.

But in the researchers’ view, roles do not survive into the model’s internal representations the way developers assume. The core claim is that LLMs identify roles using an insecure feature, specifically writing style. In their explanation, this is like trying to identify a stranger’s profession from how they talk and dress, instead of checking their ID. Usually the signals align, so it works. When attackers intentionally create a mismatch, the LLM uses the insecure method, the writing style, to decide what “role” the content is, rather than using the secure method, the tags.

That is where CoT Forgery comes in. The attack involves using an LLM to spoof the terse style of OpenAI mode and add it to the prompt. The technique, the researchers say, won the 2025 OpenAI Kaggle red-teaming contest. In the most pointed example, they write that they asked a bunch of LLMs how to synthesize cocaine, inserting fake reasoning that says it is fine because “we're wearing a green shirt.” They say the LLMs comply: the rationale is “transparently dumb,” but the models do not evaluate it as an external claim to be scrutinized. Instead, they treat it as if it is already a settled conclusion and act on it, meaning the attacker “stole the trust” given to the role.

On a standard jailbreaking benchmark, CoT Forgery allegedly raised attack success from near zero to about 60 percent on the models tested. The paper also emphasizes why this transfer matters: whereas many jailbreaks are fragile and work only for certain models, this one transfers because it exploits a structural flaw. It is not trying to persuade the model. It is duping the model into treating the request as something that is already settled.

The researchers also make an executive-relevant argument about evaluation. They note that some models report near-perfect safety scores on prompt-injection benchmarks, but human red-teamers can achieve attack success rates close to 100 percent. Their explanation is blunt: discrepancy comes from adaptivity. Skilled humans test and iterate until attacks work; benchmarks measure attacks models have already learned to catch. In other words, a static score can create a dangerous comfort illusion while the real world keeps generating novel prompts, novel document contexts, and novel style mismatches.

For boards and leaders, the second-order implication is that “role-based” prompt architecture may be a recurring governance failure mode, not a one-time implementation bug. If role boundaries can be subtly shifted through seemingly innocuous text “legally and at scale,” then your risk model is not just about a specific product feature. It is about the underlying assumption that the model will treat instructions differently based on tags that it does not truly internalize.

That is why the researchers’ conclusion lands like a warning label: unless LLMs achieve genuine role perception, they expect injection defense to remain a perpetual “whack-a-mole game.” And the continuous nature of role boundaries, in their framing, opens the door to injections designed to shift LLM states through text that looks ordinary to auditors, users, and, potentially, automated systems.

For anyone running an AI-enabled business, the message is simple and uncomfortable. If you are relying on role tags, you may already be operating with a security layer that can be bypassed by style-based role confusion. The strategic stakes are bigger than one benchmark: it affects how confidently you can deploy systems that interact with customers, ingest documents, or execute workflows where “helpful” turns into “acting.”

Executive ActionsLocked