Xiaomi’s HarnessX rewrites its own AI harness mid-task, boosting small models up to +44%
A trace-driven “harness foundry” lets enterprise agents evolve their scaffolding autonomously, not by hand.

Xiaomi researchers introduced HarnessX, a framework that treats the AI harness as a modular, first-class object and autonomously improves its code using execution traces. For enterprise decision-makers, it shifts the bottleneck from “bigger models” to “better scaffolding,” with results like +14.5% average gains and +44% on embodied planning for Qwen3.5-9B.
Xiaomi’s HarnessX does something most enterprise AI teams do not: it changes the rules of the game while the agent is playing it. Instead of using a static, hand-crafted harness to connect a backbone LLM to tools, memory, and control flow, HarnessX treats the harness as a modular object that can be rewritten from execution data. In practical tests, the harness evolution produced an average +14.5% performance gain across 15 model-benchmark combinations, and for the open-weight Qwen3.5-9B it reached +44% on embodied planning tasks.
The immediate implication is blunt: scaling the foundation model is not always the best path to better AI, and for smaller models it may not be the best path at all. HarnessX’s reported results argue that the harness itself can be the limiting factor for long-horizon enterprise AI agents, especially when tasks span many steps and depend on tool use, memory management, and control flows that tend to break or degrade whenever one piece is changed. HarnessX aims to remove that ceiling by letting the harness adapt to what actually happens during execution, rather than relying on manual improvements that never fully learn from the traces they collect.
To understand why this matters, you have to picture what a harness does in an agent system. The harness is the operational layer that converts raw model outputs into structured, executable behaviors. It includes the prompts and templates, external tool integrations, memory management, and the control flows that determine how an AI system observes the environment, reasons through problems, and takes action. As agents take on more complex, long-running workflows, harness engineering becomes foundational, but it is still not treated like a mature discipline.
The HarnessX researchers identify three key failure modes of today’s harnesses. First, harnesses are often static and hand-engineered. A change to the underlying foundation model, the introduction of a new tool, or a move into a different operational domain can each trigger bespoke manual rewrites. Second, harness code is frequently architecturally entangled: prompt templates, tool wrappers, retry policies, and memory management are coupled so tightly that small changes become risky. Third, the harness and model are optimized in isolation: when engineers run tests to improve the harness, the execution traces are commonly discarded rather than turned into training data that could improve the model’s future behavior.
HarnessX’s pitch is to fix the harness engineering bottleneck with a “unified harness foundry.” The core move is treating the harness as a first-class object: independently serializable, modular, and substitutable. HarnessX separates model configuration from harness configuration, so engineers can swap and evolve scaffolding without touching the underlying model. Inside the harness, behaviors are decomposed into components like context assembly, memory management, tool ecosystems, control flow, and observability. Each behavior is implemented as a “processor” that plugs into lifecycle hooks of the harness. That modular architecture matters because it makes safe evolution possible: processors can be swapped, added, or removed without silently breaking the rest of the pipeline.
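The article does not publish HarnessX’s actual API, so the following is only a minimal sketch of the idea described above, under stated assumptions. The class names (`ModelConfig`, `Processor`, `Harness`), the hook name `before_model_call`, and the two example processors are all hypothetical. The sketch shows a model configuration kept separate from the harness, behaviors registered as processors on a lifecycle hook, and one processor swapped without touching the others.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical names throughout: this is an illustration, not HarnessX's API.

@dataclass(frozen=True)
class ModelConfig:
    """Model settings, kept apart from the harness so either can change alone."""
    name: str
    temperature: float = 0.0

@dataclass(frozen=True)
class Processor:
    """One behavior (memory, context assembly, ...) bound to a lifecycle hook."""
    name: str
    hook: str
    fn: Callable[[Dict], Dict]

@dataclass(frozen=True)
class Harness:
    processors: List[Processor] = field(default_factory=list)

    def run_hook(self, hook: str, state: Dict) -> Dict:
        """Apply, in order, every processor registered on the given hook."""
        for p in self.processors:
            if p.hook == hook:
                state = p.fn(state)
        return state

    def swap(self, name: str, replacement: Processor) -> "Harness":
        """Return a new harness with one processor substituted by name."""
        return Harness([replacement if p.name == name else p
                        for p in self.processors])

    def to_json(self) -> str:
        """Serialize a manifest of the harness (names and hooks only)."""
        return json.dumps([{"name": p.name, "hook": p.hook}
                           for p in self.processors])

def keep_last(n: int) -> Callable[[Dict], Dict]:
    """Memory processor: keep only the last n history entries."""
    def fn(state: Dict) -> Dict:
        return {**state, "history": state["history"][-n:]}
    return fn

def assemble_context(state: Dict) -> Dict:
    """Context processor: build the prompt from whatever history remains."""
    return {**state, "prompt": "\n".join(state["history"])}

model = ModelConfig(name="Qwen3.5-9B")  # untouched by harness changes
v1 = Harness([
    Processor("memory", "before_model_call", keep_last(2)),
    Processor("context", "before_model_call", assemble_context),
])
# Evolve the harness: replace only the memory processor.
v2 = v1.swap("memory", Processor("memory", "before_model_call", keep_last(1)))

state = {"history": ["step 1", "step 2", "step 3"]}
print(v1.run_hook("before_model_call", state)["prompt"])
print(v2.run_hook("before_model_call", state)["prompt"])
print(v2.to_json())
```

Because `swap` returns a new `Harness` rather than mutating the old one, both versions can be run side by side on the same task, which is the kind of comparison a trace-driven evolution loop would need.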
To automate how this modular system improves, HarnessX introduces AEGIS, a trace-driven evolution engine that frames harness adaptation as a reinforcement learning problem over symbolic components of the harness. The point here is not just “try stuff and see what works.” The researchers explicitly call out three pathologies that can happen in RL-style editing: reward hacking, where the system exploits shortcuts; catastrophic forgetting, where an edit fixes one domain but breaks another; and under-exploration, where the system gets stuck in tiny prompt tweaks rather than exploring structurally better tool configurations.
AEGIS is designed around full trace observability and a four-stage pipeline. The Digester compresses execution traces into structured summaries to locate failures. The Planner analyzes those summaries and pushes exploration toward structural changes. The Evolver generates code-level harness edits and runs tests to confirm the changes work before deployment. Finally, the Critic checks for reward hacking, and a deterministic gate rejects any update that regresses previously solved tasks, which is meant to prevent catastrophic forgetting. This is what makes the “harness rewrite mid-task” idea more than a marketing headline: the pipeline tries to control the failure modes that would otherwise make autonomous evolution dangerous in production.
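The article does not describe how the gate is implemented. As a rough sketch, assuming each trace records a task ID, a pass/fail flag, and an error label (all hypothetical fields), a digest step and a regression gate of the kind described could look like this. The gate accepts a candidate harness only if every task the current harness solved is still solved.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Trace:
    """Hypothetical trace record; real traces would carry far more detail."""
    task_id: str
    passed: bool
    error: str = ""

def digest(traces: List[Trace]) -> Dict[str, int]:
    """Digester-style summary: count failures per error label."""
    counts: Dict[str, int] = {}
    for t in traces:
        if not t.passed:
            counts[t.error] = counts.get(t.error, 0) + 1
    return counts

def gate(before: List[Trace], after: List[Trace]) -> bool:
    """Deterministic gate: reject if any previously solved task now fails."""
    solved_before = {t.task_id for t in before if t.passed}
    solved_after = {t.task_id for t in after if t.passed}
    return solved_before <= solved_after  # subset test, no randomness

current = [
    Trace("t1", True),
    Trace("t2", False, "timeout"),
    Trace("t3", False, "timeout"),
]
# Candidate A fixes t2 and keeps t1: one more task solved, no regression.
candidate_a = [Trace("t1", True), Trace("t2", True), Trace("t3", False, "timeout")]
# Candidate B solves more tasks overall but breaks t1.
candidate_b = [Trace("t1", False, "bad_tool_call"), Trace("t2", True), Trace("t3", True)]

print(digest(current))
print(gate(current, candidate_a))
print(gate(current, candidate_b))
```

Candidate B is rejected even though it solves two tasks to the current harness’s one. That is the design choice the article attributes to the gate: a higher aggregate score does not excuse breaking something that already worked.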
HarnessX also leans on harness-model co-evolution, and that is the part that could change how teams allocate effort. The researchers argue that optimizing either component in isolation hits walls. Evolving only the harness can hit a scaffolding ceiling if the model cannot use the new tools effectively. Training only the model can hit a training-signal ceiling if the harness never prompts it to exploit advanced capabilities. HarnessX interleaves harness evolution with model training by converting execution traces generated during harness adaptation into reinforcement learning signals for the foundation model. Every time the harness improves, the model simultaneously learns to better exploit that new strategy, breaking capability ceilings that traditional AI agent workflows often leave intact.
They make this co-evolution operational through cross-harness GRPO, a group relative policy optimization approach. When fine-tuning, cross-harness GRPO pools an agent’s execution trajectories for the same task across entirely different versions of the application harness. That allows the model to internalize higher-level strategy shifts, like using a new API endpoint or managing an execution budget, rather than only learning minor prompt-phrasing variations.
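The article gives no formula for cross-harness GRPO. The sketch below assumes the standard GRPO group-relative advantage (each reward minus the group mean, divided by the group standard deviation) and assumes that “pooling” means forming the group from all trajectories for a task regardless of harness version. The field names, the `group_key` parameter, and the use of the population standard deviation are illustrative choices, not details from the source.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Callable, Dict, Hashable, List

def group_advantages(
    trajectories: List[Dict],
    group_key: Callable[[Dict], Hashable],
    eps: float = 1e-8,
) -> List[Dict]:
    """Attach a group-relative advantage to each trajectory.

    advantage = (reward - group mean) / (group std + eps)
    """
    groups = defaultdict(list)
    for traj in trajectories:
        groups[group_key(traj)].append(traj)

    scored = []
    for group in groups.values():
        rewards = [t["reward"] for t in group]
        mu, sigma = mean(rewards), pstdev(rewards)
        for t in group:
            scored.append({**t, "advantage": (t["reward"] - mu) / (sigma + eps)})
    return scored

# Same task, two harness versions: v1 always fails, v2 always succeeds.
trajectories = [
    {"task_id": "t1", "harness": "v1", "reward": 0.0},
    {"task_id": "t1", "harness": "v1", "reward": 0.0},
    {"task_id": "t1", "harness": "v2", "reward": 1.0},
    {"task_id": "t1", "harness": "v2", "reward": 1.0},
]

# Per-harness groups: rewards are identical inside each group.
per_harness = group_advantages(
    trajectories, group_key=lambda t: (t["task_id"], t["harness"]))
# Cross-harness groups: pool by task only, across harness versions.
cross_harness = group_advantages(
    trajectories, group_key=lambda t: t["task_id"])

print([round(t["advantage"], 3) for t in per_harness])
print([round(t["advantage"], 3) for t in cross_harness])
```

In this toy case the per-harness grouping yields zero advantage everywhere, because every trajectory within a harness version earns the same reward. Pooling across versions produces negative advantages for the v1 trajectories and positive ones for v2, so the strategy difference between harness versions becomes a learning signal.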
For validation, HarnessX was tested across five benchmarks spanning software engineering, multi-turn customer service dialog, web navigation, open-ended multi-step reasoning, and embodied planning. The setup splits roles: a “meta-agent” powered by Claude Opus 4.6 analyzes logs and writes code to evolve harnesses, while task agents run the workflows. To demonstrate model-agnosticism, the researchers tested three different worker models: Claude Sonnet 4.6, GPT-5.4, and the open-weight Qwen3.5-9B. HarnessX was compared against two baselines. The first is a static harness, representing typical enterprise deployments built on hand-crafted, frozen setups. The second is a Claude Code SDK baseline, a single-agent evolver included to test whether simple single-model code iteration can match the full four-stage AEGIS pipeline.
The strategic stakes for enterprises and investors are straightforward: if the harness is a critical ceiling, then budgets that currently flow into bigger foundation models may be partially misallocated. Smaller models that already look “economically constrained” could become more capable through automated harness evolution. Meanwhile, teams building AI agents should expect a shift in what “performance improvement” means: not only prompt tuning or model selection, but also building harness pipelines that can learn from traces without collapsing into brittle, entangled code. HarnessX’s reported gains suggest the next competitive edge may be the software scaffolding itself, and the organizations that treat harnesses like first-class, evolvable artifacts could outpace those still iterating by hand.