Alibaba’s SkillWeaver cuts agent tool tokens 99.9% with SAD feedback loop

A new task decomposition and retrieval pipeline slashes context use while boosting multi-step tool-routing accuracy in enterprise benchmarks.

By Yousef Al-Zahrani, Technology Correspondent, The Executives Brief

about 2 hours ago·4 min read

Executive summary

Researchers at Alibaba developed SkillWeaver, plus Skill-Aware Decomposition (SAD), to route multi-step agent tasks to the right skills instead of exposing the full tool library. The result: token consumption drops from an estimated 884,000 tokens per query to roughly 1,160, alongside large gains in decomposition accuracy.

Enterprise AI agents are headed toward a nasty bottleneck: as soon as you give them “hundreds of tools,” they can’t reliably decide which one to use for each step. The work then turns into a routing problem, not a reasoning problem. Alibaba’s research team is tackling that exact failure mode with SkillWeaver, a framework that builds an execution graph for a task and selects the right skills per node, instead of dumping everything into the model’s context.

The headline number holds up in the researchers’ experiments on a custom benchmark: SkillWeaver reduces estimated context consumption from 884,000 tokens per query to roughly 1,160, a 99.9% reduction, while improving accuracy versus “naively exposing agents to an entire tool library.” For decision-makers, the immediate consequence is simple: lower token usage means lower API costs and faster responses, without giving up the ability to orchestrate multi-tool workflows like downloading datasets, transforming information, and creating visual reports.
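The arithmetic behind that percentage is easy to check. The short Python sketch below uses only the two estimated token counts quoted above; the function and variable names are illustrative and do not come from the research.

```python
def token_reduction(before: int, after: int) -> float:
    """Return the percentage reduction in tokens from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Figures reported in the article (both are estimates per query).
baseline_tokens = 884_000
skillweaver_tokens = 1_160

reduction = token_reduction(baseline_tokens, skillweaver_tokens)
ratio = baseline_tokens / skillweaver_tokens

print(f"reduction: {reduction:.1f}%")            # reduction: 99.9%
print(f"context is about {ratio:.0f}x smaller")  # context is about 762x smaller
```

Put differently, the full-library prompt is roughly 762 times larger than the routed one, which is where the cost and latency argument comes from.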

Why this matters now: modern LLM agent systems increasingly rely on “skills,” which are modular, reusable tool specifications documented in structured natural language. Enterprise environments don’t just have one tool. They have massive ecosystems, including agent frameworks that orchestrate multi-tool environments built on standards like the Model Context Protocol (MCP). But the moment your skill library grows, single-step routing approaches break down. Real requests are compositional. A business prompt like “Download the dataset, transform it, and create visual reports” cannot be fulfilled by one tool. You need a sequenced plan: an API client, a data processor, and a visualization tool, wired together coherently.

SkillWeaver is designed for that compositional reality. The system frames the challenge as “compositional skill routing”: given a complex user prompt and a vast library of tools, the agent must (1) break the prompt into a sequence of atomic sub-tasks, (2) map each sub-task to the single best available skill, and (3) compose those selected skills into an executable plan.

To do that, SkillWeaver runs through three stages: Decompose, Retrieve, and Compose. In Decompose, an LLM breaks the user’s query into sub-tasks, each requiring one skill. In Retrieve, the system uses an embedding model to compare each sub-task to the skill library and pull a shortlist of candidate tools. In Compose, a planner evaluates how the retrieved candidates work together. It checks inter-skill compatibility, then creates a Directed Acyclic Graph (DAG) that captures dependencies, so tasks that are independent can potentially execute in parallel.
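To make the Compose stage concrete, here is a minimal Python sketch of how a dependency graph can be grouped into batches that could run in parallel. It is not SkillWeaver’s planner: the plan contents and the `parallel_batches` helper are hypothetical, and the sketch relies only on the standard library’s `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group DAG nodes into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock its successors
    return batches


# Hypothetical plan: each key is a sub-task, each value is the set of
# sub-tasks it depends on. The names are invented for illustration.
plan = {
    "download_dataset": set(),
    "fetch_exchange_rates": set(),
    "transform_data": {"download_dataset", "fetch_exchange_rates"},
    "create_report": {"transform_data"},
}

for batch in parallel_batches(plan):
    print(batch)
```

In this toy plan the two download steps share the first batch because neither depends on the other, while the transform and report steps must wait their turn.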

The tricky part is that LLMs often generate generic step descriptions that do not match the technical vocabulary of the actual tools stored in the library. That is where Skill-Aware Decomposition (SAD) enters as a feedback loop. SAD drafts an initial plan, runs a preliminary search for loosely matching skills, then feeds those retrieved skills back into the LLM as hints. The LLM then revises its decomposition so the granularity and wording better align with the skills that actually exist.
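The feedback loop can be sketched in a few lines of Python. This is an illustration of the idea as described above, not the researchers’ code: the `decompose` and `retrieve` callables stand in for the LLM and the retriever, the number of rounds is an assumption, and the fake skill names exist only to make the example runnable.

```python
from typing import Callable

Decomposer = Callable[[str, list[str]], list[str]]  # (query, hints) -> sub-tasks
Retriever = Callable[[str], list[str]]              # sub-task -> candidate skills


def skill_aware_decompose(query: str, decompose: Decomposer,
                          retrieve: Retriever, rounds: int = 1) -> list[str]:
    """Draft a plan, retrieve loosely matching skills, then revise with hints."""
    steps = decompose(query, [])  # initial draft with no hints
    for _ in range(rounds):       # assumption: one revision round by default
        hints: list[str] = []
        for step in steps:
            for skill in retrieve(step):
                if skill not in hints:  # keep hints unique, in first-seen order
                    hints.append(skill)
        steps = decompose(query, hints)  # revised plan, grounded in real skills
    return steps


# Stand-ins for the LLM and the retriever (hypothetical, for demonstration).
FAKE_SKILLS = {"data": "csv_downloader", "usable": "table_transformer",
               "results": "chart_builder"}


def fake_retrieve(step: str) -> list[str]:
    return [skill for keyword, skill in FAKE_SKILLS.items() if keyword in step]


def fake_decompose(query: str, hints: list[str]) -> list[str]:
    if not hints:  # generic wording, as an unguided model might produce
        return ["get the data", "make it usable", "show results"]
    return [f"run {skill}" for skill in hints]  # wording aligned to the library


print(skill_aware_decompose("Download, transform, report",
                            fake_decompose, fake_retrieve))
```

The point of the design is visible even in the toy version: the second call to the decomposer sees the vocabulary of skills that actually exist, so its steps are phrased in terms the retriever can match.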

The performance claims come from an enterprise-style benchmark called CompSkillBench, built from 300 multi-step queries across different difficulty levels. The environment uses a library of 2,209 real-world skills sourced from the public MCP ecosystem across 24 functional categories, including cloud infrastructure, finance, and databases. For the core engine, the researchers primarily used Qwen2.5-7B-Instruct, a lightweight 7-billion parameter model, for task decomposition, paired with a semantic search retriever using MiniLM with a FAISS index.
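For readers who want a feel for the retrieval step, the Python sketch below ranks skills by cosine similarity to a sub-task. It deliberately swaps in a toy word-count embedding and a brute-force scan in place of MiniLM and FAISS so that it runs with no dependencies; the three-skill library and its descriptions are invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(sub_task: str, library: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k skills whose descriptions best match sub_task."""
    query = embed(sub_task)
    ranked = sorted(library.items(),
                    key=lambda item: cosine(query, embed(item[1])),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical skill library: name -> natural-language description.
library = {
    "csv_downloader": "download a csv dataset from a url",
    "table_transformer": "transform and clean tabular data",
    "chart_builder": "create charts and visual reports from data",
}

print(top_k("download the dataset", library, k=1))  # ['csv_downloader']
```

A production system would replace `embed` with a learned encoder and the sorted scan with an approximate nearest-neighbor index, which is what makes searching thousands of skills cheap.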

SkillWeaver was compared against three setups: a brute-force “LLM-Direct” method that stuffs all tool names into the prompt of a large model, a vanilla LLM-based decomposition without SAD, and a ReAct-style agent loop. The biggest bottleneck was decomposition. In the vanilla setup, the 7B model achieved a decomposition accuracy of 51.0%, measured as predicting the correct number of steps. Activating SAD raised that to 67.7%, a gain of 16.7 percentage points. With a larger Qwen-Max model, accuracy reached 92%. On hard tasks requiring four to five distinct skills, SAD improved accuracy by 50%. One more nuance is especially relevant for buyers who think “bigger model equals better routing”: in the vanilla setup, a larger 14-billion-parameter model scored below the 7B model because it tended to over-decompose tasks into microscopic steps. SAD anchored the model with retrieved tool hints, which improved routing.
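The metric as described here, matching the number of predicted steps to the expected number, is simple enough to express directly. The Python sketch below is an interpretation of that description rather than the benchmark’s actual scoring code, and the sample plans are made up.

```python
def decomposition_accuracy(predicted_plans: list[list[str]],
                           expected_step_counts: list[int]) -> float:
    """Percentage of queries whose plan has the expected number of steps."""
    if len(predicted_plans) != len(expected_step_counts):
        raise ValueError("need one expected count per predicted plan")
    if not predicted_plans:
        raise ValueError("no plans to score")
    hits = sum(len(plan) == expected
               for plan, expected in zip(predicted_plans, expected_step_counts))
    return hits / len(predicted_plans) * 100.0


# Hypothetical predictions for four queries.
predicted = [
    ["download", "transform", "report"],               # correct: 3 steps
    ["download", "report"],                            # under-decomposed
    ["open", "read", "parse", "clean", "plot"],        # over-decomposed
    ["query database", "summarize"],                   # correct: 2 steps
]
expected = [3, 3, 3, 2]

print(decomposition_accuracy(predicted, expected))  # 50.0
```

The third plan shows the over-decomposition failure mode: five microscopic steps where three skills would do counts as a miss under this metric.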

Token efficiency is where the business case gets sharper. The LLM-Direct baseline, even using Qwen-Max, struggled when flooded with tool options: it retrieved the right tool category only 21.1% of the time. SkillWeaver’s targeted retrieve-and-route approach vastly outperformed it while cutting estimated context usage from 884,000 tokens to roughly 1,160 tokens per query, a 99.9% reduction. The ReAct baseline scored 0% on decomposition accuracy because its loop collapsed multi-step plans into isolated actions rather than explicitly mapping steps to skills.

Second-order implications for executives: routing quality is not just an engineering detail; it becomes the controlling variable for cost, speed, and reliability. If decomposition drives both accuracy and token spend, then procurement decisions about which model to pay for may matter less than how well your system aligns decomposition vocabulary to your tool ecosystem. That is a board-level distinction: SkillWeaver suggests that you can get enterprise-grade orchestration without turning your context window into a dumpster fire, and you can do it with a decomposition process that is explicitly tool-aware.
