Trunk Tools cuts document review from 60 to 10 days with a purpose-built AI stack
Construction automation gets faster, more reliable, and more agent-ready by replacing generic LLMs with a three-layer architecture.

Trunk Tools, led by founder and CEO Sarah Buchner, built a specialized three-layer AI stack (perception, semantics, agents) to shrink construction document review cycles from 60 days to 10. The result is industry automation that can reason over large documentation sets while reducing costly field errors, offering a playbook for other verticals stuck in document chaos.
Most verticals do not run on neat SaaS tables. They run on ugly, inconsistent documents, proprietary formats, and long-running workflows that general-purpose AI tends to treat as “probably something.” That is the problem Trunk Tools decided to bully into submission, and the payoff is unusually concrete: it says its approach cut document review cycles from 60 days down to 10.
Trunk Tools, a construction project management company founded and led by CEO Sarah Buchner (a former carpenter), built what it calls a specialized three-layer architecture: perception, semantics, and agents. The stack is designed around highly detailed, domain-specific data. In Trunk’s framing, the point is not just better extraction, but higher accuracy and higher relevance, which it says prevents costly field errors and enables autonomous agents to reason over millions of pages of documentation.
Why does this matter beyond construction? Because the mainstream LLM story is still too broad for work that lives and dies on context. Foundation models are optimized for breadth, not depth, and several experts quoted in the piece argue that “general-purpose LLMs are trained to be okay at everything, so they're weak at anything niche.” In practice, that weakness shows up as unreliability on jargon-dense, abbreviation-heavy, format-specific inputs, plus the unspoken context practitioners “just know.” There is also a data reality check: the most valuable enterprise information typically never made it into pretraining anyway, because it sits in internal systems and proprietary formats. RAG can help by retrieving better facts, but it does not automatically solve domain reasoning.
Trunk’s architectural answer is to stop asking a general model to do everything and start structuring the problem the way the work actually happens. The piece lays out the core logic for why: in highly specialized domains, “data dumps” into LLMs do not cut it. Trunk’s CTO Amrish Kapoor explains that many transformers are probabilistic models: given an image, they return what it “probably” is, for example a tree. That uncertainty becomes dangerous for symbolic interpretation, where a small difference can completely change meaning. In construction documents, even a 2-millimeter-wide symbol can mean something different depending on where it is placed. Context limits are not the only constraint, either: Kapoor points to long-term project memory, across months and years, as another gap that probabilistic systems struggle to handle.
So Trunk breaks the workflow into three layers. First is perception: reading and extracting data from messy inputs like PDFs, drawings, or scans. Next is the semantic or graph layer: turning that extracted data into meaning by understanding relationships. Buchner describes it as teaching AI to read construction drawing language, where something is not always explicitly labeled. A door might be an arc on a wall, and the “language” is learned through years of practice, then mapped to meaning through the semantic layer. That semantic layer can connect a door to the drawing, the specification, and the trade that installs it. Instead of just asking “is there a door here?”, the system is aimed at questions like “does this door create a problem down the line?” Buchner ties the cost of errors to time, arguing that conflicts found in design are relatively low cost compared to problems caught in the field, which can cost tens of thousands of dollars.
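To make the semantic layer concrete, here is a minimal sketch of the idea in Python, not Trunk's actual implementation. It assumes a tiny in-memory graph; the `Graph` class, the relation names (`shown_on`, `specified_by`, `installed_by`), and every identifier such as `door-101` are hypothetical. The point is only that once a door is linked to its drawing, specification, and trade, a question about downstream problems becomes a graph query.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny directed graph: (source, relation) -> set of targets."""
    edges: dict = field(default_factory=dict)

    def link(self, source: str, relation: str, target: str) -> None:
        self.edges.setdefault((source, relation), set()).add(target)

    def targets(self, source: str, relation: str) -> set:
        return self.edges.get((source, relation), set())


def find_unassigned(graph: Graph, elements: list) -> list:
    """Return elements that have a specification but no installing trade."""
    return [
        e for e in elements
        if graph.targets(e, "specified_by") and not graph.targets(e, "installed_by")
    ]


graph = Graph()
# All identifiers below are invented for illustration.
graph.link("door-101", "shown_on", "drawing-A")
graph.link("door-101", "specified_by", "spec-doors")
graph.link("door-101", "installed_by", "trade-doors")
graph.link("door-102", "shown_on", "drawing-A")
graph.link("door-102", "specified_by", "spec-doors")  # no trade linked yet

print(find_unassigned(graph, ["door-101", "door-102"]))  # ['door-102']
```

A production system would presumably use a real graph store and far richer relations; the sketch only shows why linking entities matters more than detecting them.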
On top of those layers come LLMs and agentic workflows. At a high level, Trunk’s system identifies document types (drawings, schedules, paragraph text) and extracts information based on content, then “transforms and augments” the data to trigger workflows. The examples in the piece are practical: an agent can review an architecture bulletin and produce a visual overlay comparing an older version and a newer version, flagging additions and removals, then generate a written narrative that describes the changes in simple terms. That helps users understand what changed and coordinate with trade partners on updated pricing and change orders.
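The version-comparison step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the company's method: it assumes each revision has already been extracted into a map from element ID to description, the function names `diff_versions` and `narrate` are hypothetical, and a plain template stands in for the written narrative that an agent would generate.

```python
def diff_versions(old: dict, new: dict) -> dict:
    """Compare two {element_id: description} maps from successive revisions."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}


def narrate(diff: dict) -> str:
    """Turn a diff into a one-line plain-language summary."""
    parts = [
        f"{label}: {', '.join(diff[label])}"
        for label in ("added", "removed", "changed")
        if diff[label]
    ]
    return "; ".join(parts) if parts else "no changes"


# Invented sample data for two revisions of the same sheet.
old_rev = {"door-101": "single door", "window-7": "fixed window"}
new_rev = {"door-101": "double door", "door-103": "single door"}

summary = narrate(diff_versions(old_rev, new_rev))
print(summary)  # added: door-103; removed: window-7; changed: door-101
```

The structured diff is the part that can feed pricing and change-order conversations; the narrative is just a readable view of it.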
Trunk also spends time on the boring parts that are usually where these projects fail: training data quality, evaluation, and shipping reliability. The article says Trunk trains all three layers on very specific datasets from customers, with explicit permissions covering auto-labeling and IP. Customers can opt out of having Trunk train on their data. Trunk says data is deidentified and aggregated, and it also collects “tons more” labeled data via pipelines like 3D building information modeling (BIM). It claims it only ships agents that achieve around 95% accuracy, and it uses continuous evaluation pipelines based on customer and expert ground truth. The piece also notes an evaluation technique: an “LLM as a judge” model that scores performance both subjectively and objectively. Kapoor describes the scoring as more than just right or wrong, because subjective judgments need nuance, and judge frameworks can create composite scores that aggregate different metrics and risk tests. There is a tradeoff to watch: Buchner notes latency can be challenging, and higher reasoning capacity increases latency risk. Trunk says it has evaluation criteria to measure latency changes whenever it updates infrastructure, agents, and API calls.
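A release gate of this kind might look like the following Python sketch. Only the roughly 95% accuracy bar comes from the article; the metric names, the weights, the 20% latency allowance, and the functions `composite_score` and `ready_to_ship` are all assumptions made for illustration.

```python
def composite_score(metrics: dict, weights: dict) -> float:
    """Weighted average of per-metric scores, each in the range 0..1."""
    total_weight = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight


def ready_to_ship(
    accuracy: float,
    latency_ms: float,
    baseline_latency_ms: float,
    min_accuracy: float = 0.95,   # the ~95% bar cited in the article
    max_slowdown: float = 1.2,    # assumed: allow up to 20% slower than baseline
) -> bool:
    """Gate a release on both accuracy and latency regression."""
    fast_enough = latency_ms <= baseline_latency_ms * max_slowdown
    return accuracy >= min_accuracy and fast_enough


# Hypothetical judge output: an objective and a subjective metric.
judge_metrics = {"correctness": 0.96, "clarity": 0.88}
judge_weights = {"correctness": 3, "clarity": 1}
print(round(composite_score(judge_metrics, judge_weights), 2))  # 0.94

print(ready_to_ship(accuracy=0.96, latency_ms=900, baseline_latency_ms=800))   # True
print(ready_to_ship(accuracy=0.97, latency_ms=1000, baseline_latency_ms=800))  # False
```

The second call fails on latency alone, which mirrors the tradeoff Buchner describes: an agent can be accurate enough and still miss the gate if added reasoning makes it too slow.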
Zoom out and you can see the second-order implication for executives: the future is not “one model for every document.” It is a stack that treats documents like structured work products, with explicit preprocessing, domain-specific semantics, and agent workflows governed by continuous evaluation. For decision-makers in regulated and high-stakes industries, the appeal is obvious: errors can become expensive very fast, and the cost of reliability failures compounds with time. The construction numbers may be eye-catching, but the pattern is bigger. Enterprises across legal, healthcare, and other document-heavy verticals can likely get similar gains by building hybrid systems that separate extraction, domain meaning, and output formatting requirements, instead of expecting general LLMs to reliably handle the niche on their own.

