DocLang aims to make PDFs unnecessary for AI parsing, slashing token costs up to 30x

Linux Foundation pushes an open AI-native document format built for LLM tokenizers, not human rendering.

By Omar Al-Balawi, Technology Correspondent, The Executives Brief

about 7 hours ago · 4 min read


Executive summary

The Linux Foundation's LF AI & Data Foundation formed a working group to steer DocLang, an AI-friendly document format backed by IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis. The group argues that today's formats like PDF and HTML lose structure and meaning in AI pipelines, driving higher token costs and worse accuracy.

If your enterprise AI reads PDFs today, you may be paying a tax that is invisible on your dashboard and brutal in your token meter. The new DocLang push, from the LF AI & Data Foundation at the Linux Foundation, is explicitly trying to make that tax optional by standardizing an AI-native way to represent document structure, layout, meaning, and governance.

At the core is the claim that existing formats were built for humans, not machines. DocLang is designed for LLM tokenizers with markup that maps DocLang elements to LLM tokens on a 1-to-1 basis, using a limited XML vocabulary aligned to those tokenizers. It is described as lossless, so the AI conversion should not erase valuable information when moving from documents into AI-ready inputs. And the payoff, according to ABBYY benchmarks cited in the spec coverage, is that costs can fall dramatically, with “4x to more than 30x lower cost” depending on the model. In a concrete example, an IBM 2025 annual report PDF produces 8,421 input tokens and 512 output tokens, while a DocLang version uses 5,310 input tokens and 498 output tokens, with lower latency (2.7s vs 4.2s) and better quality, where the PDF case missed a subsection and mangled a table merger.
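The figures in the annual report example can be checked with a few lines of arithmetic. The sketch below uses only the token counts and latencies cited above; the `Run` class and `reduction_factor` helper are hypothetical names introduced for illustration, not part of DocLang or any vendor tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Token counts and latency for one document-processing run."""
    input_tokens: int
    output_tokens: int
    latency_s: float

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


# Figures as cited in the article for the IBM 2025 annual report example.
pdf = Run(input_tokens=8_421, output_tokens=512, latency_s=4.2)
doclang = Run(input_tokens=5_310, output_tokens=498, latency_s=2.7)


def reduction_factor(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after


input_factor = reduction_factor(pdf.input_tokens, doclang.input_tokens)
total_factor = reduction_factor(pdf.total_tokens, doclang.total_tokens)
latency_factor = reduction_factor(pdf.latency_s, doclang.latency_s)

print(f"input tokens: {input_factor:.2f}x fewer")   # 1.59x
print(f"total tokens: {total_factor:.2f}x fewer")   # 1.54x
print(f"latency:      {latency_factor:.2f}x faster")  # 1.56x
```

On these numbers the DocLang run uses roughly 1.6x fewer input tokens, which is well short of the “4x to more than 30x lower cost” range. The article does not say how that range was measured or what it includes, so this single example should not be read as representative of the benchmark.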

That is the headline. Here is the reasoning, and why it matters for executives who are already fighting the “AI will be expensive” narrative. Token pricing is variable, and document length and complexity can turn ingestion into a recurring cost center, not a one-time migration. The source points to AI Cost Check guidance that a baseline OCR pass over a PDF requires about 1,200 input tokens and 150 output tokens. One-off use might be tolerable. At scale, variability and long documents change the economics fast, especially if teams reach for an expensive frontier model. If your pipeline spends tokens deciphering layout rather than extracting meaning, you get two problems at once: higher bills and lower reliability.
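To see how a tolerable one-off cost becomes a recurring line item, here is a minimal sketch of the arithmetic. It uses the 1,200 input and 150 output token baseline cited above and treats it as a per-document figure, which is an assumption because the article does not state the unit. The per-million-token prices and the `ingestion_cost_usd` function are hypothetical placeholders, not any provider's actual rates.

```python
def ingestion_cost_usd(
    docs: int,
    input_tokens_per_doc: int,
    output_tokens_per_doc: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate total spend for processing `docs` documents once."""
    input_cost = docs * input_tokens_per_doc / 1_000_000 * usd_per_million_input
    output_cost = docs * output_tokens_per_doc / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Baseline cited in the article (assumed here to be per document).
BASELINE_INPUT = 1_200
BASELINE_OUTPUT = 150

# Hypothetical price points in USD per million tokens: (input, output).
PRICE_TIERS = {
    "budget model (hypothetical)": (0.50, 1.50),
    "frontier model (hypothetical)": (3.00, 15.00),
}

for tier, (price_in, price_out) in PRICE_TIERS.items():
    for docs in (1_000, 1_000_000):
        cost = ingestion_cost_usd(
            docs, BASELINE_INPUT, BASELINE_OUTPUT, price_in, price_out
        )
        print(f"{tier:32s} {docs:>9,} docs: ${cost:,.2f}")
```

Because cost is linear in both document count and tokens per document, any reduction in tokens per document carries straight through to the bill. Swapping in your own volumes and contracted prices gives a first-order estimate of what a given token reduction would be worth.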

DocLang’s spec authors argue that the real issue is structural and semantic decay. PDFs, Markdown, HTML, and LaTeX do not keep the same information once an AI system has to convert them into tokens. In the coverage, the spec explains that Markdown lacks sufficient scope, HTML is excessively verbose, and LaTeX allows too much ambiguity. The argument is not merely “these formats are hard.” It is that when AI models turn them into tokens, you lose semantic information, structural relationships, or geometric context. That forces models into guesswork.

ABBYY frames the downstream effect in sharper business terms. Jon Knisley, AI Value and Enablement Lead at ABBYY, says that every time a PDF enters an AI pipeline, “structure, meaning and layout get lost,” so accuracy is bottlenecked by document quality rather than model quality. Teams compensate by building custom parsers at every integration point, leading to brittle, one-off work and an engineering sprint whenever a new document type arrives. Knisley also ties ambiguity directly to cost and risk: ambiguous structure drives guesswork, which increases hallucination risk and burns tokens on deciphering layout instead of extracting meaning. His expectation for DocLang is better accuracy, lower costs, fewer tokens consumed, faster performance, and more consistent outputs. The exact savings vary by use case and document complexity, but the initial benchmarks cited fall in the “4x to more than 30x lower cost” range.

There is also a governance angle, and for enterprise leaders it is not a side quest. Knisley notes that document provenance data and metadata can get stripped when documents move through systems. DocLang is described as keeping that information attached, with governance and provenance treated as part of the representation rather than something bolted on later. That matters because compliance-friendly pipelines are often punished by translation layers: every conversion can become a new failure mode for auditability, classification, or lineage.

DocLang is being built as an open standard, not a private vendor format. The coalition behind it includes IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis, organized under a Linux Foundation working group. The coverage also offers useful background: in late 2024, IBM developed an open source toolkit called Docling to facilitate AI document parsing, positioned as similar in spirit to Microsoft’s MarkItDown or the Marker project. DocLang builds on that foundation by standardizing the exchange of structured output across systems, aiming to help enterprises feed files to AI systems in a more deterministic way.

Still, the source warns not to overread early adoption. Knisley says it is “still early” and does not overstate uptake. But the group is actively inviting more technology providers and enterprises to join, and the early response is described as encouraging. For boards and executive teams, the strategic stakes are straightforward: if DocLang succeeds, it shifts document ingestion from bespoke, brittle parsing to a standardized representation optimized for LLM tokenizers. That can mean fewer integration projects, more predictable token spend, and better model performance on messy real-world enterprise documents. Or, if it stalls, teams that adopted it early could face the familiar question every enterprise innovation program eventually meets: do you wait for the standard to become dominant, or build your own path and accept the cost of divergence?


