LCLMs cut LLM context 16x, speeding outputs 8.8x without accuracy collapse
NYU-led research compresses input before the decoder prefill, shrinking compute and memory costs for long-context agents.

An NYU-led research team, with collaborators including Columbia, Princeton, University of Maryland, Harvard, and Lawrence Livermore National Laboratory, introduced Latent Context Language Models (LCLMs) that compress context before it hits an LLM decoder. For decision-makers, the consequence is simpler economics for long-context RAG and agents, with measured speedups and accuracy tradeoffs that beat the KV-cache compression baselines the team tested.
If you are building with long-context LLMs, you already feel the squeeze. The longer an agent runs, the more tokens stack up from retrieved documents, conversation history, and even reasoning traces. That ballooning context turns into a real computational bottleneck: memory and compute rise with every added token, and in production you cannot just “wait for faster GPUs.”
This week, a paper from the NYU-led team claims a path around that bottleneck. Their Latent Context Language Models, or LCLMs, compress input context before it reaches the decoder. The punchline is specific: at 16x compression, LCLMs produced output 8.8 times faster than KV cache baselines on the RULER long-context benchmark, while avoiding the accuracy collapse that has limited most compression approaches in real serving.
To understand why this matters, you have to look at how most systems handle “long context” today. A dominant approach in the field is KV cache compression. Even when teams compress key-value caches, those methods still tend to materialize the full KV cache first, then evict entries. In other words, you pay most of the compute and memory bill up front, then try to salvage something later. LCLMs go earlier in the pipeline. They compress the input token sequence before decoder prefill, so higher compression ratios directly reduce decoder-side compute and memory.
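The difference in where the savings land can be put into a few lines of Python. This is a rough accounting sketch, not the paper's cost model: the names `DecoderLoad`, `kv_eviction`, and `pre_prefill_compression` are hypothetical, the 128,000-token input is an arbitrary example, and the encoder's own cost is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class DecoderLoad:
    prefill_positions: int  # positions the decoder processes during prefill
    peak_kv_entries: int    # most KV entries held per layer at any moment
    final_kv_entries: int   # KV entries kept around for decoding


def kv_eviction(n_tokens: int, ratio: int) -> DecoderLoad:
    """Post-hoc KV compression: build the full cache, then evict entries."""
    kept = n_tokens // ratio
    return DecoderLoad(n_tokens, n_tokens, kept)


def pre_prefill_compression(n_tokens: int, ratio: int) -> DecoderLoad:
    """Compress the input first, so the decoder only ever sees the latents."""
    latents = n_tokens // ratio
    return DecoderLoad(latents, latents, latents)


if __name__ == "__main__":
    n, ratio = 128_000, 16  # illustrative values only
    print("evict after prefill:   ", kv_eviction(n, ratio))
    print("compress before prefill:", pre_prefill_compression(n, ratio))
```

Both paths end with the same number of cached entries, but only the second one shrinks the prefill work and the peak cache size, which is the distinction the researchers are drawing.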
The reported accuracy results are where the headline gets legs. On the RULER benchmark, the team reports that at 4x compression, accuracy is 91.76%, compared with 94.41% with no compression. That is a drop of 2.65 percentage points while cutting context to a quarter of its original size. At 16x compression, 93.75% of input tokens are removed, and accuracy falls to 75.06%, a drop of 19.35 points. Every KV cache method tested at the same compression ratio scored lower. So yes, more aggressive compression costs accuracy. But the research suggests the “drop per token saved” is better when you compress before decoder prefill, rather than squeezing the cache after it is already built.
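The arithmetic behind those figures is easy to check. The short script below uses only the accuracy numbers reported above; the helper name `fraction_removed` is ours, not the paper's.

```python
def fraction_removed(ratio: float) -> float:
    """Share of input tokens eliminated at a given compression ratio."""
    return 1.0 - 1.0 / ratio


BASELINE_ACCURACY = 94.41                  # RULER, no compression (reported)
REPORTED_ACCURACY = {4: 91.76, 16: 75.06}  # RULER, by compression ratio (reported)

if __name__ == "__main__":
    for ratio, accuracy in REPORTED_ACCURACY.items():
        removed_pct = fraction_removed(ratio) * 100
        drop = BASELINE_ACCURACY - accuracy
        print(f"{ratio}x: {removed_pct:.2f}% of tokens removed, "
              f"{drop:.2f} point accuracy drop")
```

At 4x, three quarters of the tokens are gone for a 2.65 point drop; at 16x, 93.75% are gone for a 19.35 point drop.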
This also is not just a one-benchmark flex. The paper’s gains hold on shorter inputs too, including GSM8K math word problems, where the full prompt is compressed rather than only retrieved documents. In that setup, LCLMs outscored every other method tested regardless of compression ratio. For enterprises, that is a meaningful signal because many production workloads are not uniformly “long-document retrieval.” They are mixed: short prompts, sometimes long RAG, sometimes multi-turn. A compression technique that only works at one range of context lengths is a deployment headache. LCLMs aim to be more broadly useful.
Under the hood, LCLMs use an encoder-decoder compression setup built for end-to-end handling of long contexts, pairing a 0.6B encoder with a 4B decoder. The encoder compresses blocks of input tokens into shorter sequences of latent embeddings, and the decoder processes those in place of the original tokens. Training ran across more than 350 billion tokens. The training recipe mixes three data types: continual pre-training data with compressed and uncompressed spans interleaved throughout, supervised fine-tuning data covering reasoning and long-context tasks, and an auxiliary reconstruction task that pushes the encoder to retain fine-grained detail. The paper also describes an architecture search and reports that scaling the decoder matters more than scaling the encoder, a practical point for teams thinking about compute budgets.
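To make the shape of that compression step concrete, here is a toy sketch. It is not the LCLM encoder: mean pooling stands in for the learned 0.6B model, and the function name `mean_pool_blocks`, the block size, and the embedding width are all assumptions chosen for illustration. The only thing it is meant to show is that a block of token embeddings goes in and a shorter sequence of latent vectors comes out.

```python
from typing import List

Vector = List[float]


def mean_pool_blocks(token_embeddings: List[Vector], block_size: int) -> List[Vector]:
    """Collapse each block of token embeddings into one latent vector.

    Mean pooling is a stand-in here; the real encoder is a trained model.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    latents: List[Vector] = []
    for start in range(0, len(token_embeddings), block_size):
        block = token_embeddings[start:start + block_size]
        dim = len(block[0])
        latents.append([sum(vec[i] for vec in block) / len(block) for i in range(dim)])
    return latents


if __name__ == "__main__":
    # 64 fake token embeddings of width 4, compressed 16x into 4 latents.
    tokens = [[float(t)] * 4 for t in range(64)]
    latents = mean_pool_blocks(tokens, block_size=16)
    print(len(tokens), "token embeddings ->", len(latents), "latent embeddings")
```

In the real system the decoder would consume those latents in place of the original tokens, and the reconstruction objective described above is what pushes the learned encoder to keep far more detail than an average ever could.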
Where this fits in an agentic stack is arguably the most practical part. LCLMs are open-sourced on Hugging Face. Goldblum, a co-lead advisor on the project and a researcher at Columbia University, told VentureBeat that “You can simply swap out LCLMs for any existing LLM.” The integration story is direct: when you retrieve data such as documents and want to dump it into the model’s context, you run those documents through the LCLM compressor first. The researchers also demonstrated agents that selectively decompress useful text, described by Goldblum as akin to a human skimming content before zooming in on relevant details.
But there are sharp edges executives will want to understand before they bet a roadmap on this. Goldblum cautioned that teams integrating into existing agentic pipelines need to tune their RAG systems accordingly, since compression behavior can affect retrieval quality metrics. And he directly called out a gap: “We also haven't worked on online compression of reasoning traces.” The naive idea of periodically compressing the trace while generating might work, but “that remains to be determined.” In other words, LCLMs address context growth from retrieved documents and conversation history well, but reasoning trace compression is a different problem.
For enterprises, this lands in a market reality that is already turning into budget pressure. Context windows are growing faster than inference infrastructure can keep up, and companies are spending to patch the gap. VB Pulse Q1 2026 survey data from organizations with 100-plus employees shows hybrid retrieval adoption intent tripling from 10.3% in January to 33.3% in March. Retrieval optimization overtook evaluation as the top investment priority by March, reaching 28.9% of qualified respondents. LCLMs fit neatly into that “retrieval optimization” momentum by promising to shrink what your model has to ingest and to speed up what it produces.
Three things stand out for teams evaluating production fit. First, inference cost scales with context length. At 1 million tokens, uncompressed inference with standard KV cache methods runs out of memory on a single H200 GPU. The paper reports that LCLMs at 16x compression remain within memory bounds at that context length. Second, RAG integration requires tuning. Teams will need to validate compression behavior against their retrieval quality metrics before deploying at scale. Third, reasoning trace compression is unsolved, which matters for long-horizon agents whose traces are part of what they “carry forward.”
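A back-of-the-envelope estimate shows why the first point bites. The script below uses the standard KV cache size formula, but the model dimensions are hypothetical placeholders (36 layers, 8 KV heads, head dimension 128, 16-bit values), not the LCLM decoder's published configuration, and it ignores weights, activations, and the encoder entirely.

```python
def kv_cache_bytes(positions: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The factor of 2 covers one key and one value vector per position,
    per layer, per KV head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * positions


# Hypothetical decoder dimensions, chosen only for illustration.
ASSUMED_DECODER = dict(layers=36, kv_heads=8, head_dim=128)
CONTEXT_TOKENS = 1_000_000

if __name__ == "__main__":
    for ratio in (1, 4, 16):
        positions = CONTEXT_TOKENS // ratio
        gb = kv_cache_bytes(positions, **ASSUMED_DECODER) / 1e9
        print(f"{ratio:>2}x: {positions:>9,} decoder positions, "
              f"~{gb:.1f} GB of KV cache")
```

Under these assumed dimensions, the cache alone comes to roughly 147 GB uncompressed versus about 9 GB at 16x. Real numbers depend on the actual model and serving stack, so treat this as a sizing exercise rather than a reproduction of the paper's measurement.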
The strategic stake is simple: if you can process much larger contexts at lower decoder memory and compute costs, you change the economics of long-context RAG and agent reliability. Goldblum frames it as both access to much larger contexts and an unlock for multiscale approaches where models skim vast amounts super fast, then zoom in and fully read only the most useful portion. For boards and operators, the question is no longer whether context windows are expanding. It is whether your cost structure and serving architecture can keep up. LCLMs are one of the first approaches presented in a way that looks like it could actually ship.