Sapient’s HRM-Text trains from scratch in 1.9 days for about $1,500

A new model architecture swaps brute-force token prediction for instruction-response training, cutting cost and compute.

By Abdullah Al-Otaibi, Business Desk, The Executives Brief

about 2 months ago·4 min read

Executive summary

Sapient Intelligence researchers developed HRM-Text, replacing standard Transformers with a Hierarchical Reasoning Model (HRM) and training it from scratch on instruction-response pairs. For decision-makers, it challenges the assumption that foundation-model pretraining must be a multi-million-dollar, internet-scale enterprise.

Sapient Intelligence says researchers trained a foundation model from scratch in just 1.9 days on a cluster of 16 GPUs, with an estimated cost of about $1,500. That is the kind of number that makes the “just run next-token prediction on trillions of internet tokens” crowd pause.
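To put the headline figures in familiar units, the short calculation below converts the reported 16 GPUs, 1.9 days, and roughly $1,500 into GPU-hours and an implied hourly rate. The function names are illustrative, and the rate is only what the reported numbers imply, not a price quoted by Sapient.

```python
# Figures as reported in the article.
NUM_GPUS = 16
TRAINING_DAYS = 1.9
ESTIMATED_COST_USD = 1500.0


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for a run that keeps every GPU busy the whole time."""
    return num_gpus * days * 24.0


def implied_rate_per_gpu_hour(cost_usd: float, num_gpus: int, days: float) -> float:
    """Cost divided by GPU-hours: the hourly rate the reported total implies."""
    return cost_usd / gpu_hours(num_gpus, days)


total_gpu_hours = gpu_hours(NUM_GPUS, TRAINING_DAYS)
rate = implied_rate_per_gpu_hour(ESTIMATED_COST_USD, NUM_GPUS, TRAINING_DAYS)

print(f"GPU-hours: {total_gpu_hours:.1f}")            # GPU-hours: 729.6
print(f"Implied rate: ${rate:.2f} per GPU-hour")      # Implied rate: $2.06 per GPU-hour
```

At roughly 730 GPU-hours, the run is small enough that the hourly rental price, not the cluster size, dominates the budget.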

The mechanism matters more than the speed. HRM-Text replaces standard Transformer-style computation with a highly sample-efficient Hierarchical Reasoning Model (HRM), first introduced last year, and it changes what the model is trained to do. Instead of brute-force autoregressive prediction on raw text, the researchers train exclusively on instruction-response pairs, using a task-completion objective in which the model is rewarded for the full response rather than for individual generated tokens. The result is a 1B-parameter model trained from scratch that they say scores 60.7% on MMLU, 84.5% on GSM8K, and 56.2% on MATH, results they report as competitive with, and in some cases surpassing, 2B to 7B parameter foundation models.

So what exactly is being challenged here? The original framing is blunt: training a foundation LLM from scratch costs millions and requires internet-scale data, which is why most enterprises do not bother. The existing “brute force” recipe is described in three steps: scrape the internet, run next-token prediction trillions of times, and then assume the model has developed an internal model of the world. Sapient’s researchers counter that nobody actually cares whether a model memorizes an exact sequence of words from an irrelevant source such as a random 2014 Reddit thread. In their view, compute is also wasted on reconstructing prompts that are already known at inference time.
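The article does not spell out HRM-Text's exact loss. One common way to avoid spending training signal on prompt tokens is response-only loss masking, sketched below on a toy example. This is a swapped-in, widely used technique and an assumption here, not Sapient's published objective; `response_only_loss` and the toy probabilities are hypothetical.

```python
import math


def response_only_loss(token_log_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model assigned to each target token.
    is_response:     True where the token belongs to the response, False for
                     prompt tokens (which are known at inference time anyway).
    """
    if len(token_log_probs) != len(is_response):
        raise ValueError("log-probs and mask must have the same length")
    kept = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    if not kept:
        raise ValueError("the mask selects no response tokens")
    return sum(kept) / len(kept)


# Toy sequence: three prompt tokens followed by two response tokens.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.1),
             math.log(0.5), math.log(0.25)]
mask = [False, False, False, True, True]

loss = response_only_loss(log_probs, mask)
print(f"{loss:.4f}")  # 1.0397
```

The poorly predicted prompt tokens (probabilities 0.1 and 0.2) contribute nothing to the loss, so no gradient is spent on reconstructing them.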

This is where enterprise economics enters the room, not just model architecture. In comments provided to VentureBeat, Guan Wang, CEO of Sapient Intelligence, framed the problem as the “economics of iteration.” He points to three compounding issues for enterprises: training is expensive, infrastructure is heavy, and experimentation cycles are too slow. The “scaling addiction” Wang references is the response pattern of “when the model fails, make it bigger, add more data, add more GPUs.” He argues that scaling has been working, but is now bumping into diminishing returns. The second-order effect, in Wang’s framing, is that more scale can mean more memorization, more latency, more infrastructure, and more vendor dependency, without necessarily improving reasoning.

If you are a CFO or a board member, the important question is not whether 1B beats 7B in a benchmark chart. It is whether the enterprise can iterate faster, keep proprietary data in-house, and still get reasoning that behaves. Wang describes a practical example: hedge funds, insurers, and banks with proprietary data such as internal research notes, transaction logic, compliance rules, analyst memos, risk models, and portfolio constraints. They may not want to send that data to an external frontier model. They also may not need a giant general-purpose model that memorized the internet. What they want, in Wang’s terms, is a compact reasoning core that can learn task structure, reason across rules and numbers, and run in a controlled environment.

HRM-Text tries to meet that demand by changing the pretraining recipe and the architecture at the same time. HRM was introduced in 2025 and is described as a fundamental departure from traditional Transformer models. It decouples computation into slow-evolving strategic layers and fast-evolving execution layers. In practice, the system uses two high-level cycles, each consisting of three fast L-module updates followed by a single slow H-module update. The paper frames this separation as mathematically necessary, not cosmetic: language, unlike a bounded, clean puzzle world, needs both fast local refinement and slow semantic stability.
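The update order described above (two high-level cycles, each with three fast L-module updates followed by one slow H-module update) can be sketched as a simple schedule. The scalar states and the two toy update rules below are placeholders invented for illustration; the real modules are neural networks whose internals the article does not describe.

```python
def hrm_update_schedule(high_level_cycles=2, l_steps_per_cycle=3):
    """Return the order of module updates, e.g. L, L, L, H, L, L, L, H."""
    schedule = []
    for _ in range(high_level_cycles):
        schedule.extend(["L"] * l_steps_per_cycle)  # fast execution updates
        schedule.append("H")                        # one slow strategic update
    return schedule


def toy_l_update(z_l, z_h, x):
    # Placeholder: the fast state moves toward the input plus the slow state.
    return 0.5 * z_l + 0.5 * (z_h + x)


def toy_h_update(z_h, z_l):
    # Placeholder: the slow state absorbs part of the refined fast state.
    return 0.5 * z_h + 0.5 * z_l


def run_toy_hrm(x, schedule, z_l=0.0, z_h=0.0):
    """Apply the schedule; the H state stays fixed during each run of L updates."""
    for step in schedule:
        if step == "L":
            z_l = toy_l_update(z_l, z_h, x)
        else:
            z_h = toy_h_update(z_h, z_l)
    return z_l, z_h


schedule = hrm_update_schedule()
print(schedule)  # ['L', 'L', 'L', 'H', 'L', 'L', 'L', 'H']
final_z_l, final_z_h = run_toy_hrm(1.0, schedule)
```

With the reported settings, one forward pass performs six fast updates and two slow ones, and the slow state changes only after each block of fast updates has finished.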

The complication is that recurrent loops that help with controlled, symbolic reasoning can become unstable when scaled to language tasks, specifically because of exploding or vanishing gradients. HRM-Text introduces two key fixes: MagicNorm, a specialized normalization technique intended to keep internal signals stable across loops, and a warm-up method during early training. During warm-up, the model is evaluated only on short, shallow reasoning loops. As training progresses, it is gradually given longer and deeper reasoning sequences. The researchers also changed the training objective away from next-token prediction and toward task completion, which aligns the incentives more closely with enterprise workflows: users ask for targeted answers to tasks, not arbitrary text continuation.
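The warm-up idea can be pictured as a schedule that maps the training step to an allowed loop depth. The sketch below uses a stepwise linear ramp; the function name, the ramp shape, and the values of `warmup_steps`, `min_cycles`, and `max_cycles` are all assumptions for illustration, since the article gives no schedule details.

```python
def loop_depth_at(step, warmup_steps, min_cycles=1, max_cycles=4):
    """Number of reasoning cycles allowed at a given training step.

    Starts at min_cycles, rises in equal-width stages during warm-up,
    and stays at max_cycles once warm-up is over.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if warmup_steps <= 0 or step >= warmup_steps:
        return max_cycles
    fraction_done = step / warmup_steps              # in [0, 1)
    stages = max_cycles - min_cycles + 1
    return min(max_cycles, min_cycles + int(fraction_done * stages))


depths = [loop_depth_at(s, warmup_steps=1000) for s in (0, 250, 500, 750, 1000)]
print(depths)  # [1, 2, 3, 4, 4]
```

Starting shallow keeps the number of recurrent steps that gradients pass through small while the weights are still close to their random initialization, which is the regime where exploding or vanishing gradients are most likely.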

Then there is the data itself. Instead of standard pipelines that require churning through trillions of words of raw internet text, HRM-Text was trained from scratch on a tightly curated dataset of just 40 billion tokens, all instruction-response pairs spanning general instructions, math, symbolic logic, textbook exercises, and rewritten knowledge. To further push reliance on the hierarchical structure, the training data explicitly strips out “thinking” tokens. The researchers evaluated HRM-Text on standard foundational AI benchmarks with emphasis on knowledge, reasoning, logic, math, and comprehension, testing against small models and both open-weight and fully open models.
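As an illustration of stripping “thinking” tokens from a training response, the sketch below removes spans wrapped in a `<think>...</think>` tag pair. The tag format is a hypothetical convention chosen for this example; the article does not say how such tokens were marked in Sapient's data.

```python
import re

# Hypothetical marker format; non-greedy so multiple spans are removed separately.
THINK_SPAN = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(response: str) -> str:
    """Drop every <think>...</think> span and keep only the final answer text."""
    return THINK_SPAN.sub("", response).strip()


example = "<think>12 * 3 = 36, then add 6.</think>\nThe answer is 42."
cleaned = strip_thinking(example)
print(cleaned)  # The answer is 42.
```

With the intermediate reasoning removed from the target text, any multi-step computation has to happen inside the model's recurrent loops rather than in generated tokens, which is the reliance on hierarchical structure the researchers describe.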

The immediate takeaway is the compute-to-performance shift. Pretraining a foundation model from scratch is typically portrayed as a multi-million-dollar endeavor reserved for tech giants. By comparison, Sapient reports HRM-Text was trained in 1.9 days on 16 GPUs. If that holds up beyond initial reports, it changes who gets to experiment. It also changes the board-level conversation from “Can we afford training?” to “Can we afford iteration cycles, and how quickly can we productize with proprietary constraints?” In a world where the training bottleneck and infrastructure burden have traditionally limited who can build reasoning models, HRM-Text’s approach makes foundational pretraining look less like an exclusive club and more like something enterprises can actually run themselves and pair with external knowledge stores.



