DiffusionGemma runs 4x faster than Gemma in parallel text generation

Google DeepMind’s open model changes the “token-by-token” default and hits local GPUs with real speed gains.

By Lama Al-Rashid, Technology Correspondent, The Executives Brief

about 5 hours ago·3 min read


Executive summary

Google DeepMind released DiffusionGemma, a new member of the Gemma 4 open model family, built to generate text blocks in parallel. For decision-makers, it raises the bar on open-model performance, especially for on-prem and cost-sensitive deployments.

Google DeepMind’s new open model, DiffusionGemma, can output around 700 tokens per second on an RTX 5090 and 1,000+ tokens per second on a single Nvidia H100. That is roughly four times the output rate of similarly sized autoregressive Gemma models, and the reason is the point: this model does not build text one token at a time.

Instead of the standard left-to-right, autoregressive approach, DiffusionGemma produces an entire block of text in parallel. Google’s description matters because it directly targets latency and efficiency, especially when you are running on local hardware like an Nvidia DGX or even a gaming GPU. In plain English, the model uses a canvas of placeholder tokens, updates it repeatedly, then “finalizes” a denoised block of tokens at the end. The speed boost is not just a benchmark flex. It changes the underlying compute pattern, which is where cost and responsiveness usually live or die in real deployments.

To understand why “parallel text generation” is a big deal, compare it to how most AI text models behave. Autoregressive models generate text sequentially, token by token. That can feel fast when you watch it, but it is still fundamentally limited to producing one token per step. DiffusionGemma borrows ideas from image generation models, which start from a field of random noise (“static”) and then denoise it step by step to arrive at the final output. Here, the denoising runs over a field of tokens, and the model makes multiple passes across the canvas to improve its estimate of the likely tokens.

The mechanics are more than technical trivia. The model starts with placeholder tokens across the canvas and processes the whole canvas multiple times. Those repeated passes let the model refine likely token choices across the whole block rather than committing each token in strict sequence. Only at the end does it finalize the output as one large block, the “denoised” text canvas. That shift is why speed can scale differently than in autoregressive generation, particularly when you are trying to keep time-to-first-output low.
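The refine-then-finalize loop described above can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm: the names `toy_model` and `denoise_block`, the tiny vocabulary, and the rule of keeping the highest-confidence guess per position are all assumptions made for illustration, and the "model" here is just a random stand-in.

```python
import random

MASK = "<mask>"  # placeholder token that fills the canvas before any pass


def toy_model(canvas, rng):
    """Hypothetical stand-in for one parallel forward pass.

    Returns a (token, confidence) guess for every canvas position at once.
    A real model would condition on the prompt and the current canvas;
    this stub only draws random guesses so the loop can be run.
    """
    vocab = ["the", "model", "fills", "a", "block", "in", "parallel"]
    return [(rng.choice(vocab), rng.random()) for _ in canvas]


def denoise_block(block_len, num_passes, seed=0):
    """Refine a whole block over several passes, then return it as one unit."""
    rng = random.Random(seed)
    canvas = [MASK] * block_len
    confidence = [-1.0] * block_len  # below any real confidence value
    for _ in range(num_passes):
        guesses = toy_model(canvas, rng)  # every position is scored together
        for i, (token, conf) in enumerate(guesses):
            # Assumed update rule: keep the most confident guess seen so far.
            if conf > confidence[i]:
                canvas[i], confidence[i] = token, conf
    return canvas  # the finalized, "denoised" block


if __name__ == "__main__":
    # 8 tokens produced with 4 model passes, versus 8 sequential steps
    # for a token-by-token generator of the same length.
    print(denoise_block(block_len=8, num_passes=4))
```

The point of the sketch is the loop shape: the number of model passes is chosen independently of the block length, whereas a token-by-token generator needs one step per token.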

DiffusionGemma is also designed to run in an environment executives actually care about: local inference. It is described as a Mixture of Experts (MoE) model with 26 billion total parameters, of which only 3.8 billion are activated during inference. That matters for deployment because it implies memory and compute requirements that can fit within an 18GB memory allotment on a high-end GPU. In other words, Google is making a strong case that this is not just for warehouse-scale systems.
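A quick back-of-the-envelope calculation puts those figures in proportion. The sketch below uses only the numbers quoted above; the helper names are hypothetical, and the per-parameter budget assumes that all 26 billion weights must sit in the 18GB allotment and that 1GB is 10^9 bytes, neither of which the article confirms.

```python
TOTAL_PARAMS_B = 26.0    # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 3.8    # parameters activated during inference, in billions
MEMORY_BUDGET_GB = 18.0  # memory allotment quoted in the article


def active_fraction(total_b, active_b):
    """Share of the model's parameters that is active during inference."""
    return active_b / total_b


def bytes_per_parameter(memory_gb, total_b):
    """Average memory budget per parameter.

    Assumption: every parameter is resident in memory and 1 GB = 1e9 bytes,
    so gigabytes divided by billions of parameters gives bytes per parameter.
    """
    return memory_gb / total_b


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
    budget = bytes_per_parameter(MEMORY_BUDGET_GB, TOTAL_PARAMS_B)
    print(f"active share of parameters: {frac:.1%}")
    print(f"memory budget per parameter: {budget:.2f} bytes")
```

Under those assumptions, roughly one parameter in seven is active during inference, and the budget works out to less than one byte per parameter. How the weights are actually stored is not stated in the article.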

On the hardware side, the numbers are specific. Testing with an RTX 5090 shows around 700 tokens per second output. With a single Nvidia H100 AI accelerator, DiffusionGemma can produce 1,000+ tokens per second. The “about four times” comparison is to similarly sized autoregressive Gemma models. If you run services where tokens per second translate into throughput, that means the same GPU budget can support more concurrent users, or you can reduce time-to-completion for batch tasks like summarization. Either way, it is a lever that affects unit economics.
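The unit-economics claim can be made concrete with simple arithmetic. The sketch below takes the quoted output rates and the “about four times” ratio at face value to derive an implied autoregressive baseline; the 500-token summary length and the helper names are hypothetical, and “1,000+” is treated as a lower bound of 1,000.

```python
RTX_5090_TPS = 700.0  # tokens per second, as quoted for an RTX 5090
H100_TPS = 1000.0     # "1,000+" tokens per second, treated as a lower bound
SPEEDUP = 4.0         # "about four times" a similarly sized autoregressive model


def implied_baseline_tps(tokens_per_second, speedup=SPEEDUP):
    """Output rate an autoregressive peer would need for the quoted ratio."""
    return tokens_per_second / speedup


def seconds_to_generate(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a steady output rate."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    summary_tokens = 500  # hypothetical length of one batch summary
    for name, tps in [("RTX 5090", RTX_5090_TPS), ("H100", H100_TPS)]:
        base = implied_baseline_tps(tps)
        fast = seconds_to_generate(summary_tokens, tps)
        slow = seconds_to_generate(summary_tokens, base)
        print(f"{name}: {fast:.2f}s at {tps:.0f} tok/s "
              f"vs {slow:.2f}s at implied {base:.0f} tok/s")
```

This is a steady-state estimate only. It ignores prompt processing, batching, and any difference in output quality, none of which the article quantifies.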

Zoom out one step and the strategic context gets interesting. This is another move in the open model arena, where speed and efficiency are increasingly part of the competitive differentiation. Google DeepMind is not just releasing another checkpoint. It is introducing an architectural twist that could reshape expectations for what open text models can do on consumer-class hardware and standard enterprise accelerators.

There is also a regulatory and policy backdrop, even though no specific compliance rules are cited. When more capable open models run locally, organizations can potentially reduce reliance on sending every prompt to remote servers. That can change privacy and governance workflows, including how data retention, access controls, and audit trails are managed. More capable on-prem inference also tends to increase internal adoption pressure, because procurement and security teams are often more comfortable evaluating systems where data handling is under tighter organizational control.

For boards, founders, and investors, the second-order question is not “How fast is DiffusionGemma?” It is “What happens to the market’s cost curve if parallel text generation becomes normal?” If a similar approach spreads across other open model families, performance-per-dollar and performance-per-GPU may rise in ways that disrupt current assumptions about inference pricing, model serving architectures, and the buying behavior of teams that today prioritize autoregressive systems. DiffusionGemma is a reminder that the next big frontier in AI might not be just better quality. It could be faster output, cheaper compute, and more deployment flexibility built into the model’s core design.
