
Google’s New Diffusion Model Could Upend the LLM Speed Race

Image from deepmind.google

What if the next leap in large language models doesn’t come from a bigger model, but from a smarter method? Google DeepMind’s latest research demo hints at that possibility. Their new Gemini Diffusion approach replaces the slow, step-by-step thinking of traditional LLMs with something faster, parallel, and potentially more accurate.

INSIDE THE LAUNCH

Google DeepMind has introduced Gemini Diffusion, a research-stage language model that generates text using a technique borrowed from image generation: diffusion. Unlike traditional autoregressive LLMs such as GPT or earlier Gemini models, which generate text one token at a time, this diffusion-based system starts with random noise and gradually denoises it into coherent sentences.
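
To make the contrast concrete, here is a toy Python sketch of the two decoding styles. Everything in it, from the vocabulary to the “model” that just samples randomly, is invented for illustration; it is not Gemini’s architecture or API.

```python
# Toy contrast between the two decoding styles. Everything here (the
# vocabulary, the "model" that samples randomly) is invented for
# illustration; it is not Gemini's architecture or API.
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]
MASK = "<mask>"

def autoregressive_decode(length: int) -> list[str]:
    """Emit one token at a time, left to right, each waiting on the last."""
    out = []
    for _ in range(length):
        out.append(random.choice(VOCAB))  # stand-in for a next-token model
    return out

def diffusion_decode(length: int, steps: int = 4) -> list[str]:
    """Start from an all-noise block and refine every position in parallel."""
    block = [MASK] * length  # the "random noise" starting point
    for _ in range(steps):
        # A real denoiser repredicts all positions at once each step;
        # here we just fill in a few masked slots per pass.
        block = [
            random.choice(VOCAB) if tok == MASK and random.random() < 0.5 else tok
            for tok in block
        ]
    # Final pass: resolve anything still masked.
    return [random.choice(VOCAB) if tok == MASK else tok for tok in block]

print(autoregressive_decode(7))
print(diffusion_decode(7))
```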

The experimental demo was quietly launched last month as part of Google’s broader AI showcase. While still in testing, Gemini Diffusion is accessible via waitlist, giving early adopters a peek at what might be the next evolution in text generation.

Speed is its standout trait. According to internal tests, Gemini Diffusion can produce between 1,000 and 2,000 tokens per second — significantly faster than Gemini 2.5 Flash’s 272.4 tokens/sec. In side-by-side tests, the demo completed coding tasks and web app builds in as little as two seconds.

So how does it work? During training, the model corrupts sentences with noise and learns to reverse the process. At generation time it doesn’t produce one token after another; instead, it drafts a full block of text in parallel, then refines and edits that output until it stabilizes into a high-quality result.
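
Here is a minimal sketch of that corrupt-then-reconstruct idea, modeled on published discrete text-diffusion recipes. Google hasn’t released Gemini Diffusion’s training details, so read this as an assumption-laden illustration rather than their actual method.

```python
# Minimal sketch of the corrupt-then-reconstruct training objective,
# modeled on published discrete text-diffusion recipes. Google has not
# released Gemini Diffusion's training details, so treat this as an
# assumption, not a description of their method.
import random

MASK = "<mask>"

def corrupt(tokens: list[str], noise_level: float) -> list[str]:
    """Forward process: replace a random fraction of tokens with noise."""
    return [MASK if random.random() < noise_level else t for t in tokens]

def training_example(sentence: list[str]) -> tuple[list[str], list[str], float]:
    """One training pair: the model sees the noisy input and is trained to
    recover the clean target, i.e. to reverse the corruption, at every
    noise level it might encounter during generation."""
    noise_level = random.random()  # sample a corruption strength
    return corrupt(sentence, noise_level), sentence, noise_level

noisy, clean, t = training_example("the cat sat on the mat".split())
print(f"noise level: {t:.2f}")
print("model input :", noisy)
print("train target:", clean)
```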

There are still trade-offs. The first token takes longer to appear, since the model waits until it has refined an entire sequence. There’s also a higher serving cost. But in return, you get faster outputs, better handling of long-range dependencies, and the ability to make holistic edits.
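
A back-of-envelope calculation shows why that trade can still pay off on long outputs. The throughput figures below come from the numbers reported above; the time-to-first-token values are placeholders we picked for illustration, since Google hasn’t published them.

```python
# Back-of-envelope latency comparison. The throughput numbers come from
# the figures reported in this article; the time-to-first-token (TTFT)
# values are placeholders chosen for illustration, since Google hasn't
# published them.

def total_latency(ttft_s: float, tokens: int, tokens_per_s: float) -> float:
    """End-to-end time = waiting for the first token + streaming the rest."""
    return ttft_s + tokens / tokens_per_s

N = 1000  # tokens in the response

autoregressive = total_latency(ttft_s=0.2, tokens=N, tokens_per_s=272.4)
diffusion = total_latency(ttft_s=1.0, tokens=N, tokens_per_s=1500.0)

print(f"autoregressive (Flash-like): {autoregressive:.2f}s")  # ~3.87s
print(f"diffusion-style:             {diffusion:.2f}s")       # ~1.67s
```

In this toy model the parallel decoder finishes a 1,000-token response in under half the time despite its slower start; for very short outputs, the ordering can flip.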


Performance-wise, Gemini Diffusion keeps pace with Google’s other top models. It scored 89.6% on HumanEval, a coding benchmark, and outperformed Flash-Lite on math (23.3% vs. 20.0%) and on MBPP coding tasks. On reasoning and multilingual tests, Flash-Lite still has the edge, but the margin is narrowing.

WHY THIS COULD CHANGE THINGS

If autoregressive models are like writers crafting one word at a time, diffusion models are more like sculptors — shaping a rough block into a polished result. That shift could transform how text generation happens across industries.

For developers, faster output means snappier code suggestions in IDEs or smoother chatbot responses. For content teams, it may cut editing cycles by generating cleaner first drafts. And for edge AI or embedded applications, the ability to tune latency based on task complexity could finally make LLMs practical in constrained environments.

It’s also a leap in reasoning potential. Because diffusion models allow tokens to reference each other across the same generation block, the system can revise earlier parts of a sentence mid-stream — something autoregressive models can’t do easily. That’s a huge win for accuracy in tasks like math, coding, and logical reasoning.
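
As a rough sketch of what that mid-stream revision can look like, here is the remasking trick used by open discrete-diffusion models such as LLaDA. The vocabulary and confidence scores are random stand-ins, not Gemini internals.

```python
# Rough sketch of mid-stream revision via remasking, the trick open
# discrete-diffusion models such as LLaDA use. The vocabulary and the
# confidence scores below are random stand-ins, not Gemini internals.
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def refine_pass(block: list[str], confidence: list[float],
                threshold: float = 0.5) -> list[str]:
    """Re-open any position the model is unsure about, even ones near the
    start of the block, and repredict it (here: a random stand-in)."""
    return [
        random.choice(VOCAB) if c < threshold else tok  # early tokens can change too
        for tok, c in zip(block, confidence)
    ]

block = "the cat sat on a mat .".split()
confidence = [random.random() for _ in block]  # stand-in model confidences
print(refine_pass(block, confidence))
```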

EXPERT INSIGHT

“Diffusion models can produce a sequence of tokens in much less time than autoregressive models,” said Brendan O’Donoghue, a research scientist at DeepMind, in an interview with VentureBeat. “They allow global edits within a block and adapt computation depending on task complexity.”

He also acknowledged a few downsides: “There’s a slightly higher time-to-first-token, and the cost of serving is higher. But for many use cases, the trade-off is worth it.”


GAZEON’S TAKE: WHERE THIS GOES NEXT

Diffusion language models are still early-stage, but they’re advancing fast. With rivals like Mercury from Inception Labs and LLaDA from GSAI entering the scene, diffusion may emerge as a parallel track to autoregressive systems: not a replacement, but a serious alternative.

If Google can scale Gemini Diffusion without performance drops, we could see this architecture baked into future enterprise offerings or even on-device LLMs. The promise of faster, self-correcting outputs is too useful to ignore.

JOIN THE CONVERSATION

Could diffusion finally dethrone autoregression in everyday AI tools? Or will it stay niche? Let us know your take.

About Author:

Eli Grid is a technology journalist covering the intersection of artificial intelligence, policy, and innovation. With a background in computational linguistics and over a decade of experience reporting on AI research and global tech strategy, Eli is known for his investigative features and clear, data-informed analysis. His reporting bridges the gap between technical breakthroughs and their real-world implications, bringing readers timely, insightful stories from the front lines of the AI revolution. Eli’s work has been featured in leading tech outlets and cited by academic and policy institutions worldwide.
