Diffusion LLMs Explained: Mercury and Gemini Diffusion

By Aisha Patel, AI Editorial Desk · July 2, 2026 · 12 min read

Updated July 2, 2026

Quick Answer

Standard LLMs are autoregressive: they generate text one token at a time, left to right. Diffusion LLMs (dLLMs) instead start from noisy or masked tokens and denoise the whole sequence in parallel over a handful of iterations, which is why vendors report throughput several times higher on the same GPUs. Inception Labs Mercury is the first commercial-scale diffusion LLM, and Google demoed Gemini Diffusion at I/O 2025; both trace their lineage to research like LLaDA and earlier diffusion-LM work. The speed and native infilling are compelling for agent loops, coding autocomplete, and serving cost, but the ecosystem is young and quality is uneven on some tasks. This is an editorial synthesis of vendor documentation, public benchmarks, and community reports.

Key Insight

TL;DR

Almost every large language model you have used - GPT-5.1, Claude 4.8, Gemini 3, Llama 4 - is autoregressive: it writes text one token at a time, left to right. Diffusion LLMs (dLLMs) do something different. They start from noise or masked tokens and denoise the entire sequence in parallel over a handful of iterations, coarse-to-fine, the same way image diffusion models turn static into a picture. The payoff vendors report is throughput several times higher on the same GPUs, plus native infilling and revision. Inception Labs Mercury is the first commercial-scale diffusion LLM; Google demoed Gemini Diffusion at I/O 2025. This is an editorial synthesis of vendor documentation, public benchmarks, and community reports; the tradeoffs are real and this technology is still young.

Quick Answer

A diffusion LLM generates text by iteratively denoising a whole sequence in parallel instead of predicting tokens one at a time. Because many positions are refined at once over just a few steps, vendors report throughput on the order of 1,000+ tokens per second on standard GPUs, versus a few hundred for speed-optimized autoregressive models. Diffusion models also edit and infill naturally. The catch in mid-2026 is a younger ecosystem, uneven quality on some tasks, and less mature tooling than the autoregressive stack.

How This Comparison Was Built

This is an editorial explainer, not a lab benchmark. Every throughput figure below is vendor-reported, drawn from Inception Labs and Google DeepMind materials, the Mercury and LLaDA papers on arXiv, and community reporting. We did not run controlled tests. Where a hard number cannot be independently confirmed, we say so and lean on qualitative framing. If you are choosing an architecture for production, verify current numbers against each vendor and, ideally, your own workload.

The Core Idea: Two Ways to Write a Sentence

Imagine writing an essay. The autoregressive way is to write it strictly in order: word one, then word two, each choice locked in before you think about the next. You can never go back and revise word three once you are on word ten. This is exactly how a standard transformer decoder works. At inference time it predicts the next token, appends it, and conditions on the extended sequence to predict the one after. Every token requires its own forward pass through the model (key-value caching avoids recomputing earlier positions, but it does not remove the ordering), and those passes happen strictly one after another. Quality has been excellent, but the sequential dependency is a hard speed ceiling.
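
The loop described above can be sketched in a few lines. This is a toy illustration, not any vendor's decoder: `toy_next_token` is a hypothetical stand-in for a real model's forward pass, and it simply replays a fixed script.

```python
from typing import Callable, List, Tuple


def toy_next_token(prefix: List[str]) -> str:
    """Hypothetical stand-in for one forward pass: returns the next token."""
    script = ["the", "cat", "sat", "<eos>"]
    return script[min(len(prefix), len(script) - 1)]


def generate_autoregressive(
    next_token: Callable[[List[str]], str], max_tokens: int = 16
) -> Tuple[List[str], int]:
    tokens: List[str] = []
    passes = 0
    while len(tokens) < max_tokens:
        tok = next_token(tokens)  # one forward pass per emitted token
        passes += 1
        if tok == "<eos>":
            break
        tokens.append(tok)  # committed here and never revised
    return tokens, passes


tokens, passes = generate_autoregressive(toy_next_token)
print(tokens, passes)
```

The number of passes grows with the number of tokens, and no pass can start before the previous one has finished.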

The diffusion way is different. You start with a blurry, mostly-blank draft - many positions masked or filled with noise - and you sharpen the whole thing at once. Then you sharpen it again. After a few passes, a coherent sequence emerges. Nothing forces you to finalize position three before position ten; the model can revise any position on any step. This is the coarse-to-fine idea that image diffusion models made famous, adapted to discrete text tokens.
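
A minimal sketch of that refinement loop, under loud assumptions: `toy_denoiser` is a hypothetical stub that already knows the answer and uses a made-up confidence table, and the "unmask the most confident positions each step" schedule is one common choice, not necessarily what Mercury or Gemini Diffusion do. Real models can also re-mask and revise tokens, which this toy omits.

```python
from typing import List, Tuple

MASK = "<mask>"
TARGET = ["def", "add", "(", "a", ",", "b", ")", ":"]
# Made-up confidences: structural tokens score higher than identifiers.
CONFIDENCE = [0.90, 0.40, 0.80, 0.30, 0.70, 0.20, 0.85, 0.95]


def toy_denoiser(seq: List[str]) -> List[Tuple[str, float]]:
    """Hypothetical model call: one pass proposes (token, confidence) for every position.

    A real model would condition on `seq`; this stub ignores it.
    """
    return list(zip(TARGET, CONFIDENCE))


def generate_by_denoising(length: int, steps: int) -> Tuple[List[str], int]:
    seq = [MASK] * length
    per_step = -(-length // steps)  # ceiling division: positions finalized per pass
    passes = 0
    while MASK in seq:
        proposals = toy_denoiser(seq)  # a single pass scores all positions
        passes += 1
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:  # keep only the most confident proposals
            seq[i] = proposals[i][0]
    return seq, passes


seq, passes = generate_by_denoising(len(TARGET), steps=4)
print(seq, passes)
```

Eight tokens are produced in four passes, and the order in which positions are filled follows confidence rather than position.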

That single architectural difference - parallel refinement instead of sequential appending - is the root cause of everything else in this article: the speed, the native infilling, and also the current rough edges.

Why Diffusion Is Faster

In an autoregressive model, generating N tokens means roughly N sequential forward passes. You cannot parallelize across the sequence dimension during generation, because token 100 genuinely depends on token 99. Inference providers work heroically around this - techniques like speculative decoding, batching, and custom silicon from specialists we covered in our inference providers comparison - but the fundamental left-to-right dependency remains.

A diffusion model generates all N tokens across a small, roughly fixed number of denoising steps - often far fewer than N. Each step refines many positions in parallel. If you can produce a full block of tokens in, say, a dozen steps instead of hundreds of sequential passes, throughput goes up dramatically. That is the mechanism behind the eye-catching numbers vendors report.
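
As a back-of-envelope check on that mechanism, the sketch below compares pass counts. The numbers are illustrative, and `step_cost_ratio` is an assumption introduced here: a denoising step that processes the whole block may cost more than a single cached decode pass, so the ideal ratio is an upper bound rather than a prediction.

```python
def ideal_pass_ratio(num_tokens: int, denoising_steps: int) -> float:
    """Sequential passes saved if every pass cost the same (it usually does not)."""
    return num_tokens / denoising_steps


def adjusted_speedup(num_tokens: int, denoising_steps: int, step_cost_ratio: float) -> float:
    """Speedup when one denoising step costs `step_cost_ratio` autoregressive passes."""
    return num_tokens / (denoising_steps * step_cost_ratio)


print(ideal_pass_ratio(240, 12))       # 240 tokens in 12 steps -> 20.0
print(adjusted_speedup(240, 12, 2.0))  # same, if each step cost 2x a decode pass -> 10.0
```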

How big are those numbers? Inception reports its Mercury Coder models running on the order of 700 to 1,100 tokens per second on NVIDIA H100 GPUs, and frames diffusion as roughly an order of magnitude faster than speed-optimized frontier models that typically top out in the low hundreds of tokens per second. Google reported Gemini Diffusion generating in the 1,000 to 2,000 tokens per second range when it demoed the model at I/O 2025. Every one of those figures is vendor-reported under vendor conditions - independent, apples-to-apples benchmarking of diffusion LLMs is still immature - so read them as directional, not gospel.

The Other Superpower: Infilling and Revision

Speed gets the headlines, but the ability to edit is arguably just as important. Because a diffusion model refines the whole sequence, it can fill in a gap in the middle of your text conditioned on both the left and right context. Autoregressive models can be coaxed into infilling with special training and tokens, but for diffusion it is native to how the model works.

This matters for coding. Real code editing is rarely append-only - you insert a line, rename a variable, patch a function body between existing code. A model that natively conditions on both sides of the cursor is a better structural fit for inline autocomplete and refactoring than a strictly left-to-right one. It also gives diffusion models a built-in error-correction character: a token chosen early can be revised on a later denoising step, rather than being frozen the moment it is emitted.
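
To make the two-sided conditioning concrete, here is a toy infilling sketch. Everything in it is hypothetical: `build_infill_input` and `toy_fill` are illustrative helpers, and the "model" is a hard-coded rule that reads the name after `return` on the right of the gap, which a strictly left-to-right model would not have seen yet.

```python
from typing import List, Set, Tuple

MASK = "<mask>"


def build_infill_input(prefix: List[str], suffix: List[str], gap: int) -> Tuple[List[str], Set[int]]:
    """Place `gap` masked slots between fixed left and right context."""
    seq = prefix + [MASK] * gap + suffix
    editable = set(range(len(prefix), len(prefix) + gap))
    return seq, editable


def toy_fill(seq: List[str], editable: Set[int]) -> List[str]:
    """Hypothetical denoiser: picks the variable name by looking to the RIGHT of the gap."""
    name = seq[seq.index("return") + 1]
    pattern = [name, "=", "a", "+", "b"]
    filled = list(seq)
    for pos, tok in zip(sorted(editable), pattern):
        filled[pos] = tok  # only masked positions are ever written
    return filled


prefix = ["def", "add", "(", "a", ",", "b", ")", ":"]
suffix = ["return", "total"]
seq, editable = build_infill_input(prefix, suffix, gap=5)
filled = toy_fill(seq, editable)
print(" ".join(filled))
```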

The Products

1. Inception Labs Mercury - Best for the first production-ready diffusion LLM

Best for: teams that want to actually call a diffusion LLM in production today, especially for fast coding and low-latency generation.

Inception Labs, founded by researchers with deep roots in diffusion modeling, shipped Mercury as what it describes as the world's first commercial-scale diffusion large language model. The Mercury Coder line targets code generation and completion, where the combination of high throughput and native infilling is a natural fit. Mercury is available through Inception's own API and has shown up on marketplaces including Amazon Bedrock Marketplace and SageMaker JumpStart, which matters a lot for enterprise adoption. The company raised a reported $50M seed round in late 2025 with backers including Menlo Ventures and prominent AI figures.

Approach: Transformer-parameterized diffusion trained to predict multiple tokens in parallel, using a coarse-to-fine denoising process.
Throughput: vendor-reported roughly 700 to 1,100 tokens per second for Mercury Coder models on NVIDIA H100 GPUs.
Access: Inception API plus marketplace availability (Amazon Bedrock, SageMaker JumpStart) as of mid-2026.
Pricing: vendor-listed around $0.25 per million input tokens and $1.00 per million output tokens for the API at the time of writing - competitive with lower-priced commercial APIs; verify current numbers.
Paper: the Mercury technical report is on arXiv (2506.17298).
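
For budgeting, the listed prices translate into per-request cost as sketched below. The two constants are the vendor-listed figures quoted above at the time of writing and may have changed; the token counts in the example are made up.

```python
INPUT_USD_PER_MILLION = 0.25   # vendor-listed at time of writing; verify before use
OUTPUT_USD_PER_MILLION = 1.00  # vendor-listed at time of writing; verify before use


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API call under simple per-token pricing."""
    total = input_tokens * INPUT_USD_PER_MILLION + output_tokens * OUTPUT_USD_PER_MILLION
    return total / 1_000_000


# Illustrative request: 2,000 prompt tokens and 500 completion tokens.
print(request_cost_usd(2_000, 500))
```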

2. Google Gemini Diffusion - Best signal that a frontier lab is serious about diffusion text

Best for: understanding where a major lab thinks diffusion text generation could go; not a general production dependency yet.

Google DeepMind demoed Gemini Diffusion at Google I/O 2025, and it was one of the sleeper announcements of the event. It applies the same block-parallel denoising idea, and Google reported very high generation speeds alongside surprisingly strong coding and math performance for a first public diffusion effort. The important caveat is maturity: as of mid-2026 Gemini Diffusion has been positioned as experimental or waitlisted rather than a broadly available production model like the mainline Gemini 3 series.

Approach: iterative denoising that generates blocks of tokens in parallel rather than sequentially.
Throughput: vendor-reported roughly 1,000 to 2,000 tokens per second at the I/O 2025 demo.
Access: experimental / limited access as of mid-2026 - check the DeepMind Gemini Diffusion page for current status.

3. LLaDA and the research lineage - Best for understanding where this came from

Best for: engineers and researchers who want the academic grounding, not a product to call.

Diffusion for text did not appear out of nowhere. The line runs through diffusion-LM research and, more recently, masked diffusion models. LLaDA (Large Language Diffusion with mAsking) scaled a masked diffusion model to 8B parameters and reported being competitive with strong autoregressive baselines like Llama 3 8B on in-context learning, while natively handling tasks that trip up left-to-right models - its bidirectional generation famously helps with the reversal curse. It is open research with code, which makes it the best on-ramp for anyone who wants to understand the mechanism from the inside.

Approach: masked diffusion - a forward masking process plus a learned reverse process that predicts masked tokens with a standard Transformer.
Scale: an 8B-parameter model in the reference work, competitive with similar-size autoregressive models on many tasks.
Paper: Large Language Diffusion Models, arXiv 2502.09992.
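
The forward masking process mentioned above can be sketched as follows. This is a simplified reading, assuming each token is masked independently with probability `t`; the function name and the seeded `random.Random` are illustrative choices, and the learned reverse process (a Transformer trained to predict the masked tokens) is not shown.

```python
import random
from typing import List

MASK = "<mask>"


def forward_mask(tokens: List[str], t: float, rng: random.Random) -> List[str]:
    """Mask each token independently with probability t (0 = clean, 1 = fully masked)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)  # seeded so the example is repeatable
sentence = ["diffusion", "models", "refine", "text", "in", "parallel"]
for t in (0.0, 0.5, 1.0):
    print(t, forward_mask(sentence, t, rng))
```

Training pairs a corrupted sequence like these with the original, so the model learns to recover tokens at any masking level.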

Autoregressive vs Diffusion at a Glance

Dimension	Autoregressive (AR)	Diffusion (dLLM)
---	---	---
Generation order	Left-to-right, one token at a time	Whole sequence refined in parallel
Parallelism	Limited across sequence during generation	High - many positions per step
Vendor-reported speed	Hundreds of tokens/sec (speed-optimized)	~1,000+ tokens/sec on standard GPUs
Editing / infilling	Possible but not native	Native - can revise any position
Error correction	Token frozen once emitted	Tokens can be revised across steps
Ecosystem maturity	Very mature (tooling, fine-tuning, providers)	Young, growing quickly
Fine-tuning / control	Well understood	Less mature
Best fit	Broad general use, long-form, reasoning	Latency-bound loops, coding autocomplete, infilling

Why Speed Actually Matters

It is tempting to dismiss raw tokens-per-second as a vanity metric. It is not, for three concrete reasons.

Agents live and die on latency. An agent that makes ten tool calls in a loop pays the model's latency ten times, and each round trip blocks the next. Slow generation compounds. If you have watched an agent crawl through a multi-step plan, you already understand the pain - and much of it traces back to sequential decoding, one of the failure modes we discussed in why AI agents lose context and how to fix it. Cut per-call latency and the whole loop feels different.

Coding autocomplete has a human in the loop. Inline completion is judged in milliseconds. A suggestion that arrives after you have already typed the line is worthless. High throughput plus native infilling is close to an ideal match for the autocomplete use case, which is exactly why the first commercial diffusion products target code.

Speed is cost. On per-token pricing, faster models with efficient decoding can be cheaper to serve. On self-hosted GPUs, higher throughput means more requests per GPU-hour. Either way, the economics of running LLMs at scale reward efficiency - and pairing a fast model with techniques like prompt caching compounds the savings.
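
The latency and cost arguments above reduce to simple arithmetic. The sketch below uses made-up inputs (ten calls, 300 tokens per call, 0.2 s of fixed overhead, a $3.00 GPU-hour) and assumes a single request stream with no batching, so treat the outputs as shapes rather than forecasts.

```python
def agent_loop_seconds(
    calls: int, tokens_per_call: int, tokens_per_second: float, overhead_s: float = 0.0
) -> float:
    """Total wall-clock time when each call blocks the next."""
    return calls * (overhead_s + tokens_per_call / tokens_per_second)


def usd_per_million_tokens(gpu_hour_usd: float, tokens_per_second: float) -> float:
    """Self-hosted serving cost for one stream on one GPU (ignores batching)."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


for label, tps in (("~200 tok/s", 200.0), ("~1,000 tok/s", 1000.0)):
    loop = agent_loop_seconds(calls=10, tokens_per_call=300, tokens_per_second=tps, overhead_s=0.2)
    cost = usd_per_million_tokens(gpu_hour_usd=3.0, tokens_per_second=tps)
    print(f"{label}: loop {loop:.1f} s, ${cost:.2f} per million tokens")
```

Fixed per-call overhead does not shrink with throughput, which is why the loop speedup is smaller than the raw tokens-per-second ratio.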

The Honest Tradeoffs

Diffusion LLMs are genuinely exciting, and they are also genuinely young. A few things to keep level-headed about in mid-2026.

Ecosystem and tooling. The autoregressive world has years of accumulated infrastructure: serving stacks, quantization recipes, evaluation harnesses, and a deep bench of inference providers. Diffusion tooling is catching up fast but is not there yet. If you need a mature stack today, that gap is real.

Quality is uneven. Vendors report strong coding and math results, but diffusion models have not yet accumulated the breadth of public evaluation that autoregressive frontier models have. On some long-form or reasoning-heavy tasks - the domain of the models covered in our reasoning models comparison - the autoregressive incumbents remain the safer bet. Validate on your own workload.

Fine-tuning and control. The recipes for adapting and precisely steering diffusion LLMs are less battle-tested than the autoregressive equivalents documented in our fine-tuning guide. If your product depends on heavy customization, factor that in.

Number hygiene. Every speed figure here is vendor-reported. Benchmarking generation speed fairly is hard - it depends on hardware, batch size, sequence length, and quality thresholds. Do not port a demo number straight into a business case.
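
If you do measure throughput yourself, a small harness like the one below is a reasonable starting point. It is a sketch under assumptions: `generate` is a hypothetical callable wrapping whatever model or API you are testing, and `fake_generate` exists only so the example runs. Fix the hardware, batch size, and sequence length across the systems you compare, and check output quality separately.

```python
import statistics
import time
from typing import Callable, List


def measure_tokens_per_second(
    generate: Callable[[str], List[str]], prompt: str, runs: int = 5
) -> float:
    """Median tokens/sec over several runs of the same prompt."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        rates.append(len(tokens) / elapsed)
    return statistics.median(rates)


def fake_generate(prompt: str) -> List[str]:
    """Placeholder for a real model call; sleeps briefly, then returns 200 tokens."""
    time.sleep(0.01)
    return prompt.split() * 50


rate = measure_tokens_per_second(fake_generate, "a b c d", runs=3)
print(f"{rate:.0f} tokens/sec (fake model)")
```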

Where Diffusion Fits Alongside the Rest of the Stack

The realistic mid-2026 picture is coexistence, not replacement. Diffusion LLMs are a strong fit for latency-bound agent loops, inline coding tools, and infilling-heavy or structured generation. Mature autoregressive models - including the strong open-source options you can self-host - remain the default for the broad set of tasks where ecosystem, quality breadth, and controllability matter most. The most interesting open question is how much of the speed advantage holds as diffusion training and inference mature, and whether hybrid approaches capture the best of both.

Conclusion

Diffusion LLMs are the first architecturally fresh challenge to the autoregressive transformer to reach real products. The mechanism is elegant: refine a whole sequence in parallel instead of appending one token at a time, and you unlock both speed and native editing. Inception Labs Mercury proves it can ship commercially; Google Gemini Diffusion shows a frontier lab is invested; LLaDA and the broader diffusion-LM lineage show it is grounded research, not hype. Keep the tradeoffs honest - young ecosystem, uneven quality, vendor-reported numbers - and diffusion becomes a serious tool to reach for in exactly the places where left-to-right generation hurts most: agents, autocomplete, and cost.

This is an editorial synthesis of vendor documentation, public benchmarks, and community reports; see our [methodology](/methodology). Verify current details with each vendor.

Key Takeaways

Autoregressive models generate left-to-right one token at a time; diffusion models refine an entire sequence in parallel over a few denoising steps, which is the root cause of the speed difference.
Vendors report diffusion LLMs hitting roughly 1,000+ tokens per second on standard NVIDIA GPUs, versus a few hundred for speed-optimized autoregressive models; treat all such figures as vendor-reported.
Inception Labs Mercury is the first commercial-scale diffusion LLM and is available via API and marketplaces like Amazon Bedrock; Google Gemini Diffusion remains experimental as of mid-2026.
Diffusion models are naturally good at infilling and editing because they can revise any position, not just append to the end, which suits code completion and structured output.
The main tradeoffs are ecosystem maturity, uneven quality on some reasoning-heavy tasks, weaker tooling, and less mature fine-tuning and controllability compared with the autoregressive stack.
Speed matters most where latency compounds: agent tool loops, inline coding autocomplete, and any workload where you pay per token or per second of GPU time.
The academic lineage runs through diffusion-LM research and masked diffusion models like LLaDA, so this is a maturing research direction rather than a single vendor gimmick.

Frequently Asked Questions

What is a diffusion LLM?

A diffusion LLM (dLLM) is a language model that generates text by starting from noisy or fully masked tokens and iteratively denoising the whole sequence in parallel, rather than predicting one token at a time from left to right. It borrows the coarse-to-fine idea from image diffusion models and applies it to discrete text tokens.

How is a diffusion LLM different from a normal (autoregressive) LLM?

A standard autoregressive model like GPT-5.1 or Claude 4.8 produces tokens sequentially, each conditioned on all previous tokens. A diffusion model refines many token positions at once across a small number of iterations. That parallelism is why vendors report much higher throughput, and it also lets diffusion models edit or infill any position rather than only appending to the end.

Are diffusion LLMs actually faster?

Vendors report large speedups. Inception reports Mercury Coder models generating on the order of 700 to 1,100 tokens per second on NVIDIA H100 GPUs, and Google reported Gemini Diffusion in the 1,000 to 2,000 tokens per second range at I/O 2025. These are vendor-reported figures under vendor conditions; independent, apples-to-apples benchmarks are still maturing, so treat exact numbers with caution.

Is Mercury or Gemini Diffusion available to use today?

As of mid-2026, Inception Labs Mercury is a commercial product available through its own API and via marketplaces such as Amazon Bedrock and SageMaker JumpStart. Google Gemini Diffusion was demoed at Google I/O 2025 and has been positioned as experimental or waitlisted rather than a broadly available production model. Check each vendor for current status.

What are diffusion LLMs good at?

They shine when speed and infilling matter: inline coding autocomplete, agent loops with many short tool calls, and structured or fill-in-the-blank generation. Because they can revise the middle of a sequence, they handle editing and infilling more naturally than left-to-right models.

What are the downsides of diffusion LLMs?

The ecosystem is young. Tooling, fine-tuning recipes, and provider support lag the mature autoregressive stack. Quality can be uneven on some reasoning-heavy or long-form tasks, and fine-grained controllability is less well understood. For many production workloads, a fast autoregressive model on a specialized inference provider is still the safer default in 2026.

Do diffusion LLMs replace autoregressive models?

Not today. The realistic mid-2026 picture is coexistence: diffusion models for latency-sensitive and infilling-heavy workloads, autoregressive models for the broad set of tasks where the ecosystem, quality, and tooling are more proven. The interesting question is how much of the speed advantage holds up as diffusion training and inference mature.

About the Author

Aisha Patel

AI Editorial Desk · Web3AIBlog

Aisha Patel is a pen name for our AI editorial desk. Posts under this byline are written and reviewed by our team of contributors with backgrounds in machine learning, large language models, AI infrastructure, and applied research. The desk covers frontier model releases, agent architectures, retrieval-augmented generation, on-device inference, and the engineering tradeoffs that matter when shipping AI in production. Every technical claim is verified against primary sources before publication.
