DEER: Draft with Diffusion, Verify with Autoregressive Models
Draft with Diffusion, Verify with Autoregressive Models: Efficient Lossless Acceleration for LLM Reasoning.
Example decoding demo of DEER: drafting with a diffusion language model and verifying with an autoregressive backbone (on an A100 GPU, backbone: Qwen3-30B-A3B).
As large language models are increasingly deployed in long-context reasoning and agentic workflows, decoding latency becomes a dominant bottleneck. Classical speculative decoding accelerates inference by letting a lightweight drafter propose multiple tokens, which are then checked against a stronger autoregressive (AR) model. However, existing approaches typically rely on AR drafters, which suffer from two structural limitations: they decode strictly left-to-right, and their own early mistakes accumulate and gradually erode agreement with the verifier.
DEER replaces the AR drafter with a diffusion language model (dLLM) that generates entire token blocks in one denoising step, naturally avoiding stepwise uncertainty accumulation and unlocking highly parallel drafting. To make such a dLLM compatible with AR-style prefix continuation, we introduce a two-stage Diffusion-to-Autoregressive alignment pipeline: first aligning the dLLM to continuation-style data, and then refining token accuracy near the verification boundary with a position-weighted objective.
Across code generation benchmarks, DEER achieves significantly longer accepted drafts (up to 32 tokens per block) and delivers strong end-to-end speedups. On HumanEval with a Qwen3-30B-A3B backbone, DEER reaches up to 5.54× acceleration while preserving exact output semantics, substantially outperforming prior speculative decoding systems based on AR drafters.
AR drafters generate drafts token by token, conditioning on unverified tokens. Any mismatch with the verifier early in the sequence can be amplified step after step, shrinking the acceptance region and sharply limiting usable draft length.
DEER uses a masked, blockwise dLLM drafter. All candidate tokens in a block are predicted jointly from the same prefix, instead of being chained autoregressively. This breaks the feedback loop where draft errors feed into future draft predictions, leading to much more stable acceptance behavior at larger depths.
Diffusion language models naturally support multi-token, parallel generation: a noisy sequence is iteratively denoised into a clean discrete token block. DEER tailors this capability to speculative decoding, combining one-step denoising with AR verification.
This design shifts most of the compute to a parallelizable drafter while keeping the AR model as a lightweight, exact verifier. The result is a scalable and lossless acceleration scheme that is compatible with modern LLM backbones and standard KV-cache implementations.
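Below is a minimal sketch of this blockwise drafting step, assuming a masked dLLM whose forward pass returns per-position logits; the `mask_id` sentinel, `block_size` argument, and model interface are illustrative placeholders, not DEER's actual API.

```python
import torch

@torch.no_grad()
def draft_block(dllm, prefix_ids, block_size, mask_id):
    """Propose `block_size` draft tokens in a single denoising step.

    All masked positions are filled jointly from the same verified prefix,
    so no draft token conditions on another (unverified) draft token.
    """
    device = prefix_ids.device
    masks = torch.full((1, block_size), mask_id, dtype=torch.long, device=device)
    noisy = torch.cat([prefix_ids, masks], dim=1)        # [prefix | MASK ... MASK]
    logits = dllm(noisy)                                 # one forward pass over the whole block
    draft = logits[:, -block_size:, :].argmax(dim=-1)    # greedy fill of the masked block
    return draft                                         # shape: (1, block_size)
```

An AR drafter would instead fill these positions one call at a time, each call conditioning on its own previous, still-unverified outputs.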
Figure 1: Motivation and uncertainty accumulation concept
DEER consists of a dLLM drafter and an AR verifier bound together by a two-stage Diffusion-to-AR (D2A) alignment procedure, followed by blockwise speculative decoding at inference time.
A pretrained diffusion language model is originally trained to denoise full sequences, not to condition on prefixes. In Stage I, we adapt it to act like an AR continuation model: each training example concatenates the question with a partial answer, inserts a [SEP] token to mark the boundary, and masks the remaining continuation. This teaches the dLLM to view the question plus partial answer as a prefix and to complete the masked part in a way that matches the teacher distribution.
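A hedged sketch of how such a continuation-style example might be assembled (the function name, `sep_id`/`mask_id` handling, and the -100 ignore-label convention are assumptions for illustration, not DEER's released data pipeline):

```python
import random

def make_stage1_example(question_ids, answer_ids, sep_id, mask_id):
    """Build a prefix-continuation sample for Stage I alignment.

    The question plus a random-length partial answer form the visible prefix,
    separated by [SEP]; the remaining answer tokens are masked and serve as
    denoising targets, mimicking AR-style prefix continuation.
    """
    cut = random.randint(0, max(len(answer_ids) - 1, 0))  # keep at least one token to denoise
    prefix = question_ids + [sep_id] + answer_ids[:cut]
    continuation = answer_ids[cut:]
    inputs = prefix + [mask_id] * len(continuation)       # dLLM sees masks after the prefix
    labels = [-100] * len(prefix) + continuation          # loss only on the masked region
    return inputs, labels
```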
Speculative decoding is especially sensitive to the first few tokens after the prefix, where verification begins. Stage II focuses the dLLM's capacity on this region by fine-tuning with a position-weighted objective that emphasizes token accuracy near the verification boundary.
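One plausible form of such a position-weighted objective is sketched below, assuming weights that decay geometrically with distance from the verification boundary; the exact weighting scheme and decay rate are guesses, not DEER's published formula.

```python
import torch
import torch.nn.functional as F

def position_weighted_loss(logits, labels, prefix_len, decay=0.9, ignore_index=-100):
    """Cross-entropy that upweights draft tokens closest to the prefix.

    logits: (1, seq_len, vocab); labels: (1, seq_len). Positions before
    `prefix_len` carry `ignore_index` and contribute no loss.
    """
    seq_len = labels.size(1)
    per_tok = F.cross_entropy(
        logits.squeeze(0), labels.squeeze(0),
        ignore_index=ignore_index, reduction="none",
    )                                                     # (seq_len,)
    offsets = torch.arange(seq_len, device=logits.device) - prefix_len
    weights = decay ** offsets.clamp(min=0).float()       # 1.0 at the boundary, decaying after
    weights = weights * (labels.squeeze(0) != ignore_index)
    return (weights * per_tok).sum() / weights.sum().clamp(min=1e-8)
```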
Together, these two stages yield a drafter that is both globally coherent and locally precise around the verification boundary.
Figure 2: Overview of the DEER training and inference pipeline (Stage I & II alignment plus Stage III speculative decoding)
At inference, given a prefix x_{1:j}, the dLLM proposes a block of k draft tokens in parallel. The AR verifier then walks through the block position by position, accepting each draft token that matches its own greedy prediction and stopping at the first mismatch.
Since all draft tokens are predicted from the same prefix, they do not depend on earlier draft decisions, preventing the kind of cascading divergence that plagues AR drafters.
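A simplified greedy decoding loop in this style is sketched below, reusing the hypothetical `draft_block` helper from the drafting sketch above and assuming `ar_model` returns per-position logits of shape (1, seq, vocab); a real implementation would reuse KV caches and handle EOS, which this illustration omits.

```python
import torch

@torch.no_grad()
def deer_generate(ar_model, dllm, prompt_ids, max_new_tokens, block_size, mask_id):
    """Lossless greedy decoding: the dLLM drafts a block, the AR model verifies it."""
    ids = prompt_ids.clone()
    while ids.size(1) - prompt_ids.size(1) < max_new_tokens:
        # 1) Draft: fill a masked block in one denoising step.
        draft = draft_block(dllm, ids, block_size, mask_id)               # (1, block_size)

        # 2) Verify: one AR forward over prefix + draft gives the greedy token
        #    the AR model would emit at each draft position (plus one extra).
        full = torch.cat([ids, draft], dim=1)
        ar_next = ar_model(full)[:, -block_size - 1:, :].argmax(dim=-1)   # (1, block_size + 1)

        # 3) Accept the longest prefix of the draft that matches the AR choices,
        #    then append one "bonus" token taken from the AR model itself.
        matches = (draft == ar_next[:, :block_size]).squeeze(0).long()
        n_accept = int(matches.cumprod(dim=0).sum())
        bonus = ar_next[:, n_accept:n_accept + 1]
        ids = torch.cat([ids, draft[:, :n_accept], bonus], dim=1)
    return ids[:, : prompt_ids.size(1) + max_new_tokens]
```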
DEER is evaluated on multiple code benchmarks such as HumanEval, MBPP, CodeAlpaca (Python subset), and LiveCodeBench with different Qwen backbones. The AR verifier reuses the original model weights, so solution quality is preserved while decoding becomes faster.
For a Qwen3-30B-A3B backbone at zero temperature, DEER roughly doubles the average number of accepted tokens per cycle compared to strong AR-based drafters, and translates this into substantial end-to-end speedups.
These gains persist across model sizes, indicating that controlling error accumulation is key to high-throughput speculative decoding for modern LLMs.
| Model | Benchmark | Baseline (AR drafter) Speedup | DEER Speedup | Max Accepted Tokens |
|---|---|---|---|---|
| Qwen3-4B | Code (avg.) | ≈2.3× | ≈2.8–3.0× | 32 |
| Qwen3-8B | Code (avg.) | ≈2.4× | ≈3.0×+ | 32 |
| Qwen3-14B | Code (avg.) | ≈2.4× | ≈3.1–3.7× | 32 |
| Qwen3-30B-A3B | HumanEval | ≈2.4× | ≈5.5× | 32 |
Figure 3: Accepted token length across different Qwen backbones
The drafter's distribution is queried only once per block, while the AR model is evaluated for lightweight token-wise verification. Under standard assumptions, this realizes exact decoding while significantly reducing effective per-token latency.
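As a back-of-the-envelope illustration (all numbers below are assumed, not measured): if each cycle issues one drafter call plus one verifier forward and yields the accepted draft tokens plus one bonus token from the verifier, the speedup over plain AR decoding is roughly (accepted + 1) / (draft cost ratio + 1).

```python
def estimated_speedup(avg_accepted, draft_cost_ratio):
    """Rough per-cycle speedup over plain AR decoding.

    avg_accepted: mean number of accepted draft tokens per cycle.
    draft_cost_ratio: cost of one dLLM draft call relative to one AR forward.
    """
    tokens_per_cycle = avg_accepted + 1        # accepted drafts + 1 AR bonus token
    cost_per_cycle = draft_cost_ratio + 1      # one draft call + one verification forward
    return tokens_per_cycle / cost_per_cycle

# Assumed numbers: 8 accepted tokens per cycle and a drafter costing 0.3x an
# AR forward give roughly a 6.9x reduction in AR-equivalent forwards per token.
print(estimated_speedup(avg_accepted=8, draft_cost_ratio=0.3))
```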