DEER: Draft with Diffusion, Verify with Autoregressive Models

Draft with Diffusion, Verify with Autoregressive Models β€” Efficient Lossless Acceleration for LLM Reasoning.

Zicong Cheng1,2,3, Guo-Wei Yang2, Jia Li1, Zhijie Deng3, Meng-Hao Guo1✉, Shi-Min Hu1
1 Tsinghua University
2 Proxseer Inc
3 Shanghai Jiao Tong University
✉ Corresponding author: gmh@tsinghua.edu.cn
Keywords: Speculative Decoding · Diffusion LLM Drafter · Blockwise Parallel Tokens · Lossless Acceleration

🎬 DEER Demo

Example decoding demo of DEER: drafting with a diffusion language model and verifying with an autoregressive backbone (on an A100 GPU; backbone: Qwen3-30B-A3B).

📄 Abstract

As large language models are increasingly deployed in long-context reasoning and agentic workflows, decoding latency becomes a dominant bottleneck. Classical speculative decoding accelerates inference by letting a lightweight drafter propose multiple tokens, which are then checked against a stronger autoregressive (AR) model. However, existing approaches typically rely on AR drafters, which suffer from two structural limitations: they decode strictly left-to-right, and their own early mistakes accumulate and gradually erode agreement with the verifier.

DEER replaces the AR drafter with a diffusion language model (dLLM) that generates entire token blocks in one denoising step, naturally avoiding stepwise uncertainty accumulation and unlocking highly parallel drafting. To make such a dLLM compatible with AR-style prefix continuation, we introduce a two-stage Diffusion-to-Autoregressive alignment pipeline: first aligning the dLLM to continuation-style data, and then refining token accuracy near the verification boundary with a position-weighted objective.

Across code generation benchmarks, DEER achieves significantly longer accepted drafts (up to 32 tokens per block) and delivers strong end-to-end speedups. On HumanEval with a Qwen3-30B-A3B backbone, DEER reaches up to 5.54× acceleration while preserving exact output semantics, substantially outperforming prior speculative decoding systems based on AR drafters.

🚀 Why DEER?

1. No Left-to-Right Uncertainty Accumulation

AR drafters generate drafts token by token, conditioning on unverified tokens. Any mismatch with the verifier early in the sequence can be amplified step after step, shrinking the acceptance region and sharply limiting usable draft length.

DEER uses a masked, blockwise dLLM drafter. All candidate tokens in a block are predicted jointly from the same prefix, instead of being chained autoregressively. This breaks the feedback loop where draft errors feed into future draft predictions, leading to much more stable acceptance behavior at larger depths.

2. Fully Parallel Drafting in Discrete Space

Diffusion language models naturally support multi-token, parallel generation: a noisy sequence is iteratively denoised into a clean discrete token block. DEER tailors this capability to speculative decoding, combining one-step denoising with AR verification.

This design shifts most of the compute to a parallelizable drafter while keeping the AR model as a lightweight, exact verifier. The result is a scalable and lossless acceleration scheme that is compatible with modern LLM backbones and standard KV-cache implementations.

Conceptual comparison between AR-based drafting and DEER's diffusion-based block drafting

Figure 1: Motivation and uncertainty accumulation concept

⚙️ DEER Pipeline

DEER consists of a dLLM drafter and an AR verifier bound together by a two-stage Diffusion-to-AR (D2A) alignment procedure, followed by blockwise speculative decoding at inference time.

Stage I — AR-Style Continuation Distillation

A pretrained diffusion language model is originally trained to denoise full sequences, not to condition on prefixes. In Stage I, we adapt it to act like an AR continuation model:

  • Start from teacher-generated answers from an AR backbone.
  • Randomly truncate each answer, append a special [SEP] token to mark the boundary.
  • Mask the suffix and train the dLLM to reconstruct only the masked continuation.

This teaches the dLLM to view the question plus partial answer as a prefix and to complete the masked part in a way that matches the teacher distribution.
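
As a concrete illustration, a Stage I training example could be assembled roughly as in the sketch below. The function name, the SEP_ID/MASK_ID values, and the -100 ignore-label convention are illustrative assumptions, not the released data pipeline.

import random

SEP_ID = 32000   # hypothetical [SEP] id marking the prefix/continuation boundary
MASK_ID = 32001  # hypothetical [MASK] id used by the masked dLLM

def build_stage1_example(question_ids, teacher_answer_ids):
    """Sketch: truncate a teacher answer at a random point, append [SEP],
    and mask the suffix so the dLLM learns to reconstruct only the
    continuation conditioned on the (question + partial answer) prefix."""
    cut = random.randint(0, len(teacher_answer_ids) - 1)
    prefix = question_ids + teacher_answer_ids[:cut] + [SEP_ID]
    suffix = teacher_answer_ids[cut:]

    input_ids = prefix + [MASK_ID] * len(suffix)   # visible prefix + masked slots
    labels = [-100] * len(prefix) + suffix         # loss only on the masked continuation
    return input_ids, labels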

Stage II — Prefix-Conditioned Accuracy Refinement

Speculative decoding is especially sensitive to the first few tokens after the prefix, where verification begins. Stage II focuses the dLLM's capacity on this region:

  • Mask only the last R tokens of the answer rather than the full suffix.
  • Apply exponentially decaying weights, giving higher emphasis to tokens right after the prefix.
  • Optimize a weighted objective that sharpens local alignment exactly where the verifier operates.

Together, these two stages yield a drafter that is both globally coherent and locally precise around the verification boundary.
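
A position-weighted objective of the kind Stage II describes could look like the following PyTorch sketch; the decay rate gamma and the function signature are illustrative assumptions rather than the paper's exact hyperparameters.

import torch
import torch.nn.functional as F

def stage2_weighted_loss(logits, labels, gamma=0.9):
    """Sketch of a Stage II position-weighted objective.

    logits: (R, V) dLLM predictions for the R masked tail positions
    labels: (R,)  ground-truth token ids at those positions
    gamma:  decay rate (hypothetical value); position i gets weight gamma**i,
            so tokens right after the prefix dominate the loss.
    """
    per_token = F.cross_entropy(logits, labels, reduction="none")    # (R,)
    weights = gamma ** torch.arange(labels.numel(),
                                    dtype=per_token.dtype,
                                    device=per_token.device)         # (R,)
    return (weights * per_token).sum() / weights.sum()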


Figure 2: Overview of the DEER training and inference pipeline (Stage I & II alignment plus Stage III speculative decoding)

Stage III — Blockwise Draft–Verify Inference

At inference, given a prefix x1:j, the dLLM proposes a block of k draft tokens in parallel. The AR verifier then walks through the block position by position:

  1. Compute the acceptance probability from the ratio between AR and draft distributions.
  2. Stochastically accept the draft token or replace it with an AR sample.
  3. Append the chosen token to the prefix and continue to the next position in the block.

Since all draft tokens are predicted from the same prefix, they do not depend on earlier draft decisions, preventing the kind of cascading divergence that plagues AR drafters.
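
Concretely, the per-position decision follows the standard lossless accept-or-resample rule from speculative sampling. The sketch below shows it for a single draft position (the full block loop appears in the inference-algorithm section further down); the function name verify_one_position and the NumPy interface are illustrative.

import numpy as np

def verify_one_position(p_ar, q_draft, draft_token, rng=None):
    """Accept-or-resample step for one draft position (standard lossless rule).

    p_ar:        verifier (AR) probability vector over the vocabulary
    q_draft:     drafter (dLLM) probability vector over the vocabulary
    draft_token: token id proposed by the drafter at this position
    """
    rng = rng or np.random.default_rng()
    # 1) Acceptance probability from the AR-to-draft ratio, capped at 1.
    alpha = min(1.0, p_ar[draft_token] / max(q_draft[draft_token], 1e-12))
    # 2) Stochastically accept the draft token ...
    if rng.random() <= alpha:
        return int(draft_token)
    # ... or replace it with a sample from the residual distribution
    # max(p - q, 0), renormalized, which keeps the output distribution exact.
    residual = np.maximum(p_ar - q_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual))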

📊 Experimental Highlights

Code Generation Benchmarks

DEER is evaluated on multiple code benchmarks such as HumanEval, MBPP, CodeAlpaca (Python subset), and LiveCodeBench with different Qwen backbones. The AR verifier reuses the original model weights, so solution quality is preserved while decoding becomes faster.

For a Qwen3-30B-A3B backbone at zero temperature, DEER roughly doubles the average number of accepted tokens per cycle compared to strong AR-based drafters, and translates this into substantial end-to-end speedups.

Acceptance Length & Speedup

  • Average accepted length: up to ≈5 tokens per speculative step on Qwen3-30B-A3B, notably higher than strong AR drafters.
  • Maximum accepted length: up to 32 consecutive tokens per block, vs. 7–8 tokens for competitive AR-based methods.
  • Speedup: on HumanEval, DEER reaches around 5.54× acceleration with Qwen3-30B-A3B while maintaining lossless decoding.

These gains persist across model sizes, indicating that controlling error accumulation is key to high-throughput speculative decoding for modern LLMs.

| Model | Benchmark | Baseline (AR drafter) Speedup | DEER Speedup | Max Accepted Tokens |
|---|---|---|---|---|
| Qwen3-4B | Code (avg.) | ≈2.3× | ≈2.8–3.0× | 32 |
| Qwen3-8B | Code (avg.) | ≈2.4× | ≈3.0×+ | 32 |
| Qwen3-14B | Code (avg.) | ≈2.4× | ≈3.1–3.7× | 32 |
| Qwen3-30B-A3B | HumanEval | ≈2.4× | ≈5.5× | 32 |

Figure 3: Accepted token length across different Qwen backbones

🧠 DEER Inference Algorithm

# Blockwise speculative decoding with a dLLM drafter
import random

def deer_decode(ar_model, d_dllm, prefix, k, max_length):
    x = list(prefix)
    while len(x) < max_length:
        # 1) draft k tokens in parallel from the dLLM; every position is
        #    conditioned on the same prefix x, and the per-position draft
        #    distributions are returned alongside the tokens
        y_hat, q_block = d_dllm.block_sample(x, k)
        for i, token in enumerate(y_hat):
            # a real implementation scores all k positions with a single AR
            # forward pass; the per-position call keeps the sketch simple
            ar_dist = ar_model.next_token_dist(x)
            draft_dist = q_block[i]
            # 2) compute acceptance probability
            alpha = min(1.0, ar_dist[token] / (draft_dist[token] + 1e-9))
            accepted = random.uniform(0.0, 1.0) <= alpha
            if accepted:
                x.append(token)  # accept draft
            else:
                # reject: replace with a sample from the residual distribution
                x.append(sample_from_residual(ar_dist, draft_dist))
            if x[-1] == "<EOS>":
                return x
            if not accepted:
                # remaining drafts were proposed for the old prefix; discard
                # them and draft a fresh block from the updated prefix
                break
    return x

The drafter's distribution is queried only once per block, while the AR model only has to verify the proposed tokens (in practice, scoring the whole block in a single forward pass). Under standard acceptance-sampling assumptions, this realizes exact decoding while significantly reducing the effective per-token latency.
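
For completeness, the block_sample call used in the sketch above could be realized with one-step denoising roughly as follows. The MASK_ID value and the HuggingFace-style dllm(input_ids).logits interface are assumptions, not the released implementation; the return signature matches the (tokens, per-position distributions) pair assumed in deer_decode.

import torch

MASK_ID = 32001  # hypothetical [MASK] token id of the dLLM tokenizer

@torch.no_grad()
def block_sample(dllm, prefix_ids, k):
    """One-step denoising draft: append k [MASK] slots after the prefix and
    fill all of them with a single forward pass of the diffusion drafter.

    Returns the k drafted token ids plus their per-position probability
    vectors (the draft distributions used during verification)."""
    input_ids = torch.tensor([prefix_ids + [MASK_ID] * k])   # (1, L + k)
    logits = dllm(input_ids).logits                          # (1, L + k, V)
    q = torch.softmax(logits[0, -k:, :], dim=-1)             # (k, V)
    draft_tokens = q.argmax(dim=-1)  # greedy fill; per-position sampling also works
    return draft_tokens.tolist(), q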

📚 BibTeX