DFlash Speculative Decoding Brings up to 15x LLM Inference Boost on NVIDIA Blackwell

Researchers at UC San Diego, in collaboration with NVIDIA, have released DFlash, an open-source block diffusion model for speculative decoding that targets large language model (LLM) inference on NVIDIA Blackwell GPUs. In speculative decoding, a small draft model proposes tokens and the larger target model verifies them. DFlash's block-diffusion drafter generates an entire block of candidate tokens in a single forward pass, turning sequential drafting into parallel GPU work while preserving output quality. The reported gains are up to a 15x throughput improvement for gpt-oss-120b and higher interactivity for other models such as Llama 3.1 8B, with minimal code changes required to integrate with frameworks like vLLM and SGLang.
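To make the draft-and-verify idea concrete, here is a minimal sketch of block-wise speculative decoding with greedy verification. It is not DFlash's implementation or its API: `toy_target`, `toy_drafter`, and `speculative_decode` are hypothetical names, the "models" are simple arithmetic rules, and the toy drafter only stands in for a drafter that would produce the whole block in one forward pass. A real system would also score every drafted position in a single batched pass of the target model, which is what the round counter approximates.

```python
from typing import Callable, List, Tuple

Token = int


def toy_target(context: List[Token]) -> Token:
    """Hypothetical target model: the next token is (last + 1) mod 10."""
    return (context[-1] + 1) % 10


def toy_drafter(context: List[Token], block_size: int) -> List[Token]:
    """Hypothetical block drafter: proposes block_size tokens at once.

    It follows the target's rule but deliberately emits 0 instead of 7,
    so that some drafts are rejected.
    """
    block, last = [], context[-1]
    for _ in range(block_size):
        nxt = (last + 1) % 10
        if nxt == 7:
            nxt = 0  # deliberate drafting error
        block.append(nxt)
        last = nxt
    return block


def greedy_decode(target: Callable, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: Callable,
    drafter: Callable,
    prompt: List[Token],
    block_size: int,
    n: int,
) -> Tuple[List[Token], int]:
    """Draft a block, keep the longest prefix the target agrees with.

    Returns the generated sequence and the number of verification rounds.
    Each round stands in for one batched target pass (an assumption of
    this sketch; here the target is simply called per position).
    """
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < n:
        draft = drafter(tokens, block_size)
        rounds += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch
                break
        else:
            # Every drafted token matched: the target adds one more token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + n], rounds


baseline = greedy_decode(toy_target, [0], 12)
speculative, rounds = speculative_decode(toy_target, toy_drafter, [0], 4, 12)
```

In this toy setup the speculative output matches plain greedy decoding token for token, while the number of verification rounds is smaller than the number of generated tokens. That is the general intuition behind the speedup: accepted draft tokens cost the target model less than one sequential step each.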

Why it matters

Faster inference makes AI applications more responsive and lets the same hardware serve more users at once. Because DFlash works with existing inference frameworks such as vLLM and SGLang, developers can adopt it with minimal code changes to get these gains on NVIDIA Blackwell GPUs.
