DFlash Speculative Decoding Boosts LLM Inference on NVIDIA Blackwell by up to 15x

DFlash is an open-source block-diffusion drafter for speculative decoding that significantly accelerates LLM inference on NVIDIA Blackwell GPUs. Developed by UC San Diego researchers, it generates an entire block of candidate tokens in a single parallel pass, and the larger target model then verifies that block. The reported gains are up to 15x higher throughput for gpt-oss-120b and nearly double the interactivity for Llama 3.1 8B compared with the state-of-the-art EAGLE-3.
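
The draft-then-verify loop described above can be sketched with toy stand-ins. This is not DFlash's code or API: `target_next`, `draft_block`, `verify`, and `speculative_decode` are hypothetical names, the "models" are simple arithmetic rules, and acceptance is assumed to be greedy exact-match. A real drafter would emit the block in one parallel pass and a real target would score all positions in one batched forward pass. The loops here only simulate that.

```python
from typing import List, Tuple


def target_next(context: List[int]) -> int:
    """Toy stand-in for the large target model: greedy next token."""
    return (sum(context) + len(context)) % 7


def draft_block(context: List[int], block_size: int) -> List[int]:
    """Toy stand-in for the drafter: proposes a whole block of tokens.

    It imitates the target but is deliberately wrong whenever the running
    context length is a multiple of 5, so some drafts get rejected.
    """
    ctx = list(context)
    block = []
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 5 == 0:
            tok = (tok + 1) % 7  # injected drafting error
        block.append(tok)
        ctx.append(tok)
    return block


def verify(context: List[int], block: List[int]) -> List[int]:
    """Accept the longest draft prefix the target agrees with.

    Always returns at least one token: the target's correction at the first
    mismatch, or a bonus target token if the whole block is accepted.
    """
    accepted: List[int] = []
    for tok in block:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the rejected draft token
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # bonus token
    return accepted


def speculative_decode(
    prompt: List[int], max_new_tokens: int, block_size: int = 4
) -> Tuple[List[int], int]:
    """Return the generated sequence and the number of verification passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(verify(out, draft_block(out, block_size)))
        passes += 1
    return out[: len(prompt) + max_new_tokens], passes


def greedy_decode(prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


if __name__ == "__main__":
    tokens, passes = speculative_decode([1, 2, 3], 20)
    print(tokens == greedy_decode([1, 2, 3], 20), passes)
```

Under these assumptions the output matches plain greedy decoding token for token, while the target is consulted once per block rather than once per token. The speedup therefore depends on how many drafted tokens are accepted per pass.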

Why it matters

DFlash offers a substantial leap in LLM inference performance on NVIDIA Blackwell GPUs, making large language models more efficient to serve and more responsive in interactive use. That could enable faster and more cost-effective deployment of advanced AI applications.
