DFlash Speculative Decoding Boosts LLM Inference on NVIDIA Blackwell by up to 15x

DFlash is an open-source block-diffusion model for speculative decoding that accelerates LLM inference on NVIDIA Blackwell GPUs. Developed by UC San Diego researchers, it uses a block-diffusion drafter to generate an entire block of candidate tokens in a single parallel pass, which the larger target model then verifies. The reported gains are up to a 15x throughput improvement for gpt-oss-120b and nearly double the interactivity for Llama 3.1 8B compared with EAGLE-3, a state-of-the-art speculative decoding method.
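The draft-then-verify loop described above can be illustrated with a toy sketch. This is not DFlash's code or API: `target_next`, `draft_block`, and `speculative_step` are hypothetical names, the "models" are simple arithmetic functions, and the verification rule shown is plain greedy matching, assumed here for simplicity. It only shows why verifying a drafted block lets the target model emit several tokens per pass without changing its output.

```python
from typing import List, Tuple

Token = int


def target_next(prefix: List[Token]) -> Token:
    """Toy stand-in for the large target model: greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_block(prefix: List[Token], block_size: int) -> List[Token]:
    """Toy stand-in for a block drafter that proposes block_size tokens.

    A real block drafter would produce the block in one parallel pass;
    this toy imitates the target and is deliberately wrong whenever the
    context length is a multiple of 3, so some drafts get rejected.
    """
    ctx = list(prefix)
    block = []
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 3 == 0:
            tok = (tok + 1) % 50  # injected drafter error
        block.append(tok)
        ctx.append(tok)
    return block


def speculative_step(prefix: List[Token], block_size: int) -> List[Token]:
    """Draft a block, then keep the longest prefix the target agrees with.

    In a real system the target scores every drafted position in a single
    forward pass; here that pass is simulated position by position.
    """
    accepted: List[Token] = []
    for tok in draft_block(prefix, block_size):
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction ends the step
            return accepted
        accepted.append(tok)
    # Every drafted token matched, so the target adds one more token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], n_new: int, block_size: int = 4) -> Tuple[List[Token], int]:
    """Generate n_new tokens and count simulated target verification passes."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(out, block_size))
        target_passes += 1
    return out[: len(prompt) + n_new], target_passes


def greedy_baseline(prompt: List[Token], n_new: int) -> List[Token]:
    """Ordinary one-token-per-pass decoding with the target alone."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


if __name__ == "__main__":
    tokens, passes = generate([1, 2, 3], n_new=20)
    print("matches baseline:", tokens == greedy_baseline([1, 2, 3], 20))
    print("target passes used:", passes)
```

Under this greedy acceptance rule the output is identical to decoding with the target alone, and each verification pass yields at least one token, so any speedup comes from how often the drafter's guesses are accepted.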

Why it matters

Faster inference makes large language models cheaper to serve and more responsive to use. If the reported speedups hold in production, DFlash could enable faster and more cost-effective deployment of advanced AI applications, especially on NVIDIA Blackwell GPUs.
