DFlash Accelerates LLM Inference up to 15x on NVIDIA Blackwell GPUs
Researchers at UC San Diego's z-lab released DFlash, a speculative decoding framework that speeds up large language model inference. DFlash uses a block diffusion model as its drafter: it generates an entire block of candidate tokens in a single forward pass, replacing the slower token-by-token autoregressive draft loop. NVIDIA independently confirmed that, on Blackwell hardware, DFlash supports up to 15 times the concurrent user load of standard autoregressive decoding.
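To make the draft-then-verify idea concrete, here is a minimal toy sketch of speculative decoding with a block drafter. It is not DFlash's code or API: `target_next`, `draft_block`, and `speculative_decode` are hypothetical names, the "models" are simple deterministic functions, and the drafter is made wrong on purpose at regular intervals so that rejections occur. The point it illustrates is that the output matches plain greedy decoding while needing fewer verification passes of the large model.

```python
from typing import List, Tuple

Token = int


def target_next(prefix: List[Token]) -> Token:
    """Toy stand-in for the large target model: a deterministic next-token rule."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_block(prefix: List[Token], block_size: int) -> List[Token]:
    """Toy stand-in for a block drafter (hypothetical, not DFlash's model).

    A real block drafter would emit all block_size tokens in one forward pass.
    Here we imitate the target and inject an error whenever the context length
    is 3 mod 4, so that some drafted tokens get rejected.
    """
    ctx = list(prefix)
    block: List[Token] = []
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 4 == 3:
            tok = (tok + 1) % 50  # injected draft error
        block.append(tok)
        ctx.append(tok)
    return block


def greedy_decode(prompt: List[Token], num_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(num_tokens):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_decode(
    prompt: List[Token], num_tokens: int, block_size: int
) -> Tuple[List[Token], int]:
    """Draft a block, keep the longest prefix the target agrees with.

    Returns the generated tokens and the number of verification passes.
    In a real system each pass scores the whole block in parallel; this toy
    simulates that by checking the drafted tokens one by one.
    """
    seq = list(prompt)
    verify_passes = 0
    while len(seq) - len(prompt) < num_tokens:
        block = draft_block(seq, block_size)
        verify_passes += 1
        accepted: List[Token] = []
        for tok in block:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction ends the block
                break
        else:
            # Whole block accepted: the target pass also yields one extra token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    generated = seq[len(prompt):len(prompt) + num_tokens]
    return generated, verify_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, passes = speculative_decode(prompt, num_tokens=20, block_size=4)
    print("matches greedy:", tokens == greedy_decode(prompt, 20))
    print("verification passes:", passes, "vs 20 greedy steps")
```

Because every pass accepts at least one token chosen by the target, the result is identical to greedy decoding; the saving comes from how many drafted tokens survive verification per pass, which in practice depends on how well the drafter matches the target.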
Why it matters
Decoding speed is a major bottleneck in deploying large language models. By drafting a whole block of tokens per forward pass, DFlash lets the same hardware serve more users at once, with the largest reported gains on NVIDIA Blackwell GPUs.