
UC San Diego's DFlash Uses Block Diffusion Drafting to Serve Over 15x More Concurrent Users

Researchers at UC San Diego's z-lab have released DFlash, a speculative decoding method that replaces the traditional autoregressive draft loop with a lightweight block diffusion model. Instead of drafting one token at a time, DFlash proposes an entire block of tokens in a single forward pass, and the target model then verifies that block in parallel. On NVIDIA Blackwell hardware, DFlash sustained more than 15 times the concurrent user load of standard autoregressive decoding. The design targets a long-standing bottleneck in speculative decoding: the drafting step itself runs sequentially, which limited the speedups of earlier methods such as EAGLE-3.
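The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DFlash's implementation: `target_next`, `draft_block`, and `verify_block` are hypothetical stand-ins, the "models" are simple arithmetic functions, and acceptance uses plain greedy matching. The point it shows is that the output stays identical to ordinary greedy decoding while the target model is invoked far fewer times.

```python
from typing import List, Tuple

Token = int
VOCAB = 50  # toy vocabulary size (assumption for this sketch)


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the large target model's greedy next token."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_block(context: List[Token], k: int) -> List[Token]:
    """Hypothetical stand-in for a block drafter that proposes k tokens per call.

    The Python loop only simulates the proposal; a block diffusion drafter
    would emit all k tokens in one forward pass. Every fourth position is
    deliberately wrong so that verification has something to reject.
    """
    ctx = list(context)
    block: List[Token] = []
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 4 == 3:
            tok = (tok + 1) % VOCAB  # injected drafting error
        block.append(tok)
        ctx.append(tok)
    return block


def verify_block(context: List[Token], block: List[Token]) -> List[Token]:
    """Accept the longest prefix of the block that matches the target.

    In a real system the target scores every block position in one parallel
    pass; here that pass is simulated position by position. The first
    mismatch is replaced by the target's own token, and a fully accepted
    block earns one extra target token.
    """
    accepted: List[Token] = []
    for tok in block:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction ends the run
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # bonus token
    return accepted


def speculative_decode(prompt: List[Token], n_new: int, k: int) -> Tuple[List[Token], int]:
    """Generate n_new tokens; return them with the number of target passes used."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        out.extend(verify_block(out, draft_block(out, k)))
        target_passes += 1  # one (simulated) parallel verification per block
    return out[: len(prompt) + n_new], target_passes


def greedy_decode(prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec_tokens, passes = speculative_decode(prompt, n_new=20, k=4)
    assert spec_tokens == greedy_decode(prompt, 20)
    print(f"target passes: {passes} (baseline would use 20)")
```

Because every accepted token is checked against the target's own choice, the speedup comes purely from how many drafted tokens survive each verification pass; a more accurate drafter means longer accepted runs and fewer target passes.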

Why it matters

DFlash substantially improves LLM serving efficiency, particularly for latency-sensitive interactive applications such as coding agents and real-time chat. Because more of each drafted block is accepted (longer "accepted runs"), the larger, slower target model needs fewer forward passes to produce the same output.
