DeepSeek Releases DSpark, Accelerating LLM Inference with Speculative Decoding
DeepSeek has launched DSpark, a speculative decoding framework designed to accelerate per-user generation for its DeepSeek-V4 models. DSpark pairs a parallel draft backbone with a tiny sequential head, and adds a confidence head and a load-aware scheduler so that more draft tokens are verified when GPUs are idle and fewer when they are busy. This serving optimization makes per-user generation 60-85% faster for DeepSeek-V4-Flash and 57-78% faster for V4-Pro relative to their MTP-1 baseline, without compromising output quality. The checkpoints and training code are open source.
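To make the idea concrete, here is a minimal sketch of greedy speculative decoding with a load-dependent draft length. It is not DSpark's code: every name in it (draft_length, speculative_step, the toy models, the linear load policy) is a hypothetical stand-in, the confidence head is left out, and the "models" are simple arithmetic functions. It only illustrates why verifying draft tokens against the target model leaves the output unchanged, and how a scheduler might shrink the draft window as GPU load rises.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to its next token


def draft_length(gpu_load: float, min_k: int = 1, max_k: int = 8) -> int:
    """Hypothetical load-aware policy: long drafts when idle, short when busy."""
    load = min(max(gpu_load, 0.0), 1.0)  # clamp to [0, 1]
    return max(min_k, round(max_k - load * (max_k - min_k)))


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Draft k tokens cheaply, then keep only what the target model agrees with."""
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verification. A real system scores all k positions in one batched
    # forward pass of the target model; this loop is a readable stand-in.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # bonus token when all match
    return accepted


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             n_tokens: int, gpu_load: float) -> List[Token]:
    """Generate n_tokens after prefix using speculative steps."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        k = draft_length(gpu_load)
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prefix) + n_tokens]


def greedy_decode(prefix: List[Token], target: NextToken,
                  n_tokens: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    """Toy stand-in for the large model."""
    return (sum(ctx) + len(ctx)) % 7


def toy_draft(ctx: List[Token]) -> Token:
    """Toy stand-in for the draft model: wrong whenever len(ctx) % 4 == 0."""
    t = toy_target(ctx)
    return t if len(ctx) % 4 else (t + 1) % 7
```

Calling generate([1, 2, 3], toy_draft, toy_target, 12, gpu_load=0.2) returns the same sequence as greedy_decode([1, 2, 3], toy_target, 12), because every emitted token is either confirmed or supplied by the target model. The speedup in a real deployment comes from accepting several draft tokens per target-model pass, which this toy does not measure.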
By speeding up per-user generation without changing output quality, DSpark makes large language models more efficient and cost-effective to serve in production environments.