DeepSeek DSpark: Speculation Is a Scheduling Problem

Last year I built speculative decoding from scratch and wrote about it here. I benchmarked it on a single GPU, got my speedups, and walked away thinking I understood it. Reading DeepSeek’s DSpark paper this week, I realized my mental model had a silent assumption in it: the GPU is idle. That assumption is fine on a test bench and wrong on a fleet, and it turns out to be the difference between spec decoding as a demo and spec decoding as something you’d trust in production.

The internet already has forty posts walking through the V4 model cards, so this won’t be another one. Instead, this post is organized around what you can actually do with DSpark’s two core ideas: how to tell if your serving stack has the problem they solve, whether the technique is worth your time at all, and how to train a drafter for your own model with the DeepSpec code DeepSeek released. My wrong assumption is just the doorway in.

Speculation is a scheduling problem, not a drafting problem

Quick recap of the mechanics. A small draft model proposes a block of tokens, the target model verifies the whole block in one forward pass, and rejection sampling accepts the longest valid prefix plus one bonus token. Lossless by construction: the output distribution matches the target's. Per-token latency is (T_draft + T_verify) / τ, where τ is the number of tokens emitted per cycle, meaning the accepted draft tokens plus the bonus token.
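To make the accept-a-prefix-plus-one rule concrete, here is a minimal sketch of the standard rejection-sampling verify step over toy distributions. The function names and the dict-based distributions are illustrative assumptions, not DeepSpec's code; a real implementation works on logit tensors inside the target's forward pass.

```python
import random


def sample(dist, rng):
    """Draw one token from a {token: probability} dict."""
    r = rng.random()
    acc = 0.0
    for token, prob in dist.items():
        acc += prob
        if r < acc:
            return token
    return token  # guard against floating-point shortfall in the sum


def verify_block(drafted, draft_dists, target_dists, rng):
    """Return the tokens emitted by one draft-then-verify cycle.

    drafted:      the k proposed tokens
    draft_dists:  k draft distributions, one per drafted position
    target_dists: k + 1 target distributions (the last one feeds the bonus token)
    """
    emitted = []
    for i, token in enumerate(drafted):
        p = target_dists[i].get(token, 0.0)  # target probability of the draft
        q = draft_dists[i][token]            # draft probability of the draft
        if rng.random() < min(1.0, p / q):
            emitted.append(token)            # accepted, keep walking the block
            continue
        # Rejected: resample from the normalized residual max(0, p - q),
        # which is what keeps the overall output distribution equal to the target's.
        residual = {
            t: max(0.0, pt - draft_dists[i].get(t, 0.0))
            for t, pt in target_dists[i].items()
        }
        total = sum(residual.values())
        residual = {t: w / total for t, w in residual.items()}
        emitted.append(sample(residual, rng))
        return emitted
    # Every drafted token survived: take the bonus token from the target.
    emitted.append(sample(target_dists[len(drafted)], rng))
    return emitted


rng = random.Random(0)
agree = verify_block(
    ["a", "b"],
    [{"a": 1.0}, {"b": 1.0}],
    [{"a": 1.0}, {"b": 1.0}, {"c": 1.0}],
    rng,
)
disagree = verify_block(["a"], [{"a": 1.0}], [{"b": 1.0}, {"c": 1.0}], rng)
print(agree, disagree)
```

Either way the cycle emits at least one token, which is why τ never drops below 1.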

On an idle GPU, verifying eight tokens costs about the same as verifying one, because decoding is memory bound and the extra tokens ride along in the batch. So you verify everything and it feels free. On a busy fleet it is not free: every speculative token you verify is compute taken from someone else’s request. Verify eight, accept two, and you’ve spent six tokens of a stranger’s latency on nothing.

DSpark’s response is a confidence head that scores each drafted token plus a scheduler that watches machine load. Quiet fleet: verify the full block. Slammed fleet: verify only the confident prefix, discard the tail unverified. Speculation becomes elastic, expanding into spare capacity and contracting out of the way. The shape of one decode cycle:

flowchart LR
    D["Drafter<br/><i>proposes k tokens</i>"] --> CH["Confidence head<br/><i>survival score per position</i>"]
    L["GPU load<br/><i>batch occupancy</i>"] --> S
    CH --> S{"Scheduler<br/><i>how much is worth verifying?</i>"}
    S -->|quiet fleet| VF["Verify all k"]
    S -->|busy fleet| VP["Verify confident prefix only"]
    VF --> T["Target forward pass<br/><i>one batch, one step</i>"]
    VP --> T
    T --> A["Longest accepted prefix<br/>+ 1 bonus token"]
    style S fill:#e76f51,color:#fff
    style A fill:#2d6a4f,color:#fff
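The scheduler box in that diagram can be sketched in a few lines. This is a hypothetical policy, not the one in the paper: it assumes the confidence head emits a per-position survival score in [0, 1], and it raises the cutoff linearly once occupancy passes a knee. The `knee` and `max_threshold` values are placeholders to tune, not numbers from DeepSeek.

```python
def verify_length(survival, occupancy, knee=0.7, max_threshold=0.8):
    """How many drafted tokens to send to the target for verification.

    survival:  per-position survival scores from the confidence head
    occupancy: batch occupancy in [0, 1]
    Assumption: below the knee everything is verified; above it the cutoff
    rises linearly until it reaches max_threshold at full occupancy.
    """
    if not 0.0 <= occupancy <= 1.0:
        raise ValueError("occupancy must be in [0, 1]")
    if occupancy <= knee:
        threshold = 0.0
    else:
        threshold = max_threshold * (occupancy - knee) / (1.0 - knee)
    length = 0
    for score in survival:
        if score < threshold:
            break  # the tail past the first low-confidence token is dropped unverified
        length += 1
    return length


scores = [0.95, 0.85, 0.6, 0.3, 0.2]
for load in (0.3, 0.85, 1.0):
    print(f"occupancy {load:.2f}: verify {verify_length(scores, load)} of {len(scores)}")
```

The same drafted block gets a full verify on a quiet machine and a short one on a slammed machine, which is the elasticity the flowchart describes.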

What this means for you: if you run speculative decoding in a batched serving stack today, acceptance rate alone is not the number to watch. Three numbers together tell you whether you have this problem: accepted length τ per verify cycle, batch occupancy at your p95, and the wasted-verify fraction, which is just (k − (τ − 1)) / k for draft length k, where τ − 1 is the number of drafted tokens that survive once the bonus token is set aside. Your serving stack already exposes the raw material; vLLM, for instance, reports spec-decode acceptance counters you can derive the rest from. Plot wasted verify against occupancy, and if it climbs as the batch fills, you are paying the tax. Measure per-user latency at your p95 load, not at idle. If spec decoding looks great in your benchmarks and mediocre in production, wasted verification under load is the first suspect, and a fixed draft length is probably the cause. Even without adopting DSpark wholesale, making your speculation length load-aware (even crudely, like dropping draft length when batch occupancy crosses a threshold) captures part of the win.
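A rough sketch of that measurement, assuming you can export three things per sampling window: batch occupancy, accepted draft tokens, and verify cycles. Those input names are placeholders rather than real metric names from vLLM or any other server, and the 0.7 threshold in the crude fallback is an arbitrary starting point.

```python
from collections import defaultdict


def tau_from_counters(accepted_draft_tokens, verify_cycles):
    """Tokens emitted per cycle: accepted drafts plus the bonus token."""
    return 1.0 + accepted_draft_tokens / verify_cycles


def wasted_verify_fraction(k, tau):
    """Share of the k verified draft tokens that were rejected."""
    return (k - (tau - 1.0)) / k


def wasted_by_occupancy(samples, k, bucket_width=0.25):
    """Group (occupancy, accepted, cycles) samples into occupancy buckets.

    Returns {bucket lower edge: wasted-verify fraction}, ordered by occupancy.
    """
    totals = defaultdict(lambda: [0, 0])
    last_bucket = int(1 / bucket_width) - 1
    for occupancy, accepted, cycles in samples:
        bucket = min(int(occupancy / bucket_width), last_bucket)
        totals[bucket][0] += accepted
        totals[bucket][1] += cycles
    return {
        round(bucket * bucket_width, 2): wasted_verify_fraction(
            k, tau_from_counters(accepted, cycles)
        )
        for bucket, (accepted, cycles) in sorted(totals.items())
    }


def crude_draft_length(occupancy, k_idle=8, k_busy=3, threshold=0.7):
    """The crude load-aware fallback: shorten the block when the batch is full."""
    return k_busy if occupancy >= threshold else k_idle


# Placeholder samples, not measurements.
samples = [(0.2, 500, 100), (0.9, 200, 100)]
print(wasted_by_occupancy(samples, k=8))
print(crude_draft_length(0.9))
```

If the fraction in the high-occupancy buckets is clearly above the low-occupancy ones, the fixed draft length is the first thing to change.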

Suffix decay, and why one token of context fixes most of it

Parallel drafters like Medusa and DFlash draft the whole block in one shot. Fast, but each position guesses independently, so accuracy rots deeper into the block. Position one is usually right, position five is a coin flip. This suffix decay is what caps τ. Sequential drafters like Eagle3 avoid it by drafting token by token, but they give back draft speed to get there, which is exactly the trade DSpark refuses to make.

DSpark keeps the parallel backbone and adds a tiny sequential head on top. Not an RNN over the whole prefix. A Markov head that conditions on exactly one token, the previous one, through a cheap low-rank projection. That alone holds acceptance steady deep into the block. The authors tried a full RNN head; it barely helped beyond the one-token version, so the one-token version ships.

Since DeepSpec ships all three drafter families side by side, the choice you’ll actually face is which one to train. Here is the trade in one table:

| Drafter | How it drafts | Suffix decay | Draft cost per round | Train it if |
| --- | --- | --- | --- | --- |
| Eagle3 | Sequentially, token by token | None, but drafting itself is the bottleneck | Highest | You need maximum acceptance and your blocks stay short anyway |
| DFlash | Whole block in one parallel pass | Severe; position five is a coin flip | Lowest | You need the cheapest possible draft path and accept short prefixes |
| DSpark | Parallel backbone plus a one-token Markov head | Mostly fixed | DFlash plus a low-rank projection, near-free | Almost always; it is DFlash speed with acceptance that survives deep blocks |

What this means for you: if you use a parallel drafter, suffix decay is your τ ceiling, and the fix is embarrassingly cheap. It also means longer draft blocks are back on the table: with the sequential head, stretching the block from 4 to 16 tokens costs almost nothing per round while meaningfully raising accepted length. If you tuned your draft length down to 3 or 4 because acceptance fell off a cliff, that tuning may be obsolete.
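To see why decay, not block length, is the ceiling, here is a small sketch using the same geometric acceptance model the simulator section of this post describes: position i is accepted with probability p₁·dⁱ. Counting i from 0 is my assumption, and the values of p₁ and d are illustrative, not measured on any drafter.

```python
def expected_tau(p1, d, k):
    """Expected tokens per cycle for a k-token block.

    tau = 1 + sum over positions of the probability that the whole prefix
    up to that position is accepted. Position i (counted from 0, an
    assumption) is accepted with probability p1 * d**i.
    """
    tau = 1.0      # the bonus token is always emitted
    survive = 1.0  # probability that every position so far was accepted
    for i in range(k):
        survive *= p1 * d ** i
        tau += survive
    return tau


for label, d in (("steep decay", 0.70), ("shallow decay", 0.97)):
    short = expected_tau(0.9, d, 4)
    long = expected_tau(0.9, d, 16)
    print(f"{label}: tau(k=4)={short:.2f}  tau(k=16)={long:.2f}  gain={long - short:.2f}")
```

With steep decay the extra twelve positions add almost nothing, so short blocks were the right call. Flatten the decay and the same twelve positions start paying for themselves.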

Should you use DSpark? An honest decision guide

You’re a good candidate if you control your model weights and your serving stack, you serve enough traffic that GPUs sit at meaningful utilization, and a real share of your workload is structured, meaning code, math, extraction, or tool calls. Structured outputs have fewer plausible next tokens, so accepted prefixes run long and the technique shines. Coding agents are the best case.

You’re a poor candidate if you rent inference through someone else’s API, because none of this is a switch you can flip from outside. You’re also a poor candidate on consumer hardware with a single user: speculative decoding needs the draft path to be roughly 10 to 30 times faster than the target, and with a weak speed ratio it can end up slower than plain decoding even though the outputs stay correct.
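The speed-ratio point falls straight out of the latency formula. A back-of-envelope sketch, under two stated assumptions: the drafter runs sequentially, so k draft steps each cost 1/ratio of a target step, and on an idle GPU one verify pass costs about one plain decode step. The τ and k values are placeholders.

```python
def idle_speedup(tau, k, speed_ratio):
    """Speedup over plain decoding on an idle GPU, in units of one target step.

    Assumptions: sequential drafting (k steps at 1 / speed_ratio each) and a
    verify pass that costs the same as one plain decode step.
    """
    if speed_ratio <= 0:
        raise ValueError("speed_ratio must be positive")
    t_draft = k / speed_ratio
    t_verify = 1.0
    # Plain decoding emits 1 token per target step; speculation emits tau
    # tokens per (t_draft + t_verify).
    return tau / (t_draft + t_verify)


for ratio in (2, 5, 10, 30):
    print(f"drafter {ratio}x faster: {idle_speedup(tau=2.5, k=4, speed_ratio=ratio):.2f}x")
```

Anything under 1.0x means the drafter is eating more time than the accepted tokens give back. A parallel drafter changes the arithmetic, since its draft cost does not scale with k in the same way.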

And one warning that doesn’t fit in a headline: drafters do not transfer. A draft head learns the output distribution of one specific target model. The checkpoint trained for stock V4 is the wrong drafter for your fine-tune, and wrong again for the same model in a different reasoning mode.

Can you run it today, without training anything?

Mostly, with early-adopter caveats. If your target model is stock V4, the released DSpark checkpoints load in vLLM and SGLang today; the community 1.5x number in the benchmarks section below came from exactly that path. What you’re getting is early support rather than a turnkey flag: formal DSpark spec-decode integration is still landing in both projects (there are open feature requests tracking it in SGLang and in vllm-ascend), so budget for version pinning and some config friction. Still, running a shipped checkpoint is a far lower bar than training a drafter, so try that first and only reach for DeepSpec when your weights diverge from stock.

How to train a DSpark drafter for your own model with DeepSpec

The genuinely useful part of this release is DeepSpec, the MIT-licensed codebase DeepSeek published for training and evaluating draft models. It ships implementations of DSpark, DFlash, and Eagle3, with released checkpoints targeting Qwen3 at 4B, 8B, and 14B plus Gemma-4-12B, and evaluation sets spanning math, code, and chat.

The pipeline, end to end:

flowchart LR
    P["Prompts<br/><i>from your real traffic</i>"] --> R["Regenerate answers<br/><i>with your target model</i>"]
    R --> C["Target cache<br/><i>~38 TB for Qwen3-4B</i>"]
    C --> TR["Train draft head<br/><i>target frozen, reuses its<br/>embedding and output layers</i>"]
    TR --> E{"Accepted length<br/><i>on held-out math, code, chat</i>"}
    E -->|holds up| DEP["Wire into serving stack"]
    E -->|falls short| P
    style C fill:#e76f51,color:#fff
    style DEP fill:#2d6a4f,color:#fff

The workflow is refreshingly honest about the drafter-target coupling, and it is concrete enough to walk through for real. Below is the actual run for the repo’s default target, Qwen3-4B. Everywhere Qwen/Qwen3-4B appears, your own fine-tune drops in; the whole point of the pipeline is that the drafter learns your model’s distribution, so there is no shortcut around regenerating the data yourself.

The full run for Qwen3-4B, command by command

Step 0: environment. Clone the repo and install dependencies. The default configs and scripts assume a single node with 8 GPUs.

git clone https://github.com/deepseek-ai/DeepSpec.git
cd DeepSpec
python -m pip install -r requirements.txt

Step 1: get prompts. The released checkpoints were trained on the open-perfectblend dataset, and the repo script downloads and splits it. This is the step to be opinionated about: if you serve real traffic, prompts sampled from your own logs beat a public blend, because acceptance lives and dies on distribution match.

python scripts/data/download_and_split.py \
    --dataset-name mlabonne/open-perfectblend \
    --test-size 0.05 \
    --train-output-path train_datasets/perfectblend_train.jsonl \
    --test-output-dir eval_datasets \
    --skip-existing

Step 2: regenerate the answers with your target model. Not the dataset’s answers. Yours. DeepSpec does this by standing up SGLang servers, one per GPU, and streaming the prompts through:

bash scripts/data/launch_sglang_server.sh

python scripts/data/generate_train_data.py \
    --model Qwen/Qwen3-4B \
    --server-address \
        127.0.0.1:30000 127.0.0.1:30001 127.0.0.1:30002 127.0.0.1:30003 \
        127.0.0.1:30004 127.0.0.1:30005 127.0.0.1:30006 127.0.0.1:30007 \
    --concurrency 32 \
    --temperature 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
    --max-tokens 4096 \
    --disable-thinking \
    --resume \
    --input-file-path train_datasets/perfectblend_train.jsonl \
    --output-file-path train_datasets/qwen3_4b/perfectblend_train_regen.jsonl

Note --disable-thinking: every released checkpoint was trained on non-thinking outputs. If you deploy a thinking mode, generate this data in thinking mode instead; the distributions differ enough to cost you accepted length.

Step 3: build the target cache. This runs the frozen target over the regenerated data and stores what the draft head will train against. It is also where the storage bill lands: roughly 38 TB for this default setup.

python scripts/data/prepare_target_cache.py \
    --config config/dspark/dspark_qwen3_4b.py \
    --train-data-path train_datasets/qwen3_4b/perfectblend_train_regen.jsonl \
    --output-dir ${HOME}/.cache/deepspec/qwen3_4b_target_cache \
    --local-batch-size 16

Step 4: train. One worker per GPU; the config is picked via config_path (the DSpark configs live under config/dspark/) and individual fields can be overridden with --opts. The head itself is small because it reuses the target’s embedding and output layers; the compute already went into the cache.

bash scripts/train/train.sh

Checkpoints land in ~/checkpoints/<project_name>/<exp_name>/step_*.

Step 5: measure accepted length before touching your serving stack. Set target_name_or_path and draft_name_or_path (your new checkpoint, or a released one as a baseline) in the eval script and run it over the held-out sets, which span gsm8k, math500, aime25, humaneval, mbpp, livecodebench, mt-bench, alpaca, and arena-hard-v2:

bash scripts/eval/eval.sh

Two sanity checks worth doing once the eval runs. Run the released DSpark checkpoint for your base model through the same eval to calibrate what good looks like. And compare your accepted length on the code sets against the chat sets: if code is not clearly ahead, your training data does not match your traffic, and the problem is in steps 1 and 2, not in training.
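The second check is easy to automate once you have per-dataset accepted lengths in hand. A hedged sketch: the input dict is something you would fill in by hand from the eval output (I am not assuming any particular output format from the eval script), the grouping of the held-out sets into math, code, and chat is my reading of their names, and the `margin` is an arbitrary placeholder.

```python
from statistics import mean

# Grouping of the held-out sets by workload family (an assumption).
GROUPS = {
    "math": ["gsm8k", "math500", "aime25"],
    "code": ["humaneval", "mbpp", "livecodebench"],
    "chat": ["mt-bench", "alpaca", "arena-hard-v2"],
}


def group_means(accepted_length):
    """Average accepted length per workload family, skipping missing sets."""
    means = {}
    for family, names in GROUPS.items():
        present = [accepted_length[name] for name in names if name in accepted_length]
        if present:
            means[family] = mean(present)
    return means


def code_leads_chat(accepted_length, margin=0.25):
    """True if code beats chat by at least `margin` accepted tokens."""
    means = group_means(accepted_length)
    return means["code"] - means["chat"] >= margin


# Placeholder numbers for illustration only, not results.
example = {
    "humaneval": 4.0, "mbpp": 4.5, "livecodebench": 3.5,
    "mt-bench": 3.0, "alpaca": 3.0, "arena-hard-v2": 3.0,
}
print(group_means(example), code_leads_chat(example))
```

Run it once on the released checkpoint and once on yours; the gap between the two is more informative than either number alone.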

Budget warning, restated as a summary: the expensive step is the data, not the training. If you control your weights, this is still a weekend-to-a-week project rather than a research program, but it is not a laptop project.

DSpark benchmarks, for the record

I’ve argued the numbers aren’t the story, but you should have them in one place. Offline, DSpark’s accepted length beats Eagle3 by roughly 27 to 31 percent and DFlash by 16 to 18 percent across the three Qwen3 sizes, and a 2-layer DSpark drafter outperforms a 5-layer DFlash. Growing the draft block from 4 to 16 tokens costs only 0.2 to 1.3 percent extra per-round latency. In production, against DeepSeek’s previous single-token MTP-1 baseline at matched capacity, per-user generation runs 60 to 85 percent faster on V4-Flash and 57 to 78 percent faster on V4-Pro, with aggregate throughput up 51 to 52 percent at fixed per-user speed targets. The shipped config is DSpark-5, five-token blocks with the Markov head.

Two calibration notes. The roughly 400 and 660 percent figures you’ll see elsewhere (406 percent on V4-Pro at a 50 token-per-second per-user target, 661 percent on V4-Flash at 120) are real but describe throughput at speed targets strict enough that the single-token baseline nearly collapses, so the ratio balloons. They mark the corner of the serving frontier, not the typical case. And the baseline matters: an early community vLLM benchmark on V4-Flash landed around 1.5x over MTP-1 and 2.3x over plain decoding, which is the kind of number you should expect to reproduce, not the headline.

Build intuition before you build anything

Before you spend a week on DeepSpec, spend a minute on the simulator below. It runs three decoding policies against the same GPU budget: plain decoding, fixed-length speculation, and DSpark’s confidence-scheduled version.

Here is the experiment to run. Set the workload to code, then drag concurrent requests up and watch the frontier chart. At some point the fixed-length speculation curve crosses below plain decoding. That crossover is the entire argument of this post in one pixel: past saturation, a fixed draft length keeps paying to verify tokens that were never going to be accepted, and speculation becomes a tax on every other request. DSpark’s curve doesn’t cross, because the scheduler shrinks its verify window as utilization climbs. You can watch it happen in the lanes above the chart, where the scheduled policy starts tearing the tail off its blocks while the fixed policy keeps burning red.

Then switch the workload to chat and see how much earlier everything degrades when acceptance decays fast, and toggle the Markov head to see why one token of context moves the whole frontier. The sliders map directly onto your deployment questions: how accurate is your drafter, how long can your blocks be, and how busy are your machines.

Bet 02 / Scheduled speculation

Speculation is free until the GPU is busy.

On an idle GPU, verifying a block of drafted tokens rides along with the batch, so fixed-length speculation looks great on a test bench. On a shared fleet, every speculative token you verify is compute taken from someone else's request. Three policies share the same capacity here: plain decoding, fixed-length speculation, and a DSpark-style scheduler that shrinks its verify window as utilization climbs. Drag concurrency up and watch the fixed policy cross below plain.

[Interactive simulator. Lanes: plain, fixed k, dspark. Legend: accepted; verified but rejected (wasted); drafted, never verified; bonus token. Verdict at N: fixed spec vs plain, DSpark vs plain, scheduler verifies. Frontier chart: plain decoding, fixed spec, DSpark scheduled, crossover. Table columns: policy, verify, τ / cycle, per-user speed, vs plain, wasted verify.]

How this is computed

A toy capacity model: a decode step costs max(1, total tokens / capacity) time units, so batched verify tokens ride free below saturation and stretch everyone's step time above it. Draft position i is accepted with probability p₁·dⁱ (workload sets p₁ and decay d; the Markov head raises d). Tokens per cycle τ = 1 + Σ qᵢ over the verified prefix, where qᵢ is the cumulative acceptance probability; per-user speed is τ / step.

  • Plain decodes one token per step. Fixed always verifies the full block, so past saturation it pays full price for tokens that were never going to be accepted and drops below plain.
  • DSpark verifies only the confident prefix: the threshold tightens as utilization crosses ~70%, solved here as a small fixed point. Speculation expands into spare capacity and contracts out of the way.
  • Constants are illustrative, not fitted to the paper. The crossover, and the scheduler refusing to cross, are the real phenomenon.
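If you would rather read the toy than drag its sliders, the capacity model above fits in a few lines. This is a sketch under stated assumptions, not the simulator's source: positions count from 0, each request adds 1 + verified tokens to the batched step, and the scheduler is replaced by a stand-in that picks whichever verify length maximizes per-user speed (the simulator itself uses a threshold solved as a fixed point). All constants are illustrative.

```python
def tokens_per_cycle(p1, d, verified):
    """tau = 1 + sum of cumulative acceptance over the verified prefix."""
    tau, survive = 1.0, 1.0
    for i in range(verified):
        survive *= p1 * d ** i  # assumption: first position has i = 0
        tau += survive
    return tau


def step_time(requests, tokens_per_request, capacity):
    """A decode step costs max(1, total tokens / capacity) time units."""
    return max(1.0, requests * tokens_per_request / capacity)


def per_user_speed(requests, verified, p1, d, capacity):
    # Assumption: every request contributes 1 + verified tokens to the step.
    tau = tokens_per_cycle(p1, d, verified)
    return tau / step_time(requests, 1 + verified, capacity)


def compare(requests, k=8, p1=0.9, d=0.9, capacity=256):
    """Per-user speed for plain, fixed-k, and a load-aware stand-in policy."""
    plain = per_user_speed(requests, 0, p1, d, capacity)
    fixed = per_user_speed(requests, k, p1, d, capacity)
    # Stand-in scheduler: best common verify length for the current load.
    best = max(range(k + 1), key=lambda v: per_user_speed(requests, v, p1, d, capacity))
    scheduled = per_user_speed(requests, best, p1, d, capacity)
    return {"plain": plain, "fixed": fixed, "scheduled": scheduled, "verify": best}


for n in (16, 64, 128, 256, 512):
    r = compare(n)
    print(
        f"N={n:>3}  plain={r['plain']:.2f}  fixed={r['fixed']:.2f}  "
        f"scheduled={r['scheduled']:.2f}  verify={r['verify']}"
    )
```

Because verifying zero tokens is always one of the options, the stand-in can never do worse than plain decoding, which is the "refuses to cross" behavior in miniature.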

The mechanism is from DeepSeek's DSpark / DeepSpec release: a parallel drafter with a one-token Markov head against suffix decay, plus confidence-scheduled verification that adapts to fleet load.

(Standalone and embeddable version: Speculation under load.)

The constants in the toy are illustrative, not fitted to the paper. The dynamics are the real thing.

Cite this article

Mitra, Subhadip. (2026, June). DeepSeek DSpark: Speculation Is a Scheduling Problem. Subhadip Mitra. Retrieved from https://subhadipmitra.com/blog/2026/deepseek-dspark-speculative-decoding-production/

@article{mitra2026deepseek-dspark-speculation-is-a-scheduling-problem,
  title   = {DeepSeek DSpark: Speculation Is a Scheduling Problem},
  author  = {Mitra, Subhadip},
  journal = {Subhadip Mitra},
  year    = {2026},
  month   = {Jun},
  url     = {https://subhadipmitra.com/blog/2026/deepseek-dspark-speculative-decoding-production/}
}