Speculation under load
Speculative decoding looks free on an idle GPU and becomes a tax on a busy fleet. Three policies share one capacity budget: plain decoding, fixed-length speculation, and a DSpark-style scheduler that shrinks its verify window as utilization climbs.
Sources: DSpark in production · DSpark paper · Code
Speculation is free until the GPU is busy.
On an idle GPU, verifying a block of drafted tokens rides along with the batch, so fixed-length speculation looks great on a test bench. On a shared fleet, every speculative token you verify is compute taken from someone else's request. Three policies share the same capacity here: plain decoding, fixed-length speculation, and a DSpark-style scheduler that shrinks its verify window as utilization climbs. As concurrency rises past saturation, the fixed policy's per-user speed crosses below plain decoding.
| policy | verify | τ / cycle | per-user speed | vs plain | wasted verify |
|---|---|---|---|---|---|
How this is computed
A toy capacity model: a decode step costs max(1, total tokens / capacity) time units, so batched verify tokens ride free below saturation and stretch everyone's step time above it. Draft position i is accepted with probability p₁·dⁱ (the workload sets p₁ and the decay d; the Markov head raises d). Tokens per cycle is τ = 1 + Σ qᵢ over the verified prefix, where qᵢ is the cumulative acceptance probability, that is, the chance that every draft up to and including position i is accepted. Per-user speed is τ divided by the step time.
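The capacity model above is small enough to sketch in Python. This is an illustration under stated assumptions, not the page's own code: the function names are hypothetical, draft positions are counted from 0 so that the first draft is accepted with probability p₁, and each user is assumed to add 1 + k tokens to the batch when verifying k drafts.

```python
def step_time(total_tokens: float, capacity: float) -> float:
    """Cost of one decode step: flat below saturation, linear above it."""
    return max(1.0, total_tokens / capacity)


def cumulative_acceptance(p1: float, d: float, k: int) -> list[float]:
    """q_i for i = 0..k-1: probability that drafts 0..i are all accepted.

    Assumption: position i is accepted with probability p1 * d**i,
    with positions counted from 0.
    """
    qs = []
    q = 1.0
    for i in range(k):
        q *= p1 * d**i
        qs.append(q)
    return qs


def tokens_per_cycle(p1: float, d: float, k: int) -> float:
    """tau = 1 + sum of q_i over the verified prefix of length k."""
    return 1.0 + sum(cumulative_acceptance(p1, d, k))


def per_user_speed(users: int, k: int, p1: float, d: float, capacity: float) -> float:
    """Expected tokens per time unit for one user with verify window k.

    Assumption: every user contributes 1 + k tokens to the batch.
    """
    total_tokens = users * (1 + k)
    return tokens_per_cycle(p1, d, k) / step_time(total_tokens, capacity)


if __name__ == "__main__":
    # Illustrative constants only (not taken from the paper).
    for users in (10, 100):
        plain = per_user_speed(users, 0, 0.8, 0.9, 100)
        fixed = per_user_speed(users, 4, 0.8, 0.9, 100)
        print(users, round(plain, 2), round(fixed, 2))
```

With these made-up constants, a fixed 4-token window yields roughly 2.97 tokens per time unit at 10 users, where plain decoding yields 1.0. At 100 users the same window drops to roughly 0.59 while plain decoding stays at 1.0, which is the crossover the text describes.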
- Plain decodes one token per step. Fixed always verifies the full block, so past saturation it pays full price for tokens that were never going to be accepted and drops below plain.
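- DSpark verifies only the confident prefix: the confidence threshold tightens as utilization crosses ~70%. Because the verify window sets utilization and utilization sets the threshold that picks the window, the window is solved here as a small fixed point. Speculation expands into spare capacity and contracts out of the way.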
- Constants are illustrative, not fitted to the paper. The crossover, and the scheduler refusing to cross, are the real phenomenon.
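The scheduled policy in the list above can be sketched the same way. Everything specific here is an assumption made for illustration: the `threshold` schedule (flat up to a 70% knee, then linear up to 1.0 at full utilization), the `base` value, the 0-based draft positions, and the choice to resolve the fixed point by taking the largest self-consistent window. The real DSpark scheduler may differ on all of these.

```python
def cumulative_acceptance(p1: float, d: float, k: int) -> list[float]:
    """q_i for i = 0..k-1, assuming position i is accepted w.p. p1 * d**i."""
    qs = []
    q = 1.0
    for i in range(k):
        q *= p1 * d**i
        qs.append(q)
    return qs


def threshold(utilization: float, base: float = 0.2, knee: float = 0.7) -> float:
    """Hypothetical confidence schedule.

    Flat at `base` below the knee, then rising linearly to 1.0 at
    full utilization, and clamped at 1.0 beyond it.
    """
    ramp = min(1.0, max(0.0, (utilization - knee) / (1.0 - knee)))
    return base + (1.0 - base) * ramp


def scheduled_window(users: int, block: int, p1: float, d: float,
                     capacity: float) -> int:
    """Largest window k that is consistent with the load it creates.

    A window k is self-consistent when the last verified draft still
    clears the threshold at the utilization that window produces.
    q_i falls with i and the threshold rises with k, so the search can
    stop at the first window that fails.
    """
    qs = cumulative_acceptance(p1, d, block)
    best = 0
    for k in range(1, block + 1):
        utilization = users * (1 + k) / capacity
        if qs[k - 1] >= threshold(utilization):
            best = k
        else:
            break
    return best


if __name__ == "__main__":
    # Illustrative constants only (not taken from the paper).
    for users in (10, 30, 100):
        print(users, scheduled_window(users, 4, 0.8, 0.9, 100))
```

In this sketch the threshold reaches 1.0 at full utilization, and no cumulative acceptance probability can clear it, so the window falls to zero before speculation can push the step past saturation. The policy then reduces to plain decoding, which is one way to get the "refusing to cross" behaviour.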
The mechanism is from DeepSeek's DSpark / DeepSpec release: a parallel drafter with a one-token Markov head to counter suffix decay, plus confidence-scheduled verification that adapts to fleet load.