Speculation under load

Sources: DSpark in production · DSpark paper · Code

Bet 02 / Scheduled speculation

Speculation is free until the GPU is busy.

On an idle GPU, verifying a block of drafted tokens rides along with the batch, so fixed-length speculation looks great on a test bench. On a shared fleet, every speculative token you verify is compute taken from someone else's request. Three policies share the same capacity here: plain decoding, fixed-length speculation, and a DSpark-style scheduler that shrinks its verify window as utilization climbs. Drag concurrency up and watch the fixed policy cross below plain.

Workload: Structured output (code, math) has few plausible next tokens, so acceptance stays high deep into the block. Open-ended chat decays fast.

Markov head: DSpark's fix for suffix decay: a tiny sequential head conditioning on one previous token. It slows acceptance decay, so longer blocks stay useful.

Draft block k: Tokens the drafter proposes per cycle. The scheduler may verify fewer than k when the fleet is busy.

Concurrent requests N: How many requests share the GPU. This is the axis your benchmarks probably held at 1.

GPU capacity: Tokens per decode step before step time stretches. Below this, batched verify tokens ride free; above it, speculation costs everyone.

[Interactive chart and readouts. Policies shown: plain decoding, fixed spec (fixed k), DSpark scheduled, with the crossover marked. Token legend: accepted; verified but rejected (wasted); drafted, never verified; bonus token. Readouts at the selected N: verdict, fixed spec vs plain (×), DSpark vs plain (×), and how many tokens the scheduler verifies. Table columns: policy, verify, τ / cycle, per-user speed, vs plain, wasted verify.]

How this is computed

A toy capacity model: a decode step costs max(1, total tokens / capacity) time units, so batched verify tokens ride free below saturation and stretch everyone's step time above it. Draft position i is accepted with probability p₁·dⁱ (workload sets p₁ and decay d; the Markov head raises d). Tokens per cycle τ = 1 + Σ qᵢ over the verified prefix, where qᵢ is the cumulative acceptance probability, that is, the probability that every draft position up to and including i is accepted; per-user speed is τ divided by the step time.
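The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the widget's actual code. It makes three assumptions the page does not state: the first draft position is indexed i = 0 so that its acceptance probability is p₁, each request contributes 1 + (verified tokens) to the step's token count, and drafting itself costs nothing. All function names are hypothetical.

```python
def step_time(total_tokens: float, capacity: float) -> float:
    """Cost of one decode step: flat at 1 until the batch exceeds capacity."""
    return max(1.0, total_tokens / capacity)


def tokens_per_cycle(p1: float, d: float, verified: int) -> float:
    """tau = 1 + sum of cumulative acceptance probabilities q_i.

    Assumption: draft positions are indexed from 0, so position i is
    accepted with probability p1 * d**i and the first position gets p1.
    """
    tau = 1.0  # the token every cycle yields even if all drafts are rejected
    q = 1.0    # probability that all positions so far were accepted
    for i in range(verified):
        q *= p1 * d**i
        tau += q
    return tau


def per_user_speed(p1: float, d: float, verified: int,
                   n_requests: int, capacity: float) -> float:
    """Tokens per unit time seen by one request.

    Assumption: each request adds 1 + verified tokens to the step,
    and the drafter's own cost is ignored.
    """
    total = n_requests * (1 + verified)
    return tokens_per_cycle(p1, d, verified) / step_time(total, capacity)
```

Setting `verified=0` reproduces plain decoding: τ is 1, so the speed is simply 1 / step time.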

Plain decodes one token per step. Fixed always verifies the full block, so past saturation it pays full price for tokens that were never going to be accepted and drops below plain.
DSpark verifies only the confident prefix: the threshold tightens as utilization crosses ~70%, solved here as a small fixed point. Speculation expands into spare capacity and contracts out of the way.
Constants are illustrative, not fitted to the paper. The crossover, and the scheduler refusing to cross, are the real phenomenon.
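The three policies can be compared with a short Python sketch. The scheduler here is one guess at the idea, not the page's or the paper's implementation: the threshold shape, its 0.1 floor, the linear ramp from 70% to 100% utilization, and the downward scan used to find a self-consistent window are all assumptions, as are the function names and the 0-based indexing of draft positions.

```python
def step_time(total_tokens, capacity):
    return max(1.0, total_tokens / capacity)


def cumulative_accept(p1, d, k):
    """q_i for i = 0..k-1, with position i accepted at p1 * d**i (assumed indexing)."""
    q, out = 1.0, []
    for i in range(k):
        q *= p1 * d**i
        out.append(q)
    return out


def speed(q, v, n, capacity):
    """Per-user speed when each of n requests verifies the first v drafts."""
    tau = 1.0 + sum(q[:v])
    return tau / step_time(n * (1 + v), capacity)


def threshold(util, base=0.1, knee=0.7):
    """Hypothetical confidence bar: flat below the knee, ramping to 1.0 at full load."""
    if util <= knee:
        return base
    return min(1.0, base + (1.0 - base) * (util - knee) / (1.0 - knee))


def scheduled_window(q, n, capacity):
    """Largest window v whose last position still clears the bar at the
    utilization that v itself would cause (a small fixed point)."""
    for v in range(len(q), 0, -1):
        util = n * (1 + v) / capacity
        if q[v - 1] > threshold(util):
            return v
    return 0  # fleet too busy: fall back to plain decoding


def compare(p1, d, k, n, capacity):
    q = cumulative_accept(p1, d, k)
    v = scheduled_window(q, n, capacity)
    return {
        "plain": speed(q, 0, n, capacity),
        "fixed": speed(q, k, n, capacity),
        "dspark": speed(q, v, n, capacity),
        "window": v,
    }
```

With illustrative inputs `p1=0.8, d=0.9, k=4, capacity=100`, calling `compare` at `n=4` gives the scheduled policy the full window, matching fixed speculation. At `n=80` fixed speculation falls below plain, while the scheduled window shrinks to 0 and matches plain. In this sketch a positive window is only chosen when utilization stays under 1, which is why the scheduled policy never drops below plain.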

The mechanism is from DeepSeek's DSpark / DeepSpec release: a parallel drafter with a one-token Markov head against suffix decay, plus confidence-scheduled verification that adapts to fleet load.

Embed this on your site

Paste this HTML where you want the widget. It stays in sync with the live version, and matches your page in light or dark.

<iframe src="https://subhadipmitra.com/instruments/dspark-fleet/embed/" width="100%" height="1280" loading="lazy" style="border:0;max-width:760px" title="Speculation under load — subhadipmitra.com"></iframe>