Instrument · Inference
Memory-bandwidth roofline
Is a decode workload memory-bound or compute-bound? Plug in the model, batch size, precision, and GPU, and watch the operating point cross the ridge.
Sources: Making LLMs faster · Code
Bet 02 / Memory bandwidth
Memory-bound, or compute-bound? And what moves the wall.
Decode reads the whole model from memory for every token and does only about two FLOPs per parameter it reads, so it lives far below the ridge point. Prefill is the opposite: it processes the whole prompt in parallel, so each weight read is reused across many tokens. The roofline is the diagnosis; the levers your research pulls, KV-cache pressure and kernel efficiency, sit on it too. Set your configuration and watch where the operating point lands and how much headroom is left.
State
Arithmetic intensity (FLOP/byte)
Ridge point (FLOP/byte)
KV cache vs weights (%)
Effective bandwidth
Time / token
Throughput (tok/s)
Compute utilization (%)
How this is computed
- Decode reads the weights once per token plus the whole KV cache: intensity = 2·params·B / (weight bytes + KV bytes), which is tiny, so you are memory-bound.
- Prefill processes all prompt tokens in parallel, amortizing the weight load: intensity ≈ 2·context·B / bytes-per-param, high enough to be compute-bound.
- KV cache = 2 · layers · context · B · (d_model / GQA groups) · bytes. At long context it rivals the weights, which is why decode gets slower as the conversation grows.
- Kernel efficiency is the fraction of peak bandwidth actually reached. Effective bandwidth = efficiency × peak, so a PyTorch kernel at 11% and a Triton kernel at 88% differ by ~8× on the same memory-bound op.
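The four formulas above can be written out as a minimal Python sketch. The function names and the example configuration (a 7-billion-parameter model with 32 layers, d_model of 4096, a GQA divisor of 4, 2-byte precision, and a 4096-token context) are assumptions chosen for illustration, not values taken from the instrument.

```python
def kv_cache_bytes(layers, context, batch, d_model, gqa_groups, bytes_per_elem):
    # The leading 2 counts one key tensor and one value tensor per layer.
    return 2 * layers * context * batch * (d_model / gqa_groups) * bytes_per_elem


def decode_intensity(params, batch, weight_bytes, kv_bytes):
    # FLOPs per byte for one decode step: 2 FLOPs per parameter per sequence,
    # divided by everything read from memory (weights plus KV cache).
    return 2 * params * batch / (weight_bytes + kv_bytes)


def prefill_intensity(context, batch, bytes_per_param):
    # Weights are loaded once and reused across every prompt token.
    return 2 * context * batch / bytes_per_param


def effective_bandwidth(efficiency, peak_bandwidth):
    # Kernel efficiency is the fraction of peak bandwidth actually reached.
    return efficiency * peak_bandwidth


# Hypothetical configuration (assumed values, for illustration only).
params = 7e9
bytes_per_param = 2
weight_bytes = params * bytes_per_param

kv_bytes = kv_cache_bytes(layers=32, context=4096, batch=1,
                          d_model=4096, gqa_groups=4, bytes_per_elem=2)
decode_ai = decode_intensity(params, 1, weight_bytes, kv_bytes)
prefill_ai = prefill_intensity(context=4096, batch=1, bytes_per_param=bytes_per_param)

# Ratio between an 88%-efficient and an 11%-efficient kernel on the same peak.
speedup = effective_bandwidth(0.88, 1.0) / effective_bandwidth(0.11, 1.0)

print(f"KV cache: {kv_bytes / 2**30:.2f} GiB, "
      f"{100 * kv_bytes / weight_bytes:.1f}% of weights")
print(f"decode intensity:  {decode_ai:.2f} FLOP/byte")
print(f"prefill intensity: {prefill_ai:.0f} FLOP/byte")
print(f"kernel speedup:    {speedup:.1f}x")
```

Under these assumed numbers, decode comes out near 1 FLOP/byte while prefill is in the thousands, which is the gap the bullets describe.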
Peak figures are approximate dense tensor throughput and HBM bandwidth per vendor spec, scaled by precision.
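As a rough illustration of how peak figures could turn into the readouts, here is a sketch of a simple roofline estimate for one decode step. It is an assumption about the method, not the instrument's actual code: the ridge point is taken as peak FLOP/s divided by effective bandwidth, and step time as the larger of the memory time and the compute time. The GPU figures (1e15 FLOP/s, 3e12 bytes/s, 80% kernel efficiency) are hypothetical round numbers, not vendor specs.

```python
def roofline_step(flops, bytes_moved, batch, peak_flops, peak_bandwidth, efficiency):
    """Estimate one decode step with a simple roofline model (illustrative only)."""
    eff_bw = efficiency * peak_bandwidth   # bytes/s actually reached
    ridge = peak_flops / eff_bw            # FLOP/byte where the two limits meet
    intensity = flops / bytes_moved        # FLOP/byte of this workload
    t_memory = bytes_moved / eff_bw        # time if limited by memory traffic
    t_compute = flops / peak_flops         # time if limited by arithmetic
    step_time = max(t_memory, t_compute)   # the slower limit sets the pace
    return {
        "state": "memory-bound" if intensity < ridge else "compute-bound",
        "ridge": ridge,
        "intensity": intensity,
        "step_time_s": step_time,
        # Each step emits one token per sequence in the batch.
        "throughput_tok_s": batch / step_time,
        "compute_utilization": t_compute / step_time,
    }


# Hypothetical workload: 7e9 parameters at 2 bytes each, batch 1, KV cache ignored.
params = 7e9
result = roofline_step(
    flops=2 * params * 1,
    bytes_moved=params * 2,
    batch=1,
    peak_flops=1e15,        # assumed peak, FLOP/s
    peak_bandwidth=3e12,    # assumed peak, bytes/s
    efficiency=0.8,         # assumed kernel efficiency
)

for name, value in result.items():
    print(f"{name}: {value}")
```

In this assumed setup the workload sits far to the left of the ridge, so the memory term dominates and compute utilization stays well under 1%.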