Memory-bandwidth roofline

Bet 02 / Memory bandwidth

Memory-bound, or compute-bound? And what moves the wall.

Decode reads the whole model from memory for every token and does almost no arithmetic, so it lives far below the ridge point. Prefill is the opposite: it processes the whole prompt in parallel and lands on the compute-bound side. The roofline is the diagnosis, and the levers your research pulls, KV-cache pressure and kernel efficiency, sit on it too. Set your configuration and watch where it lands and how much headroom is left.
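To make the diagnosis concrete, here is a minimal Python sketch of the ridge-point test. It assumes the standard roofline definition (ridge point = peak FLOP/s divided by peak bytes/s). The accelerator figures and the function names are hypothetical and are not taken from the widget.

```python
def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the memory and compute roofs meet."""
    return peak_flops / peak_bandwidth


def classify(intensity: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Label an operating point by which side of the ridge point it falls on."""
    if intensity < ridge_point(peak_flops, peak_bandwidth):
        return "memory-bound"
    return "compute-bound"


# Hypothetical accelerator: 1000 TFLOP/s dense compute, 2 TB/s memory bandwidth.
PEAK_FLOPS = 1000e12
PEAK_BW = 2e12

print(ridge_point(PEAK_FLOPS, PEAK_BW))       # 500.0 FLOP/byte
print(classify(1.0, PEAK_FLOPS, PEAK_BW))     # memory-bound (decode-like intensity)
print(classify(2000.0, PEAK_FLOPS, PEAK_BW))  # compute-bound (prefill-like intensity)
```

The distance between an operating point's intensity and the ridge point is one way to read the headroom: a decode step at about 1 FLOP/byte on this hypothetical part sits a factor of 500 below the ridge.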

Model: Model size. Sets the weight bytes to stream, plus layers and hidden dimension used for the KV cache.

GPU: The accelerator: its peak compute (FLOP/s) and memory bandwidth set the roofline.

Phase: Decode generates one token per step and is memory-bound. Prefill processes the whole prompt in parallel and is compute-bound.

Precision: The number format for weights and KV cache. Lower precision moves fewer bytes and adds compute throughput.

Batch size: Sequences processed at once. A bigger batch reuses each weight load across more work, raising arithmetic intensity.

Context length: Tokens already in the sequence. In decode every step re-reads the whole KV cache, so long context adds memory traffic that can rival the weights.

GQA groups: Grouped-query attention: how many query heads share each key/value head. A larger group size shrinks the KV cache proportionally.

Kernel efficiency: Fraction of peak bandwidth the kernel actually achieves. PyTorch defaults sit near 11%; a hand-tuned Triton kernel reaches ~88%.
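The interplay between context length and GQA groups can be sketched in a few lines of Python. This follows the KV-cache formula the page gives under "How this is computed"; the 7B-parameter FP16 model shape (32 layers, hidden dimension 4096) is a hypothetical example, not a value read from the widget.

```python
def kv_cache_bytes(layers, context, batch, d_model, gqa_groups, bytes_per_elem):
    """KV-cache size in bytes, per the formula under 'How this is computed'."""
    # The leading 2 counts both keys and values.
    return 2 * layers * context * batch * (d_model / gqa_groups) * bytes_per_elem


# Hypothetical 7B-parameter model stored in FP16 (2 bytes per value).
WEIGHT_BYTES = 7e9 * 2
LAYERS, D_MODEL = 32, 4096

for groups in (1, 8):
    for context in (4096, 32768):
        kv = kv_cache_bytes(LAYERS, context, 1, D_MODEL, groups, 2)
        share = 100 * kv / WEIGHT_BYTES
        print(f"groups={groups} context={context}: KV is {share:.1f}% of weights")
```

Under these assumptions the cache grows linearly with context and shrinks by the group size, which is the trade the two controls expose.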

Readouts: State; Arithmetic intensity (FLOP/byte); Ridge point (FLOP/byte); KV cache vs weights (%); Effective bandwidth; Time / token; Throughput (tok/s); Compute utilization (%).

Chart (log–log axes): peak roofline, achieved (kernel) roofline, and the operating point.
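The peak and achieved roofs named in the chart legend can be sketched as follows. This is an illustrative Python version, not the widget's code. It assumes attainable throughput is the smaller of the compute roof and intensity × bandwidth, and that kernel efficiency scales only the bandwidth term. The hardware figures are hypothetical.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth, efficiency=1.0):
    """Attainable FLOP/s at a given arithmetic intensity (FLOP/byte)."""
    # Memory roof: bytes/s actually delivered times FLOPs performed per byte.
    memory_roof = intensity * efficiency * peak_bandwidth
    # The compute roof caps throughput once the memory roof exceeds it.
    return min(peak_flops, memory_roof)


# Hypothetical accelerator figures, chosen only for illustration.
PEAK_FLOPS = 1000e12  # FLOP/s
PEAK_BW = 2e12        # bytes/s

# Intensities spaced by powers of two, i.e. evenly on a log axis.
intensities = [2.0 ** k for k in range(-2, 13)]
peak_curve = [attainable_flops(i, PEAK_FLOPS, PEAK_BW) for i in intensities]
achieved_curve = [
    attainable_flops(i, PEAK_FLOPS, PEAK_BW, efficiency=0.11) for i in intensities
]

for i, peak, achieved in zip(intensities, peak_curve, achieved_curve):
    print(f"{i:8.2f} FLOP/byte  peak {peak:.3e}  achieved {achieved:.3e}")
```

Dividing the attainable value at the operating point by the compute roof gives one plausible reading of the "Compute utilization" figure, though the page does not state how that readout is defined.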

How this is computed

Decode reads the weights once per token plus the whole KV cache: intensity = 2·params·B / (weight bytes + KV bytes), where B is the batch size. That ratio is tiny, so you are memory-bound.
Prefill processes all prompt tokens in parallel, amortizing the weight load: intensity ≈ 2·context·B / bytes-per-param, high enough to be compute-bound.
KV cache = 2 · layers · context · B · (d_model / GQA groups) · bytes per element. At long context it rivals the weights, which is why decode gets slower as the conversation grows.
Kernel efficiency is the fraction of peak bandwidth actually reached. Effective bandwidth = efficiency × peak, so a PyTorch kernel at 11% and a Triton kernel at 88% differ by ~8× on the same memory-bound op.
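The formulas above can be strung together in a short Python sketch. The intensity expressions follow the text directly. The step-time function is an assumption on my part (bytes moved divided by effective bandwidth, which holds only while the step stays memory-bound), and the model and bandwidth figures are hypothetical.

```python
def decode_intensity(params, batch, weight_bytes, kv_bytes):
    """Decode FLOP/byte: 2*params*B over all bytes streamed in one step."""
    return 2 * params * batch / (weight_bytes + kv_bytes)


def prefill_intensity(context, batch, bytes_per_param):
    """Prefill FLOP/byte: the weight load is amortized over context*B tokens."""
    return 2 * context * batch / bytes_per_param


def decode_step_time(weight_bytes, kv_bytes, peak_bandwidth, efficiency):
    """Assumed memory-bound step time: bytes moved / (efficiency * peak bandwidth)."""
    return (weight_bytes + kv_bytes) / (efficiency * peak_bandwidth)


# Hypothetical setup: 7B parameters in FP16, empty KV cache, 2 TB/s peak bandwidth.
PARAMS, WEIGHT_BYTES, PEAK_BW = 7e9, 14e9, 2e12

print(decode_intensity(PARAMS, 1, WEIGHT_BYTES, 0.0))  # 1.0 FLOP/byte
print(prefill_intensity(2048, 1, 2))                   # 2048.0 FLOP/byte

slow = decode_step_time(WEIGHT_BYTES, 0.0, PEAK_BW, efficiency=0.11)
fast = decode_step_time(WEIGHT_BYTES, 0.0, PEAK_BW, efficiency=0.88)
print(f"step time at 11%: {slow * 1e3:.1f} ms, at 88%: {fast * 1e3:.1f} ms")
print(f"speedup: {slow / fast:.1f}x")
```

With one sequence per step, tokens per second is roughly the reciprocal of the step time, so the 11% versus 88% gap carries straight through to throughput under these assumptions.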

Peak figures are approximate dense tensor throughput and HBM bandwidth per vendor spec, scaled by precision.

Embed this on your site

Paste this HTML where you want the widget. It stays in sync with the live version, and matches your page in light or dark.

<iframe src="https://subhadipmitra.com/instruments/roofline/embed/" width="100%" height="980" loading="lazy" style="border:0;max-width:760px" title="Memory-bandwidth roofline — subhadipmitra.com"></iframe>