Instrument · Inference

Memory-bandwidth roofline

Is a decode workload memory-bound or compute-bound? Plug in the model, batch size, precision, and GPU, and watch the operating point cross the ridge.

Sources: Making LLMs faster · Code

Bet 02 / Memory bandwidth

Memory-bound, or compute-bound? And what moves the wall.

Decode reads the whole model from memory for every token and does almost no arithmetic, so it lives far below the ridge point. Prefill is the opposite. The roofline is the diagnosis; the levers your research pulls, KV-cache pressure and kernel efficiency, are on it too. Set your configuration and watch where it lands and how much headroom is left.
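
For reference, the standard roofline bound behind this picture can be written as below. The symbols are assumptions introduced here, not names from the page: $I$ is arithmetic intensity in FLOP/byte, $P_{\text{peak}}$ is peak compute throughput, and $BW_{\text{peak}}$ is peak memory bandwidth.

```latex
P_{\text{attainable}}(I) = \min\bigl(P_{\text{peak}},\; I \cdot BW_{\text{peak}}\bigr),
\qquad
I_{\text{ridge}} = \frac{P_{\text{peak}}}{BW_{\text{peak}}}
```

A workload with $I < I_{\text{ridge}}$ sits on the sloped, bandwidth-limited part of the roof (memory-bound); one with $I > I_{\text{ridge}}$ sits on the flat part (compute-bound).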

Readouts: state, arithmetic intensity (FLOP/byte), ridge point (FLOP/byte), KV cache vs weights (%), effective bandwidth, time per token, throughput (tok/s), compute utilization (%).
Chart (log-log axes): peak roofline, achieved (kernel) roofline, operating point.

How this is computed
  • Each decode step reads the weights once, plus the whole KV cache, to produce one token per sequence in a batch of size B: intensity = 2·params·B / (weight bytes + KV bytes). At B = 1 with 16-bit weights that is at most about 1 FLOP/byte, so you are memory-bound.
  • Prefill processes all prompt tokens in parallel, amortizing the weight load: intensity ≈ 2·context·B / bytes-per-param, which is usually high enough to be compute-bound.
  • KV cache = 2 · layers · context · B · (d_model / GQA groups) · bytes. The leading 2 counts keys and values, "bytes" is bytes per stored value, and the formula holds when "GQA groups" means the number of query heads sharing each KV head. At long context the cache rivals the weights, which is why decode gets slower as the conversation grows.
  • Kernel efficiency is the fraction of peak bandwidth actually reached. Effective bandwidth = efficiency × peak, so a PyTorch kernel at 11% and a Triton kernel at 88% differ by ~8× in runtime on the same memory-bound op.
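
The formulas in the list above can be sketched in a few lines of Python. This is a minimal sketch, not the widget's actual code: the `Config` defaults (an 8B-parameter model, 32 layers, the peak figures) are illustrative assumptions, `gqa_groups` is taken to mean query heads per shared KV head, and the ridge point is measured against peak rather than effective bandwidth.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Every default here is an illustrative assumption, not a value from the widget.
    params: float = 8e9         # parameter count
    layers: int = 32
    d_model: int = 4096
    gqa_groups: int = 4         # assumed meaning: query heads per shared KV head
    context: int = 8192         # tokens held in the KV cache
    batch: int = 1              # B in the formulas
    bytes_per_value: int = 2    # 16-bit weights and KV entries
    peak_flops: float = 989e12  # assumed dense 16-bit peak, FLOP/s
    peak_bw: float = 3.35e12    # assumed peak HBM bandwidth, bytes/s
    efficiency: float = 0.88    # fraction of peak bandwidth the kernel reaches


def kv_cache_bytes(c: Config) -> int:
    # 2 (K and V) * layers * context * B * (d_model / GQA groups) * bytes
    kv_width = c.d_model // c.gqa_groups
    return 2 * c.layers * c.context * c.batch * kv_width * c.bytes_per_value


def decode_intensity(c: Config) -> float:
    # 2 FLOPs per parameter per generated token, over all bytes read in one step
    flops = 2 * c.params * c.batch
    bytes_moved = c.params * c.bytes_per_value + kv_cache_bytes(c)
    return flops / bytes_moved


def prefill_intensity(c: Config) -> float:
    # Weight load amortized over every prompt token: 2 * context * B / bytes-per-param
    return 2 * c.context * c.batch / c.bytes_per_value


def decode_step(c: Config) -> dict:
    flops = 2 * c.params * c.batch
    weight_bytes = c.params * c.bytes_per_value
    kv_bytes = kv_cache_bytes(c)
    effective_bw = c.efficiency * c.peak_bw
    # A step takes as long as the slower of memory traffic and arithmetic.
    seconds = max((weight_bytes + kv_bytes) / effective_bw, flops / c.peak_flops)
    ridge = c.peak_flops / c.peak_bw
    return {
        "state": "memory-bound" if decode_intensity(c) < ridge else "compute-bound",
        "kv_vs_weights_pct": 100 * kv_bytes / weight_bytes,
        "effective_bw": effective_bw,
        "step_seconds": seconds,
        "tokens_per_second": c.batch / seconds,
        "compute_utilization_pct": 100 * flops / seconds / c.peak_flops,
    }


if __name__ == "__main__":
    for name, value in decode_step(Config()).items():
        print(name, value)
```

Comparing `decode_step(Config(efficiency=0.11))` with `decode_step(Config(efficiency=0.88))` reproduces the ~8× gap in step time, since both runs are memory-bound and step time scales inversely with effective bandwidth.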

Peak figures are approximate dense tensor throughput and HBM bandwidth per vendor spec, scaled by precision.
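
As a rough illustration of scaling by precision, the sketch below derives the ridge point from a 16-bit base peak. All numbers and names are assumptions made for this example: the base figures are only roughly in line with published specs for a current datacenter GPU, and the `PRECISION` table (with FP8 assumed to double dense throughput) is hypothetical, not the widget's own table.

```python
# Assumed figures for one accelerator, roughly in line with published vendor specs.
BASE_PEAK_FLOPS = 989e12   # dense 16-bit tensor throughput, FLOP/s
PEAK_BW = 3.35e12          # HBM bandwidth, bytes/s

# Hypothetical precision table: (bytes per parameter, multiplier on the 16-bit peak).
PRECISION = {
    "bf16": (2, 1.0),
    "fp8": (1, 2.0),
}


def peak_flops(precision: str) -> float:
    # Peak compute at this precision, scaled from the 16-bit base figure.
    return BASE_PEAK_FLOPS * PRECISION[precision][1]


def ridge_point(precision: str) -> float:
    # Intensity (FLOP/byte) at which the bandwidth and compute roofs meet.
    return peak_flops(precision) / PEAK_BW


def weight_only_decode_intensity(precision: str, batch: int = 1) -> float:
    # 2*params*B FLOPs over params*bytes of weights; params cancels, KV cache ignored.
    return 2 * batch / PRECISION[precision][0]


if __name__ == "__main__":
    for p in PRECISION:
        print(p, ridge_point(p), weight_only_decode_intensity(p))
```

Under these assumptions, halving the precision doubles both the decode intensity and the ridge point, so single-sequence decode stays memory-bound; the speedup comes from moving fewer bytes per token.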

Subhadip Mitra