

Activation probe cost

What does running an activation probe at inference actually cost? Compare a linear probe against the forward pass it rides on, by model and probe scope.

Sources: What runtime interpretability actually costs · Code

Bet 12 / Runtime interpretability

What does an activation probe actually cost?

On paper a probe is a rounding error: one dot product against a hidden state the forward pass already computed. So the "too expensive to run in production" folklore cannot be about the math. It is about the implementation, and the gap between a naive hook and a batched async readout is the whole story.

Widget readouts: arithmetic overhead (%), memory (bytes) overhead (%), serving latency overhead (%), and a verdict.

Chart: where the latency actually goes (log scale), comparing the arithmetic floor with the two implementations.

How this is computed

Per generated token: forward pass ≈ 2 × params FLOPs; each linear readout is 2 × d_model FLOPs. The arithmetic floor is the ratio of the two: total readout FLOPs over forward-pass FLOPs. The bytes version tells the same story: the probe weights for a thousand readouts are single-digit megabytes against the ~14 GB of model weights moved per token.
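The arithmetic above can be checked in a few lines. This is a minimal sketch, not the widget's code: the function name `probe_overhead` is hypothetical, and the example shape (7e9 parameters, d_model = 4096, 2 bytes per weight) is an assumption chosen only because it is consistent with the ~14 GB figure.

```python
def probe_overhead(params, d_model, n_probes, bytes_per_weight=2):
    """Arithmetic and bytes floor for linear readouts, per generated token."""
    forward_flops = 2 * params                  # forward pass ≈ 2 × params FLOPs
    probe_flops = n_probes * 2 * d_model        # each readout is 2 × d_model FLOPs
    model_bytes = params * bytes_per_weight     # weights moved per token
    probe_bytes = n_probes * d_model * bytes_per_weight
    return {
        "flops_ratio": probe_flops / forward_flops,
        "bytes_ratio": probe_bytes / model_bytes,
        "probe_megabytes": probe_bytes / 1e6,
    }


# Assumed illustrative shape: 7e9 params, d_model = 4096, fp16 weights (~14 GB).
result = probe_overhead(params=7e9, d_model=4096, n_probes=1000)
print(f"arithmetic overhead: {result['flops_ratio']:.4%}")
print(f"bytes overhead:      {result['bytes_ratio']:.4%}")
print(f"probe weights:       {result['probe_megabytes']:.3f} MB")
```

Under these assumed numbers, a thousand readouts stay well under a tenth of a percent of the forward pass on both FLOPs and bytes, which is the floor the two implementations are compared against.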

  • Naive hook: a Python forward hook breaks CUDA graph capture, forcing the decode path back to eager. An open vLLM internals plug-in pencils a decode-time mode at ~25% throughput; naive hooks run an order of magnitude over baseline. Per-readout device-to-host copies add synchronization points on top.
  • Batched async: pack all probes at a layer into one weight matrix, do a single small matmul inside the graph, accumulate in a device buffer, export asynchronously. Cost stays in low single digits.
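The packing step in the batched approach can be sketched as follows. This is a toy illustration using NumPy as a stand-in for the on-device matmul: it shows only that stacking per-probe weight vectors into one matrix gives the same readouts from a single matmul. It does not model CUDA graph capture, the device buffer, or asynchronous export, and all sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_probes, batch = 64, 8, 4  # toy sizes, chosen for illustration only

# One weight vector and bias per probe, as if each had been trained separately.
probe_weights = [rng.standard_normal(d_model) for _ in range(n_probes)]
probe_biases = [float(rng.standard_normal()) for _ in range(n_probes)]

# Stand-in for the hidden states at one layer for one decode step.
hidden = rng.standard_normal((batch, d_model))

# Per-probe readout: one dot product (and, in a real hook, one launch) per probe.
naive = np.stack(
    [hidden @ w + b for w, b in zip(probe_weights, probe_biases)], axis=1
)

# Packed readout: all probes at this layer in one (d_model, n_probes) matrix.
packed_weights = np.stack(probe_weights, axis=1)
packed_biases = np.array(probe_biases)
packed = hidden @ packed_weights + packed_biases  # single small matmul

print(packed.shape)                # (batch, n_probes)
print(np.allclose(naive, packed))  # both paths give the same readouts
```

The design point is that the number of probes changes the width of one small matrix rather than the number of operations issued per token.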

The cost was never in the math. That is the argument of the post. Overhead figures here are illustrative bands anchored to those measured numbers.


Subhadip Mitra