How expensive is it to run a linear probe at inference time?

On compute, almost nothing. A linear probe is a dot product between a hidden state and a learned weight vector, roughly 8,192 FLOPs per token for a 7B model, about 6 × 10⁻⁷ of the forward pass it rides on, or 0.00006 percent. Even a thousand readouts per token comes to about 0.06 percent.

If probes are so cheap, why do people say they are too expensive to run in production?

The cost is not in the probe math. It is in how probes are usually implemented. A Python forward hook can break CUDA graph capture and force the decode path back to eager mode, and copying scores to the CPU every step inserts a synchronization point into an asynchronous pipeline. Those are implementation artifacts, not intrinsic costs of probing.

What is the difference between the naive and batched probe implementations?

The naive path attaches a Python hook per layer and copies scores to the CPU each decode step, which breaks graph capture and adds syncs. The batched path packs all probes at a layer into one weight matrix, does a single small matmul on the GPU inside the graph, accumulates scores in a device buffer, and exports asynchronously.

Does probe overhead get worse on larger models?

No, it gets better. The forward pass grows faster than the probe readout does, so overhead as a fraction of total compute shrinks as the model grows, which is exactly where the safety stakes are highest.

What Runtime Interpretability Actually Costs, Part 1: The Case for Measuring It

TL;DR: The claim that activation probes are too expensive to run in production is folklore. On compute it is wrong by six orders of magnitude: a linear probe is a dot product, roughly 0.00006 percent of the forward pass it rides on. Any real cost hides in implementation, in CUDA graph breaks, per-step device-to-host copies, and batching bookkeeping, none of which is intrinsic to probing. The systems community has already measured the adjacent problem, general model-internal observability, and the numbers back this up: naive hooks run an order of magnitude over baseline while a well-engineered asynchronous readout costs low single digits. What nobody has measured is the case that decides a safety architecture, an always-on calibrated probe library reported with operator-grade statistics. This is Part 1: the argument, what prior work already settles, the gap it leaves, and the harness aimed at that gap. Part 2 has the numbers.

Back in March, in the model honesty post, I wrote that I wouldn’t run probes on every request because the latency cost is real. It reads like engineering judgment. It was folklore. I had never measured it, so I went to find who had. The answer turned out to be more interesting than “nobody”: the systems community has already measured the adjacent problem, general model-internal observability, and those numbers puncture the folklore on their own. A naive forward hook runs an order of magnitude over baseline, and a readout that stays on the GPU and exports asynchronously costs low single digits. What is still missing is the measurement an operator actually needs before staking a safety case on it, and that gap is what this series is about.

It matters more than a benchmark footnote should, because this number quietly decides the architecture of every runtime monitoring stack being proposed right now. If activation probes are expensive, they get relegated to offline audits and sampled traffic, and the safety case for deployed agents rests on text-level signals alone. If they’re cheap, they belong in the serving path, on every token, always on. Those are two completely different systems, and once you step outside the observability papers, we’re still choosing between them on vibes.

So this is part 1 of settling it properly: the arithmetic, what the existing measurements already show, the specific gap they leave for a safety-probe library, and the design of a harness aimed straight at that gap.

The arithmetic says probes are free

Start with what a probe actually is at inference time. A linear probe is a dot product between a hidden state and a learned weight vector. That’s it. For a 7B-class model with hidden dimension 4096, one probe readout at one layer costs 4096 multiply-accumulates, call it 8,192 FLOPs per token.

The forward pass it’s riding on costs roughly 2 FLOPs per parameter per token, so about 14 GFLOPs per token for the 7B model. The ratio is 6 × 10⁻⁷. The probe is 0.00006 percent of the work you were already doing.

Scale it up to something aggressive: 100 probes at each of 10 layers, a thousand readouts per token. Now you’re at 8.2 MFLOPs per token, which is 0.06 percent of the forward pass. Go to a 70B model and the ratio gets better, because the forward pass grows faster than the probe does.

On compute alone, the “probes are too expensive” position is not just wrong, it’s wrong by six orders of magnitude. If the folklore is true, the cost has to be hiding somewhere else.
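The arithmetic above is small enough to check in a few lines. This is a minimal sketch, assuming the figures used in the text (hidden dimension 4096, roughly 7 billion parameters, 2 FLOPs per parameter per token). The function names are illustrative, and the 70B hidden dimension of 8192 is my assumption, typical of that model class rather than a number from this post.

```python
# Back-of-the-envelope FLOP accounting for linear probes.
# Assumed inputs: hidden size 4096, ~7e9 parameters, and roughly
# 2 FLOPs per parameter per token for the forward pass.

def probe_flops(hidden_dim: int, readouts: int = 1) -> int:
    """FLOPs for `readouts` dot products: one multiply and one add per element."""
    return 2 * hidden_dim * readouts


def forward_flops(params: float) -> float:
    """Rough forward-pass FLOPs per token (2 per parameter)."""
    return 2.0 * params


def overhead_ratio(hidden_dim: int, params: float, readouts: int = 1) -> float:
    """Probe FLOPs as a fraction of the forward pass they ride on."""
    return probe_flops(hidden_dim, readouts) / forward_flops(params)


if __name__ == "__main__":
    single = overhead_ratio(4096, 7e9)
    thousand = overhead_ratio(4096, 7e9, readouts=1000)
    # Hidden size 8192 for a 70B model is an assumption, not from the post.
    big = overhead_ratio(8192, 70e9, readouts=1000)
    print(f"1 readout, 7B:       {single:.1e} ({single * 100:.5f}%)")
    print(f"1,000 readouts, 7B:  {thousand:.1e} ({thousand * 100:.2f}%)")
    print(f"1,000 readouts, 70B: {big:.1e}")
```

Under those assumptions the 70B ratio comes out lower than the 7B one, because parameter count grows faster than hidden dimension.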

The arithmetic, in FLOPs

One linear probe ÷ the 7B forward pass it rides on: 6 × 10⁻⁷

Readouts per token                    Share of forward pass
1                                     0.00006%
1,000 (100 probes × 10 layers)        0.06%
1,000 on a 70B model                  ratio shrinks ↓

Wrong by six orders of magnitude. If the folklore is true, the cost is hiding somewhere else.

A systems engineer will object that I picked the flattering denominator, and they’d be half right. Decode is memory-bandwidth bound, not compute bound. You stream the entire weight matrix per token and do almost no arithmetic per byte, so FLOPs were never what set inter-token latency. Fine. Run the same argument in bytes: the probe weights for a thousand readouts are single-digit megabytes to move per token against the ~14 gigabytes you already move for the model, and the score buffer is rounding error. The ratio barely shifts. That the point survives the change of units is the tell. If a real cost exists, it is not in any quantity the arithmetic measures, FLOPs or bytes. It is in the launches, the copies, and the scheduling, and those never show up in a back-of-the-envelope at all.
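The same check in bytes, as a rough sketch. It assumes bf16 (2 bytes per element) for both the probe weights and the model weights, which is what the ~14 GB figure implies for a 7B model. It ignores KV-cache and activation traffic, which would only make the denominator larger. The function names are illustrative.

```python
# Memory-traffic version of the probe overhead estimate.
# Assumption: bf16 storage (2 bytes per element) for probes and model alike.

BYTES_BF16 = 2


def probe_weight_bytes(hidden_dim: int, readouts: int, bytes_per_el: int = BYTES_BF16) -> int:
    """Bytes of probe weights streamed per token for `readouts` probes."""
    return hidden_dim * readouts * bytes_per_el


def model_weight_bytes(params: float, bytes_per_el: int = BYTES_BF16) -> float:
    """Bytes of model weights streamed per decoded token."""
    return params * bytes_per_el


if __name__ == "__main__":
    probe_b = probe_weight_bytes(4096, 1000)
    model_b = model_weight_bytes(7e9)
    print(f"probe weights: {probe_b / 1e6:.1f} MB per token")
    print(f"model weights: {model_b / 1e9:.1f} GB per token")
    print(f"ratio:         {probe_b / model_b:.1e}")
```

With the same element width on both sides, the byte ratio is the FLOP ratio again: hidden dimension times readouts over parameter count.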

Where the cost actually hides

And a real cost may well exist. Modern serving engines are hostile environments for anything that wants to touch intermediate activations. Three suspects, in the order I’d rank them:

CUDA graph breaks. vLLM and friends capture the decode step as a CUDA graph and replay it. It’s one of the larger single wins in modern serving. A naive PyTorch forward hook is Python code executing in the middle of the forward pass, which is incompatible with graph capture, so attaching one can force the whole decode path back to eager mode. You’re not paying for the probe. You’re paying for losing graph replay on everything. This is not a hypothesis: an open vLLM internals plug-in restricts observation to the prefill phase precisely to preserve CUDA graphs, and pencils in a decode-time mode at roughly 25 percent throughput. That is the price of the naive path, already on the record.

Synchronization and device-to-host copies. A hidden state at d=4096 in bf16 is 8 KB per token per tapped layer. The bandwidth is nothing. But if your implementation copies scores to the CPU every decode step, you’ve inserted a synchronization point into a pipeline that lives or dies by staying asynchronous. Small syncs, repeated thousands of times per second, are how fast systems die.

Batching interactions. Continuous batching means the “batch” at any decode step is a shifting mix of requests at different positions. Per-request probe attribution has to slice a packed tensor correctly without adding its own bookkeeping overhead in Python.
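To make the batching concern concrete, here is one way per-request attribution could avoid a Python loop over requests. It is a sketch under assumptions, not how any particular engine does it. NumPy stands in for the GPU tensor library, `request_slot` is a hypothetical per-row index mapping each packed token to its request, and a running maximum is just one plausible aggregate.

```python
import numpy as np


def attribute_scores(running_max: np.ndarray,
                     scores: np.ndarray,
                     request_slot: np.ndarray) -> np.ndarray:
    """Fold one step's packed probe scores into per-request running maxima.

    running_max:  [num_request_slots, num_probes], updated in place
    scores:       [num_tokens, num_probes] for the packed batch at this step
    request_slot: [num_tokens] slot index of the request owning each row
    """
    # Unbuffered scatter-max: rows sharing a slot are all applied, so a
    # request that contributes several tokens in one step is handled correctly.
    np.maximum.at(running_max, request_slot, scores)
    return running_max


if __name__ == "__main__":
    running_max = np.full((3, 2), -np.inf, dtype=np.float32)
    scores = np.array([[0.1, 0.9], [0.5, 0.2], [0.7, 0.3]], dtype=np.float32)
    request_slot = np.array([2, 0, 2])  # two rows belong to slot 2, none to slot 1
    print(attribute_scores(running_max, scores, request_slot))
```

The point of the scatter form is that the bookkeeping is a single vectorized operation whose cost does not depend on how the batch happens to be composed at that step.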

Notice what all three have in common: none of them are intrinsic to probing. They’re artifacts of implementation strategy. Which suggests the real claim worth testing is not “probes are cheap” or “probes are expensive” but something more specific:

Probing is nearly free if the readout stays on the GPU and plays nice with graph capture. It’s expensive only when implemented the way a researcher would implement it in a notebook.

Here’s the shape of both paths:

Path A · the notebook way
Python forward hook, per layer → CUDA graph break, back to eager mode → copy scores to CPU every step → sync point in the decode loop
You pay by losing graph replay on everything.

Path B · the serving way
pack all probes per layer into one matrix → one small matmul, inside the graph → scores accumulate in a device buffer → async export every N steps
Marginal cost: a few microseconds per step.

Same tap points, same probes. The only thing that changes is where the readout runs and whether it survives graph capture.

Path A is what every interpretability codebase does today, because interpretability codebases were built for analysis, not serving. Path B packs all probes at a layer into a single weight matrix, does one small matmul per tapped layer inside the graph, accumulates scores in a device buffer, and exports asynchronously. The hypothesis is that Path B’s marginal cost is a few microseconds per step, which at typical inter-token latencies is measurement noise.
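The data flow of Path B can be sketched without a GPU. The version below is illustrative only. NumPy stands in for device tensors, the class and parameter names are hypothetical, and the "export" is a plain copy into a list where a real engine would issue a non-blocking device-to-host transfer. It shows the three moves from the paragraph above: pack the probes, do one matmul per tapped layer, and accumulate into a preallocated buffer that is drained every N steps.

```python
import numpy as np


class PackedProbeReadout:
    """All probes at one tapped layer, packed into a single weight matrix."""

    def __init__(self, weights, bias, max_tokens: int, export_every: int):
        self.weights = np.asarray(weights, dtype=np.float32)  # [num_probes, hidden_dim]
        self.bias = np.asarray(bias, dtype=np.float32)        # [num_probes]
        self.export_every = export_every
        num_probes = self.weights.shape[0]
        # Preallocated, fixed-shape score buffer with one slot per decode step.
        # Fixed shapes are what a captured graph would need (assumption).
        self.buffer = np.zeros((export_every, max_tokens, num_probes), dtype=np.float32)
        self.step = 0
        self.exported = []  # stand-in for an asynchronous export queue

    def readout(self, hidden: np.ndarray) -> None:
        """hidden: [max_tokens, hidden_dim] activations at the tapped layer."""
        slot = self.step % self.export_every
        # One small matmul scores every probe for every token in this step.
        self.buffer[slot] = hidden @ self.weights.T + self.bias
        self.step += 1
        if self.step % self.export_every == 0:
            # A real engine would copy off the hot path without blocking decode.
            self.exported.append(self.buffer.copy())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probes = PackedProbeReadout(
        weights=rng.standard_normal((100, 4096)),
        bias=np.zeros(100),
        max_tokens=8,
        export_every=4,
    )
    for _ in range(8):
        probes.readout(rng.standard_normal((8, 4096)).astype(np.float32))
    print(len(probes.exported), probes.exported[0].shape)
```

Each row of the packed result is the same dot product a per-probe hook would compute. Only the number of kernel launches and host round trips changes.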

There’s a Path C too: fuse the readout into an existing kernel’s epilogue so it costs approximately nothing even in launch overhead. I’ve written enough Triton this year to believe that’s a weekend once Path B works. But I want to know whether it’s even necessary before writing it.

The experiment

The harness is deliberately boring, which is the point. Boring benchmarks are reproducible benchmarks.

Load generator (fixed arrival rate) → Serving engine (vLLM under test) → Conditions → Metrics → Statistics

Conditions: Baseline · no probes; Naive · hooks + copy; Batched · on-GPU readout; Fused · kernel epilogue*
Metrics: TTFT, inter-token latency, throughput at p50 / p95 / p99
Statistics: 5+ runs each, warmup discarded, bootstrap 95% CI

* Fused is the optional fourth condition, built only if Path B leaves a gap worth closing.

The details that matter:

Four conditions on identical hardware and identical workloads: baseline vLLM with no probes attached, the naive hook implementation, the batched on-GPU implementation, and eventually the fused one. Each condition sweeps probe count (1, 10, 100 per layer) and tap depth (1, 4, 10 layers).

Metrics are the ones operators actually care about: time to first token, inter-token latency, and sustained throughput at a fixed request arrival rate, reported at p50, p95, and p99. Plus GPU memory delta and, critically, whether CUDA graph capture survived, because that single boolean probably explains most of whatever gap shows up.
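For the percentile reporting, something as small as the following would do. It is a sketch using only the standard library. The function names are mine, and the "inclusive" quantile method is one reasonable choice among several, not a claim about what the harness uses.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50 / p95 / p99 of a list of latency samples in milliseconds."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def relative_overhead_pct(baseline_ms: float, condition_ms: float) -> float:
    """Overhead of a probed condition relative to baseline, in percent."""
    return (condition_ms - baseline_ms) / baseline_ms * 100.0


if __name__ == "__main__":
    baseline = latency_percentiles([float(x) for x in range(1, 102)])
    print(baseline)
    print(relative_overhead_pct(20.0, 20.3))
```

Comparing conditions percentile by percentile matters here because a sync point can leave the median untouched and still move p99.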

Every number gets at least five runs, warmup discarded, and a bootstrap 95 percent confidence interval. If a difference isn’t distinguishable from run-to-run noise, it gets reported as indistinguishable, not rounded into a story. I built Spark-LLM-Eval because LLM evaluation without confidence intervals is astrology, and I’m not going to commit the same sin in a latency benchmark.
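The "indistinguishable from noise" rule can be written down as a percentile bootstrap on the difference of means. This is a minimal sketch with hypothetical function names, assuming one summary number per run (for example, mean inter-token latency). With only five runs per condition the interval is coarse, so treat it as a floor on rigor rather than the last word.

```python
import random
import statistics


def bootstrap_diff_ci(baseline_runs, condition_runs,
                      n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(condition) - mean(baseline)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each condition independently, with replacement.
        b = rng.choices(baseline_runs, k=len(baseline_runs))
        c = rng.choices(condition_runs, k=len(condition_runs))
        diffs.append(statistics.fmean(c) - statistics.fmean(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def verdict(lo: float, hi: float) -> str:
    """Report a difference only if the interval excludes zero."""
    return "indistinguishable" if lo <= 0.0 <= hi else "distinguishable"


if __name__ == "__main__":
    base = [20.1, 19.9, 20.0, 20.2, 19.8]   # illustrative numbers, not results
    probed = [20.2, 20.0, 19.9, 20.1, 20.0]
    lo, hi = bootstrap_diff_ci(base, probed)
    print(f"95% CI for difference: [{lo:.3f}, {hi:.3f}] -> {verdict(lo, hi)}")
```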

Predictions, written down before I have results

Two of these are, honestly, no longer wild bets. The observability results above already show naive hooks an order of magnitude over baseline and a well-engineered asynchronous readout in the low single digits. What none of them reports is the number for a calibrated safety-probe library run on every decode token, with the tail-latency statistics an operator would demand before trusting it. So these are falsifiable predictions for my specific configuration, not the general one, and I will be held to them.

1. The naive implementation (Path A) costs 15 to 40 percent of throughput on a 7B model, and almost all of it traces to the graph break and per-step synchronization rather than the probe math itself.
2. The batched on-GPU implementation (Path B) costs under 2 percent of throughput at 100 probes across 10 layers, and under 1 percent in the configurations anyone would actually run.
3. The fused version is statistically indistinguishable from baseline, making it a nice-to-have rather than a requirement.
4. Overhead as a fraction shrinks as model size grows, so the story gets more favorable exactly where the safety stakes are highest.

The ledger · written before the numbers exist

#    Prediction                                       Predicted     Part 2
P1   Naive path (A) throughput cost, 7B               15–40%        ◯
P2   Batched path (B) cost, 100 probes × 10 layers    < 2%          ◯
P3   Fused path (C) vs baseline                       ≈ baseline    ◯
P4   Overhead fraction vs model size                  shrinks ↓     ◯

Four falsifiable claims. Part 2 fills the last column, whichever way it lands.

If prediction 2 holds for the safety-probe case, the last serious version of the objection dies with it. Not weakens. Dies. “We can’t afford to run probes in production” stops being a claim about the technique and becomes, at most, a claim about your implementation and your tolerance for a couple of percent.

What it unlocks

The reason to care about a percentage point of throughput is that it determines the monitoring policy you can honestly defend:

The measured number decides the policy you can defend

Overhead    Policy
≤ 2%        Always on. Every token, every request.
2–10%       Selective. On for agent actions and tool calls.
> 10%       Offline only. Audits and sampled traffic.

One benchmark number moves you between three architectures. That is why it is worth measuring instead of guessing.
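Written as code, the ladder is a three-way threshold. This is a sketch with hypothetical names. The boundaries are the ones listed above, and which overhead figure to feed it (a median, a p99, the upper end of a confidence interval) is a separate decision that this does not settle.

```python
def monitoring_policy(overhead_pct: float) -> str:
    """Map measured throughput overhead (percent) to a monitoring policy."""
    if overhead_pct <= 2.0:
        return "always_on"      # every token, every request
    if overhead_pct <= 10.0:
        return "selective"      # agent actions and tool calls
    return "offline_only"       # audits and sampled traffic


if __name__ == "__main__":
    for measured in (0.8, 2.0, 6.5, 25.0):
        print(f"{measured:>5.1f}% -> {monitoring_policy(measured)}")
```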

That ladder hides an assumption worth stating plainly: it is a monitoring ladder, not an enforcement one. If all you want is to log a score, decorate a trace, or raise an alert, the readout can leave the critical path entirely, and Path B’s async export costs you nothing. But the moment a probe is allowed to act, to block a tool call or halt a generation, you need its verdict before you commit the token, and “before you commit the token” is a synchronization point by definition. Enforcement puts the score back on the hot path that Path B worked to clear. So the honest ladder has two columns, not one: the overhead of always-on monitoring, which I expect to be trivial, and the overhead of always-on enforcement, which is a harder number and a different benchmark. This series measures the first. The second is its own post, and I suspect its answer is exactly where selective probing earns its keep.

In the March post I sketched a three-layer honesty stack: confessions, chain-of-thought monitoring, and probes. The open question was always whether the probe layer could live in the serving path or had to stay in the lab. The observability community has been answering the general version of that question in public, wiring activation readouts into the serving engines themselves. Activation-level signals used to be the layer with the richest research results and the thinnest production tooling. That is changing fast. What has not been built is the part that matters for safety: not observability in general, but a calibrated, always-on probe layer measured to the standard you would hold a production safety control to.

None of this is hypothetical, and it is further along than I would have guessed. The systems community has built model-internal observability into serving engines and measured its cost: an asynchronous, GPU-resident readout in the low single digits of overhead, the naive hook path an order of magnitude worse, on real hardware with real workloads. The mechanism is settled and the folklore is dead. Two things are still open. Those results report medians for a debugging and observability use case, not the tail-latency statistics you need to run a safety control on every token. And none of them is a calibrated probe library wired to a monitoring policy. The runtime has been measured. The safety primitive that sits on top of it has not.

So that’s what I’m building, on top of the observability substrates the systems people just shipped rather than reinventing them. The harness here is the first milestone: standard tap points, a packed on-GPU readout, a calibrated probe library, latency reported to operator standards, and an adversarial eval asking whether a model that knows it’s being probed can evade detection. That last question is where this connects back to the Red Queen work, and it deserves its own post.

Part 2 will have the numbers, whichever way they land. If I’m wrong about prediction 2, that’s a more interesting post than if I’m right, and it redirects the design toward selective probing of high-stakes actions, which the March post already argued for. Either outcome is a measured number for the case that actually matters, which is the whole point.

If you’re running probes in production today, or you’ve measured any of this and I’ve missed your write-up, I want to hear from you: contact@subhadipmitra.com

References

Enabling Performant and Flexible Model-Internal Observability for LLM Inference (DMI-Lib). (2026). arXiv:2605.11093
vLLM Hook v0: A Plug-in for Programming Model Internals on vLLM. (2026). arXiv:2603.06588
Observation Plugin for Intercepting and Routing on Activations. vLLM RFC. vllm-project/vllm#36998
A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification. (2026). arXiv:2601.13288

Cite this article

Mitra, Subhadip. (2026, July). What Runtime Interpretability Actually Costs, Part 1: The Case for Measuring It. Subhadip Mitra. Retrieved from https://subhadipmitra.com/blog/2026/runtime-interpretability-cost/

@article{mitra2026what-runtime-interpretability-actually-costs-part-1-the-case-for-measuring-it,
  title   = {What Runtime Interpretability Actually Costs, Part 1: The Case for Measuring It},
  author  = {Mitra, Subhadip},
  journal = {Subhadip Mitra},
  year    = {2026},
  month   = {Jul},
  url     = {https://subhadipmitra.com/blog/2026/runtime-interpretability-cost/}
}
