TL;DR: The claim that activation probes are too expensive to run in production is folklore. On compute it is wrong by six orders of magnitude: a linear probe is a dot product, roughly 0.00006 percent of the forward pass it rides on. Any real cost hides in implementation, in CUDA graph breaks, per-step device-to-host copies, and batching bookkeeping, none of which is intrinsic to probing. The systems community has already measured the adjacent problem, general model-internal observability, and the numbers back this up: naive hooks run an order of magnitude over baseline while a well-engineered asynchronous readout costs low single digits. What nobody has measured is the case that decides a safety architecture, an always-on calibrated probe library reported with operator-grade statistics. This is Part 1: the argument, what prior work already settles, the gap it leaves, and the harness aimed at that gap. Part 2 has the numbers.
Back in March, in the model honesty post, I wrote that I wouldn’t run probes on every request because the latency cost is real. It reads like engineering judgment. It was folklore. I had never measured it, so I went to find who had. The answer turned out to be more interesting than “nobody”: the systems community has already measured the adjacent problem, general model-internal observability, and those numbers puncture the folklore on their own. A naive forward hook runs an order of magnitude over baseline, and a readout that stays on the GPU and exports asynchronously costs low single digits. What is still missing is the measurement an operator actually needs before staking a safety case on it, and that gap is what this series is about.
It matters more than a benchmark footnote should, because this number quietly decides the architecture of every runtime monitoring stack being proposed right now. If activation probes are expensive, they get relegated to offline audits and sampled traffic, and the safety case for deployed agents rests on text-level signals alone. If they’re cheap, they belong in the serving path, on every token, always on. Those are two completely different systems, and once you step outside the observability papers, we’re still choosing between them on vibes.
So this is Part 1 of settling it properly: the arithmetic, what the existing measurements already show, the specific gap they leave for a safety-probe library, and the design of a harness aimed straight at that gap.
The arithmetic says probes are free
Start with what a probe actually is at inference time. A linear probe is a dot product between a hidden state and a learned weight vector. That’s it. For a 7B-class model with hidden dimension 4096, one probe readout at one layer costs 4096 multiply-accumulates, call it 8,192 FLOPs per token.
The forward pass it’s riding on costs roughly 2 FLOPs per parameter per token, so about 14 GFLOPs per token for the 7B model. The ratio is 6 × 10⁻⁷. The probe is 0.00006 percent of the work you were already doing.
Scale it up to something aggressive: 100 probes at each of 10 layers, a thousand readouts per token. Now you’re at 8.2 MFLOPs per token, which is 0.06 percent of the forward pass. Go to a 70B model and the ratio gets better, because the forward pass grows with parameter count, roughly the square of the hidden dimension times depth, while a single readout grows only linearly in the hidden dimension.
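The arithmetic above is short enough to check in a few lines. This is a sketch using only the figures in the text (hidden dimension 4096, 7B parameters, 2 FLOPs per parameter per token); the function names are mine, for illustration.

```python
# Back-of-the-envelope FLOP accounting for linear probes on a 7B-class model.
HIDDEN_DIM = 4096      # hidden dimension used in the text
PARAMS = 7e9           # "7B-class" parameter count
FLOPS_PER_MAC = 2      # one multiply plus one add


def probe_flops_per_token(probes_per_layer: int, tap_layers: int,
                          hidden_dim: int = HIDDEN_DIM) -> int:
    """FLOPs spent on all linear probe readouts for a single token."""
    return probes_per_layer * tap_layers * hidden_dim * FLOPS_PER_MAC


def forward_flops_per_token(params: float = PARAMS) -> float:
    """Rule of thumb: about 2 FLOPs per parameter per token."""
    return 2 * params


single = probe_flops_per_token(1, 1)       # one probe at one layer
library = probe_flops_per_token(100, 10)   # 100 probes at each of 10 layers
forward = forward_flops_per_token()

print(f"one probe: {single} FLOPs, ratio {single / forward:.1e}")
print(f"1000 readouts: {library / 1e6:.1f} MFLOPs, "
      f"{100 * library / forward:.2f} percent of the forward pass")
```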
On compute alone, the “probes are too expensive” position is not just wrong, it’s wrong by six orders of magnitude. If the folklore is true, the cost has to be hiding somewhere else.
A systems engineer will object that I picked the flattering denominator, and they’d be half right. Decode is memory-bandwidth bound, not compute bound. You stream the full set of model weights per token and do almost no arithmetic per byte, so FLOPs were never what set inter-token latency. Fine. Run the same argument in bytes: the probe weights for a thousand readouts are single-digit megabytes to move per token against the ~14 gigabytes you already move for the model, and the score buffer is rounding error. The ratio barely shifts. That the point survives the change of units is the tell. If a real cost exists, it is not in any quantity the arithmetic measures, FLOPs or bytes. It is in the launches, the copies, and the scheduling, and those never show up in a back-of-the-envelope at all.
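The same check in bytes, as a sketch. It assumes the probe weights are stored in bf16 like the model and that scores are kept as 4-byte floats; both are my assumptions, not something the harness has fixed yet.

```python
# Memory-traffic version of the same argument, per token.
BYTES_PER_BF16 = 2
HIDDEN_DIM = 4096
PARAMS = 7e9


def probe_weight_bytes(n_readouts: int, hidden_dim: int = HIDDEN_DIM) -> int:
    """Bytes of probe weights read per token (assumes bf16 weights)."""
    return n_readouts * hidden_dim * BYTES_PER_BF16


def model_weight_bytes(params: float = PARAMS) -> float:
    """Bytes of model weights streamed per token (bf16)."""
    return params * BYTES_PER_BF16


def score_buffer_bytes(n_readouts: int, bytes_per_score: int = 4) -> int:
    """Bytes of probe scores written per token (assumes fp32 scores)."""
    return n_readouts * bytes_per_score


probe_bytes = probe_weight_bytes(1000)
model_bytes = model_weight_bytes()
score_bytes = score_buffer_bytes(1000)

print(f"probe weights: {probe_bytes / 1e6:.1f} MB")
print(f"model weights: {model_bytes / 1e9:.0f} GB")
print(f"scores: {score_bytes} bytes")
print(f"ratio: {probe_bytes / model_bytes:.1e}")
```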
Where the cost actually hides
And a real cost might well exist. Modern serving engines are hostile environments for anything that wants to touch intermediate activations. Three suspects, in the order I’d rank them:
CUDA graph breaks. vLLM and friends capture the decode step as a CUDA graph and replay it. It’s one of the larger single wins in modern serving. A naive PyTorch forward hook is Python code executing in the middle of the forward pass, which is incompatible with graph capture, so attaching one can force the whole decode path back to eager mode. You’re not paying for the probe. You’re paying for losing graph replay on everything. This is not a hypothesis: an open vLLM internals plug-in restricts observation to the prefill phase precisely to preserve CUDA graphs, and pencils in a decode-time mode at roughly 25 percent throughput. That is the price of the naive path, already on the record.
Synchronization and device-to-host copies. A hidden state at d=4096 in bf16 is 8 KB per token per tapped layer. The bandwidth is nothing. But if your implementation copies scores to the CPU every decode step, you’ve inserted a synchronization point into a pipeline that lives or dies by staying asynchronous. Small syncs, repeated thousands of times per second, are how fast systems die.
Batching interactions. Continuous batching means the “batch” at any decode step is a shifting mix of requests at different positions. Per-request probe attribution has to slice a packed tensor correctly without adding its own bookkeeping overhead in Python.
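For the batching suspect, here is one way per-request attribution could stay in tensor ops rather than a Python loop. It is a sketch under assumptions: the packed layout, the `request_slot` index tensor, and the choice of a running maximum as the per-request aggregate are all hypothetical, and nothing here is vLLM's actual API.

```python
import torch


def update_request_scores(running_max: torch.Tensor,
                          hidden: torch.Tensor,
                          probe_weights: torch.Tensor,
                          request_slot: torch.Tensor) -> torch.Tensor:
    """Fold one step's probe scores into a per-request running maximum.

    running_max:   [num_slots, n_probes], updated in place
    hidden:        [n_tokens, d], packed rows from a mix of requests
    probe_weights: [d, n_probes]
    request_slot:  [n_tokens] int64, which request slot each row belongs to
    """
    scores = hidden @ probe_weights                        # [n_tokens, n_probes]
    index = request_slot.unsqueeze(1).expand_as(scores)    # same shape as scores
    # One scatter handles every request at once; rows that share a slot
    # are reduced together with max.
    running_max.scatter_reduce_(0, index, scores, reduce="amax",
                                include_self=True)
    return running_max


# Three packed rows: two belong to slot 0, one to slot 2; slot 1 is idle.
hidden = torch.tensor([[1.0, 5.0], [3.0, 4.0], [2.0, 0.0]])
probe_weights = torch.eye(2)            # identity probes keep the demo readable
request_slot = torch.tensor([0, 2, 0])
running_max = torch.full((3, 2), float("-inf"))

update_request_scores(running_max, hidden, probe_weights, request_slot)
print(running_max)
```

Idle slots stay at negative infinity, which makes "never scored" distinguishable from "scored low" when the buffer is eventually exported.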
Notice what all three have in common: none of them is intrinsic to probing. They’re artifacts of implementation strategy. Which suggests the real claim worth testing is not “probes are cheap” or “probes are expensive” but something more specific:
Probing is nearly free if the readout stays on the GPU and plays nice with graph capture. It’s expensive only when implemented the way a researcher would implement it in a notebook.
Here’s the shape of both paths:
Path A is what every interpretability codebase does today, because interpretability codebases were built for analysis, not serving. Path B packs all probes at a layer into a single weight matrix, does one small matmul per tapped layer inside the graph, accumulates scores in a device buffer, and exports asynchronously. The hypothesis is that Path B’s marginal cost is a few microseconds per step, which at typical inter-token latencies is measurement noise.
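A minimal PyTorch sketch of the two paths, to make the contrast concrete. It is not the harness: a single `nn.Linear` stands in for a transformer layer, `path_a_hook` and `PackedReadout` are illustrative names, and it runs on CPU, so it shows the structure and checks that both paths compute the same scores without measuring anything.

```python
import torch
import torch.nn as nn


def path_a_hook(probes, log):
    """Path A: Python hook, one dot product per probe, scores pulled to host."""
    def hook(module, inputs, output):
        # .tolist() is a device-to-host copy; on a GPU it forces a sync
        # on every decode step.
        log.append([(output @ w).tolist() for w in probes])
    return hook


class PackedReadout(nn.Module):
    """Path B: all probes at a layer packed into one matrix, scores kept on device."""

    def __init__(self, probes, max_steps, max_batch):
        super().__init__()
        self.register_buffer("weight", torch.stack(probes, dim=1))  # [d, n_probes]
        self.register_buffer(
            "scores", torch.zeros(max_steps, max_batch, len(probes)))

    def forward(self, hidden, step):
        # One small matmul, written into a preallocated buffer, no host copy.
        # Caveat: a graph-captured version would need a device-side step
        # index; a Python int is used here only for clarity.
        out = self.scores[step, : hidden.shape[0]]
        out.copy_(hidden @ self.weight)
        return out

    def export(self, n_steps):
        # On CUDA, a non_blocking copy into pinned host memory can overlap
        # with compute; on CPU this is just a view of the buffer.
        return self.scores[:n_steps].to("cpu", non_blocking=True)


torch.manual_seed(0)
d, n_probes, batch, steps = 16, 4, 3, 5
layer = nn.Linear(d, d)                       # stand-in for a tapped layer
probes = [torch.randn(d) for _ in range(n_probes)]

naive_log = []
handle = layer.register_forward_hook(path_a_hook(probes, naive_log))
readout = PackedReadout(probes, max_steps=steps, max_batch=batch)

with torch.no_grad():
    for step in range(steps):
        hidden = layer(torch.randn(batch, d))  # Path A hook fires here
        readout(hidden, step)                  # Path B readout
handle.remove()

naive = torch.tensor(naive_log)                # [steps, n_probes, batch]
packed = readout.export(steps)                 # [steps, batch, n_probes]
print(torch.allclose(naive.transpose(1, 2), packed, atol=1e-5))
```

The two paths agree on the scores. The difference is entirely in where the work happens: Path A runs Python and touches the host on every step, while Path B leaves a single matmul and a buffer write on the device.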
There’s a Path C too: fuse the readout into an existing kernel’s epilogue so it costs approximately nothing even in launch overhead. I’ve written enough Triton this year to believe that’s a weekend once Path B works. But I want to know whether it’s even necessary before writing it.
The experiment
The harness is deliberately boring, which is the point. Boring benchmarks are reproducible benchmarks.
The details that matter:
Four conditions on identical hardware and identical workloads: baseline vLLM with no probes attached, the naive hook implementation, the batched on-GPU implementation, and eventually the fused one. Each condition sweeps probe count (1, 10, 100 per layer) and tap depth (1, 4, 10 layers).
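The sweep can be enumerated up front so every run is named before anything executes. A sketch: the condition labels and the `RunConfig` type are illustrative, and I am assuming the baseline runs once with no probes rather than being swept.

```python
from dataclasses import dataclass
from itertools import product

CONDITIONS = ("baseline", "naive_hook", "batched_gpu", "fused")
PROBES_PER_LAYER = (1, 10, 100)
TAP_LAYERS = (1, 4, 10)


@dataclass(frozen=True)
class RunConfig:
    condition: str
    probes_per_layer: int
    tap_layers: int

    @property
    def readouts_per_token(self) -> int:
        return self.probes_per_layer * self.tap_layers


def build_grid() -> list[RunConfig]:
    """Baseline once (assumption: nothing to sweep), every other condition swept."""
    grid = [RunConfig("baseline", 0, 0)]
    grid += [
        RunConfig(condition, probes, layers)
        for condition, probes, layers in product(
            CONDITIONS[1:], PROBES_PER_LAYER, TAP_LAYERS)
    ]
    return grid


grid = build_grid()
print(len(grid), "configurations")
print(max(cfg.readouts_per_token for cfg in grid), "readouts per token at most")
```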
Metrics are the ones operators actually care about: time to first token, inter-token latency, and sustained throughput at a fixed request arrival rate, reported at p50, p95, and p99. Plus GPU memory delta and, critically, whether CUDA graph capture survived, because that single boolean probably explains most of whatever gap shows up.
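A sketch of how the latency metrics could be derived from raw timestamps, using only the standard library. The function names and the timestamp format (seconds, one entry per emitted token) are assumptions for illustration.

```python
import statistics


def request_latencies(start: float, token_times: list[float]):
    """Time to first token and inter-token latencies for one request."""
    ttft = token_times[0] - start
    itl = [later - earlier
           for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl


def latency_summary(samples: list[float]) -> dict[str, float]:
    """p50, p95, and p99 of a list of latency samples."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


ttft, itl = request_latencies(0.0, [0.05, 0.07, 0.10])
summary = latency_summary([float(ms) for ms in range(1, 102)])
print(ttft, itl)
print(summary)
```

The tail percentiles only mean something with enough samples behind them, so in practice the summary would pool inter-token latencies across all requests in a run rather than be computed per request.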
Every number gets at least five runs, warmup discarded, and a bootstrap 95 percent confidence interval. If a difference isn’t distinguishable from run-to-run noise, it gets reported as indistinguishable, not rounded into a story. I built Spark-LLM-Eval (repo) because LLM evaluation without confidence intervals is astrology, and I’m not going to commit the same sin in a latency benchmark.
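The reporting rule can be sketched as a percentile bootstrap on the difference between two conditions. This is a plain standard-library version with illustrative names and made-up throughput numbers, not Spark-LLM-Eval's API; with only five runs per condition the interval is coarse, which is an argument for more runs, not for skipping the interval.

```python
import random
import statistics


def bootstrap_diff_ci(baseline, treatment, n_resamples=2000,
                      alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(treatment) - mean(baseline).

    Each condition is resampled independently, with replacement.
    """
    rng = random.Random(seed)
    diffs = sorted(
        statistics.mean(rng.choices(treatment, k=len(treatment)))
        - statistics.mean(rng.choices(baseline, k=len(baseline)))
        for _ in range(n_resamples)
    )
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def distinguishable(ci) -> bool:
    """A difference counts only if the interval excludes zero."""
    lo, hi = ci
    return not (lo <= 0.0 <= hi)


# Illustrative numbers only, not measurements.
baseline_runs = [100.0, 101.0, 99.5, 100.5, 100.2]
same_runs = list(baseline_runs)
slower_runs = [x + 20.0 for x in baseline_runs]

ci_same = bootstrap_diff_ci(baseline_runs, same_runs)
ci_slower = bootstrap_diff_ci(baseline_runs, slower_runs)
print(ci_same, distinguishable(ci_same))
print(ci_slower, distinguishable(ci_slower))
```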
Predictions, written down before I have results
Two of these are, honestly, no longer wild bets. The observability results above already show naive hooks an order of magnitude over baseline and a well-engineered asynchronous readout in the low single digits. What none of them reports is the number for a calibrated safety-probe library run on every decode token, with the tail-latency statistics an operator would demand before trusting it. So these are falsifiable predictions for my specific configuration, not the general one, and I will be held to them.
1. The naive implementation (Path A) costs 15 to 40 percent of throughput on a 7B model, and almost all of it traces to the graph break and per-step synchronization rather than the probe math itself.
2. The batched on-GPU implementation (Path B) costs under 2 percent of throughput at 100 probes across 10 layers, and under 1 percent in the configurations anyone would actually run.
3. The fused version is statistically indistinguishable from baseline, making it a nice-to-have rather than a requirement.
4. Overhead as a fraction shrinks as model size grows, so the story gets more favorable exactly where the safety stakes are highest.
If prediction 2 holds for the safety-probe case, the last serious version of the objection dies with it. Not weakens. Dies. “We can’t afford to run probes in production” stops being a claim about the technique and becomes, at most, a claim about your implementation and your tolerance for a couple of percent.
What it unlocks
The reason to care about a percentage point of throughput is that it determines the monitoring policy you can honestly defend:
That ladder hides an assumption worth stating plainly: it is a monitoring ladder, not an enforcement one. If all you want is to log a score, decorate a trace, or raise an alert, the readout can leave the critical path entirely, and Path B’s async export costs you nothing. But the moment a probe is allowed to act, to block a tool call or halt a generation, you need its verdict before you commit the token, and “before you commit the token” is a synchronization point by definition. Enforcement puts the score back on the hot path that Path B worked to clear. So the honest ladder has two columns, not one: the overhead of always-on monitoring, which I expect to be trivial, and the overhead of always-on enforcement, which is a harder number and a different benchmark. This series measures the first. The second is its own post, and I suspect its answer is exactly where selective probing earns its keep.
In the March post I sketched a three-layer honesty stack: confessions, chain-of-thought monitoring, and probes. The open question was always whether the probe layer could live in the serving path or had to stay in the lab. The observability community has been answering the general version of that question in public, wiring activation readouts into the serving engines themselves. Activation-level signals used to be the layer with the richest research results and the thinnest production tooling. That is changing fast. What has not been built is the part that matters for safety: not observability in general, but a calibrated, always-on probe layer measured to the standard you would hold a production safety control to.
None of this is hypothetical, and it is further along than I would have guessed. The systems community has built model-internal observability into serving engines and measured its cost: an asynchronous, GPU-resident readout in the low single digits of overhead, the naive hook path an order of magnitude worse, on real hardware with real workloads. The mechanism is settled and the folklore is dead. Two things are still open. Those results report medians for a debugging and observability use case, not the tail-latency statistics you need to run a safety control on every token. And none of them is a calibrated probe library wired to a monitoring policy. The runtime has been measured. The safety primitive that sits on top of it has not.
So that’s what I’m building, on top of the observability substrates the systems people just shipped rather than reinventing them. The harness here is the first milestone: standard tap points, a packed on-GPU readout, a calibrated probe library, latency reported to operator standards, and an adversarial eval asking whether a model that knows it’s being probed can evade detection. That last question is where this connects back to the Red Queen work, and it deserves its own post.
Part 2 will have the numbers, whichever way they land. If I’m wrong about prediction 2, that’s a more interesting post than if I’m right, and it redirects the design toward selective probing of high-stakes actions, which the March post already argued for. Either outcome is a measured number for the case that actually matters, which is the whole point.
If you’re running probes in production today, or you’ve measured any of this and I’ve missed your write-up, I want to hear from you: contact@subhadipmitra.com
References
- Enabling Performant and Flexible Model-Internal Observability for LLM Inference (DMI-Lib). (2026). arXiv:2605.11093
- vLLM Hook v0: A Plug-in for Programming Model Internals on vLLM. (2026). arXiv:2603.06588
- Observation Plugin for Intercepting and Routing on Activations. vLLM RFC. vllm-project/vllm#36998
- A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification. (2026). arXiv:2601.13288