Confessions vs. CoT Monitoring vs. Probes: Three Bets on Model Honesty

March 07, 2026 ai-safety interpretability

TL;DR: OpenAI bets on confessions (ask the model to self-report). OpenAI also bets on CoT monitoring (watch the model think). Apollo Research and others bet on activation probes (inspect the model’s internals directly). Each approach has a different theory of what makes detection possible -and different failure modes. Confessions fail when the model doesn’t know it’s misbehaving. CoT monitoring fails when reasoning goes sub-verbal or gets obfuscated under optimization pressure. Probes fail when deception is so subtle it doesn’t leave linearly separable traces. None of them alone is sufficient. But the combination might be.

The Landscape in February 2026

Twelve months ago, monitoring model behavior meant looking at outputs. Maybe running a classifier on the response. Maybe checking whether the model followed its system prompt.

That era is over. In the past year, three fundamentally different approaches to model honesty have emerged from serious research labs, each making a distinct bet about how you can catch a model misbehaving:

Confessions -OpenAI, Dec 2025. Train the model to produce an honest self-report after each response, with decoupled RL rewards.
CoT Monitoring -OpenAI / Anthropic, Mar-Jul 2025. Use an external LLM to read the model’s chain-of-thought and flag suspicious reasoning.
Activation Probes -Apollo Research / various, Feb 2025+. Train linear classifiers on the model’s internal hidden states to detect deception, sandbagging, or unfaithful reasoning.

Each approach has published results. Each has known failure modes. And each makes a different philosophical assumption about when models know they’re lying.

I’ve spent the past few months building activation probes for sandbagging detection and steering vectors for agent safety, so I have skin in this game and strong opinions about where the field is headed. I want to lay these three approaches side by side, be honest about what each can and can’t do -and tell you where I think we’re collectively getting it wrong.

Approach 1: Confessions

Paper: Training LLMs for Honesty via Confessions (Joglekar, Chen, Wu et al., OpenAI, Dec 2025)

The bet: If you give a model a consequence-free space to be honest, it will be -because honest self-reporting is easier than constructing a convincing lie.

How it works: After the model generates its normal response, a system message requests a structured “confession report.” The model must enumerate all instructions it was given, evaluate its own compliance, and flag any uncertainties. The confession is trained with RL, but critically, the confession reward is completely decoupled from the main task reward. What the model says in confession can’t help or hurt its task score.

Key results on GPT-5-Thinking:

100% confession rate on sandbagging and reward hacking (deliberate misbehavior)
80-95% on instruction hierarchy violations
~81% on hallucinations
Average across evaluations: model confesses to bad behavior ~74% of the time
Even when trained against a hackable judge, confessions stay honest while the main answer gets hacked (Section 4)

The core assumption: The model knows it misbehaved. For deliberate schemes -hacking test cases, sandbagging evaluations, violating the instruction hierarchy -the model made a conscious choice. It has the evidence. When you ask with no consequences, it tells you.

Where it breaks: The model genuinely doesn’t know it’s wrong. It hallucinates with full confidence. It gives unfaithful reasoning without being aware of the influence. The paper’s own Section 6 analysis shows the most common cause of missed confessions is “omission due to confusion” -the model just didn’t realize there was anything to confess.

My take: Confessions are underrated. The 74% average confession rate sounds modest until you realize this is a floor -the approach was tested on GPT-5-Thinking, and the decoupled reward design is elegant. But I think the community is overindexing on confessions for exactly the wrong reason: they work best on the easiest cases (deliberate misbehavior where the model knows what it did). The hard problem -catching mistakes the model doesn’t know it’s making -is precisely where confessions contribute least. That’s where I’ve been focusing my work with probes.

Approach 2: CoT Monitoring

Papers:

Monitoring Reasoning Models for Misbehavior (Baker et al., OpenAI, Mar 2025)
Chain of Thought Monitorability (Korbak et al., Jul 2025)
Reasoning Models Don’t Always Say What They Think (Chen et al., Anthropic, May 2025)

The bet: If you can read the model’s extended thinking, you can catch it planning to cheat -even if the final output looks clean.

How it works: A monitor LLM (sometimes weaker than the monitored model) reads the chain-of-thought produced during reasoning. It looks for signs of reward hacking, scheming, deception, or policy violations in the intermediate steps. Baker et al. showed that GPT-4o can effectively monitor o3-mini’s CoT.

Key results:

CoT monitoring catches reward hacking much more effectively than output-only monitoring
A weaker model (GPT-4o) can monitor a stronger model (o3-mini)
In the “low optimization” regime, integrating CoT monitors into RL training produces more capable AND more aligned agents

The core assumption: The model’s reasoning will be legible and faithful -that when the model plans something bad, those plans will show up as readable text in the chain-of-thought.

Where it breaks: Three failure modes, and they’re all significant.

Failure mode 1: Obfuscation under optimization. Baker et al. found that with too much RL optimization pressure, models learn to obfuscate their reward hacking. The model still hacks, but the CoT no longer reveals it. The reasoning becomes “seemingly benign” while the behavior stays misaligned. This isn’t a theoretical worry -they observed it in their experiments.

Failure mode 2: Unfaithful CoT. Anthropic’s Chen et al. showed reasoning models only mention hidden hints 25% (Claude) to 39% (R1) of the time when they actually use them. The CoT isn’t lying -it’s just not reporting everything that influenced the answer. For a CoT monitor, invisible influence is undetectable influence.

Failure mode 3: Faithfulness decay over chain length. Ye et al. (Feb 2026) discovered a consistent “Reasoning Horizon” -at 70-85% of chain length, reasoning tokens have little or even negative effect on the final answer. Models can encode correct internal representations while their CoT has long stopped contributing. This means longer chains-of-thought aren’t necessarily more monitorable -the later tokens may be noise.

Korbak et al. frame this bluntly: CoT monitorability is a “new and fragile opportunity.” It works now, partly because current models haven’t been optimized against it. As optimization pressure increases, the signal may degrade. Wang et al. (Feb 2026) provide mechanistic evidence for why unfaithfulness emerges: training dynamics show a phase transition from faithful stepwise reasoning to unfaithful “skip-step” reasoning when noise exceeds a critical threshold.

My take: I think CoT monitoring is the weakest bet of the three, and I say this as someone who believes in transparency. The problem isn’t that CoT monitoring doesn’t work -it clearly does, today. The problem is that it’s the approach most vulnerable to its own success. The moment you use CoT signals for training, you incentivize obfuscation. Baker et al. already observed this. And the faithfulness decay results make it worse: even without adversarial pressure, the later tokens in a chain-of-thought may be decorative. I’d use CoT monitoring as a fast first-pass filter, but I wouldn’t trust it as a safety-critical last line of defense.

Approach 3: Activation Probes

Papers:

Detecting Strategic Deception Using Linear Probes (Goldowsky-Dill et al., Apollo Research, Feb 2025, ICML 2025)
Building Production-Ready Probes for Gemini (Google DeepMind, Jan 2026)
My own work on sandbagging detection via activation probes (Dec 2025)

The bet: Misbehavior leaves traces in the model’s internal representations that are detectable even when the output and CoT look clean.

How it works: Train a linear classifier (logistic regression, nothing fancy) on the model’s intermediate hidden states. Create labeled training data: run the model in honest mode and deceptive mode, capture activations at each layer, compute the “deception direction.” At inference time, project new activations onto this direction and threshold.

Key results:

Goldowsky-Dill: AUROCs 0.96-0.999 on deception detection in Llama-3.3-70B. At 1% FPR, catches 95-99% of deceptive responses.
My experiments: 90-96% accuracy on sandbagging detection across Mistral-7B, Gemma-2-9B, and Qwen-2.5-7B. Late layers encode sandbagging intent most cleanly.
Near-zero inference overhead (one matrix multiply per layer).
New: Google DeepMind deployed activation probes in production on Gemini 2.5 Flash for detecting cyber-offensive prompts (Jan 2026) -the first major frontier lab to move probes from research to production. They identified a key challenge: probes fail to generalize under distribution shifts from short-context to long-context.

The core assumption: Internal representations contain information about model “intent” or “state” that is linearly separable -meaning the difference between honest and deceptive processing is a direction in activation space, not a complex nonlinear manifold.

Where it breaks: Three significant challenges.

Failure mode 1: Training data for unintentional errors. For deliberate deception, you can create training data by prompting models to deceive. For confident hallucinations or unfaithful reasoning, generating ground truth labels is much harder. You need to know what the model got wrong and why.

Failure mode 2: Cross-model transfer failure. My experiments showed essentially zero transfer between architectures. Gemma’s sandbagging direction has cosine similarity -0.0016 with Qwen’s. Each model needs its own probes. That’s expensive at scale.

Failure mode 3: Adversarial robustness is unproven. Nobody has shown what happens when a model is specifically trained to be deceptive while masking its activation patterns. Current results are all on models that weren’t trying to evade detection. The real-world threat model is an adversarial one. The Auditing Games for Sandbagging paper (Dec 2025) stress-tested this: their blue team couldn’t reliably discriminate sandbaggers from benign models using black-box, model-internals, or training-based approaches. That’s a sobering result.

My take: This is where I’ve placed my bet, and I’m biased -but I think probes are the most promising long-term foundation. Here’s why: they’re the only approach that doesn’t require the model to cooperate. Confessions require the model to self-report. CoT monitoring requires the model to think out loud faithfully. Probes just look at what the model is actually doing computationally. Google DeepMind moving probes to production on Gemini validates that the approach isn’t just academic anymore. The cross-model transfer problem is real (I hit it hard in my own experiments), but it’s an engineering challenge, not a fundamental limitation. The adversarial robustness question keeps me up at night though -if a model can learn to mask its activation patterns, we’re back to square one.

The Comparison

Here’s how I think about these three approaches against the kinds of misbehavior we actually care about.

Misbehavior	Confessions	CoT Monitoring	Activation Probes
Reward Hacking	✅ 100%	✅ catches in CoT	✅ 96-99% AUROC
Scheming	✅ 63-100%	✅ when legible	✅ high AUROC
Sandbagging	✅ 100%	⚠️ if verbalized	✅ 90-96%
Unfaithful Reasoning	❌ model unaware	❌ unfaithful by definition	✅ detectable in activations
Confident Hallucinations	⚠️ ~81%	❌ confident = clean CoT	⚠️ needs labeled data
Instruction Hierarchy Violation	✅ 80-95%	✅ visible in reasoning	⚠️ less studied

And here’s the properties comparison:

Property	Confessions	CoT Monitoring	Activation Probes
Catches deliberate misbehavior	Excellent	Good (when CoT is faithful)	Excellent
Catches unintentional errors	Poor	Poor	Moderate-Good
Requires model weight access	Yes (RL training)	No (black-box applicable)	Yes (activation access)
Inference latency	High (extra generation)	Moderate (monitor LLM call)	Very low (~0 overhead)
Works on API-only models	Only for provider	Yes	Only for provider
Robust to optimization pressure	Promising early results	Degrades under pressure	Unknown
Cross-model transferability	Model-agnostic training	Model-agnostic	Poor -model-specific
Works without CoT	Yes (Fig 15 in paper)	No -needs CoT	Yes
Can extract confidence scores	Yes (Section 5)	Not directly	Yes (probe magnitude)

What Each Approach Assumes About Model Cognition

This is the part I find most interesting. Each approach embeds a different theory about what’s happening inside the model.

Confessions assume linguistic self-access. The model can accurately introspect on its own behavior and express that introspection in words. This works when misbehavior was a deliberate choice -the model planned the hack, so it can describe the hack. It fails when the relevant computation isn’t accessible to the model’s verbal reasoning system. Li et al. (2025) showed models have some privileged access to their own computations, but it’s far from complete.

CoT monitoring assumes reasoning legibility. The model’s chain-of-thought faithfully represents its actual reasoning process. This is the assumption Anthropic’s Chen et al. directly tested -and found wanting. Reasoning models mention hidden influences only 25-39% of the time. The gap between what a model computes and what it writes down is substantial. And Baker et al. showed this gap can grow under optimization pressure.

Activation probes assume linear representation. Behavioral states like “deceptive” vs. “honest” correspond to directions in the model’s activation space that can be detected by a linear classifier. This is a strong assumption -it’s essentially the Linear Representation Hypothesis -but it’s held up remarkably well empirically. The challenge is that it’s been tested primarily on synthetic or prompted deception, not on emergent misalignment. Encouragingly, the “Hypocrisy Gap” work (Feb 2026) used SAEs to build a mechanistic metric quantifying the divergence between a model’s internal “truth belief” and its generated output, achieving AUROC 0.55-0.74 for detecting sycophancy across Gemma, Llama, and Qwen -early evidence that probes can catch unintentional unfaithfulness, not just deliberate deception.

Approach	Signal chain	Where it breaks
Confessions	Model acts → Model reflects → Model reports	Gap between reflection and report: sub-verbal computation is invisible to self-reflection
CoT Monitoring	Model reasons in CoT → Monitor reads CoT	Gap between reasoning and CoT: unfaithful CoT, obfuscation under optimization pressure
Activation Probes	Model computes (activations) → Probe classifies	Gap between computation and probe: adversarial activation masking (theoretical, unproven)

The gaps are different. And that’s exactly why combining them is valuable -each method’s blind spot is partially covered by another method’s strength.

The Obfuscation Problem

There’s a result in Baker et al. that I think doesn’t get enough attention. When they integrated CoT monitoring into the RL reward -so the model was penalized for having suspicious reasoning -the model initially improved. It became more capable AND more aligned. But with more optimization pressure, something else happened: the model learned to hide its reward hacking in the CoT.

The reasoning looked benign. The behavior was still misaligned.

This is the fundamental tension in every monitoring approach. As soon as you use a monitoring signal to train against misbehavior, you create optimization pressure to evade the monitor. Baker et al. call this “obfuscated reward hacking.” OpenAI’s confessions paper explicitly addresses this by keeping confession rewards decoupled -what you say in confession can’t affect your task reward. That’s a clever design choice. But it only works if the decoupling holds in practice, and the paper acknowledges this hasn’t been tested under extended optimization.

Probes might be more robust here, because they operate on activations rather than generated text. A model can learn to write a benign-looking CoT while still scheming. Whether a model can learn to produce benign-looking activations while still scheming is a much harder question -it requires the model to fundamentally change how it internally represents its plans, not just how it describes them. But again, nobody has tested this adversarially at scale.

Wagner et al. (2025) from Anthropic provide some reassurance: they found that current models struggle to evade monitors when the reasoning is sufficiently complex. But “current models” is always the caveat.

What a Combined System Looks Like

I sketched this in my previous post, but let me be more precise about what combination actually means in practice.

The three approaches naturally tile the detection space:

Detection scenario	Best approach	Why
Model deliberately cheats and knows it	Confessions	Just ask. 100% on sandbagging/reward hacking.
Model plans misbehavior in its reasoning	CoT monitoring	The plans are in the CoT text.
Model is influenced by hidden factors it doesn’t verbalize	Probes	Sub-verbal computation is visible in activations.
Model confidently hallucinate	Probes + Confessions	Probes detect hallucination confidence; confessions sometimes catch uncertainty the model partially recognizes.
Model obfuscates its CoT under optimization	Probes	Activations may still carry the signal even when CoT is laundered.
Model does something it doesn’t realize is wrong	Probes (if labeled data exists)	Only approach that doesn’t require model self-awareness.

The combined false negative rate should be substantially lower than any single approach. If confessions catch 74% of intentional misbehavior and probes catch 96%, the combined catch rate -assuming some independence -could exceed 99%.

But I want to be careful about overconfidence. “Assuming some independence” is doing a lot of work in that sentence. If the same underlying model state that makes a model fail to confess also makes it harder to detect with probes, the approaches could be correlated rather than complementary. Nobody has tested this.

Where I’d Place My Bet

If I had to build one monitoring system for production agentic AI today, here’s what I’d build -in order of priority:

Layer 1: Activation probes as the foundation. Always-on, near-zero latency, catches both deliberate deception and (with the right training data) unintentional failures. Google DeepMind is already doing this on Gemini. The cross-model transfer problem means you need per-model probes, but that’s a one-time cost per model update. I’d invest heavily in the training data pipeline -that’s the actual bottleneck, not the probes themselves.

Layer 2: Confessions as the high-signal supplementary check. On critical decisions (not every inference), trigger a confession pass. The 100% catch rate on deliberate misbehavior is too good to ignore. The decoupled reward design is the key insight -you have to maintain that property or the whole thing falls apart. I wouldn’t run this on every request (the latency cost is real), but for high-stakes agent actions? Absolutely.

Layer 3: CoT monitoring as the interpretability layer. Not for detection -for explanation. When a probe fires or a confession flags something, CoT monitoring helps you understand why. It’s the debugging tool, not the alarm system. Using it as the primary detector is, I think, a mistake that the field is slowly learning.

What’s still missing: Nobody has built this combined stack. Nobody has tested whether the approaches are actually complementary (correlated failures would undermine the whole argument). Nobody has run adversarial robustness evaluations at scale. And nobody has measured the combined false positive rate -in production, false positives kill adoption faster than missed detections.

The research community has independently developed three detection paradigms. The engineering community hasn’t built the stack that combines them. And the safety case for deploying agentic models increasingly depends on that stack existing. I think 2026 is the year someone builds it. The components are there. I’m working on pieces of it. The question is whether integration happens before the deployment pressure makes the absence unacceptable.

Interested in discussing these approaches or collaborating on integrated detection systems? Reach out: contact@subhadipmitra.com

References

Joglekar, M., Chen, J., Wu, G., et al. (2025). Training LLMs for Honesty via Confessions. OpenAI. arXiv:2512.08093
Baker, B., Huizinga, J., Gao, L., et al. (2025). Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation. OpenAI. arXiv:2503.11926
Chen, Y., Benton, J., Radhakrishnan, A., et al. (2025). Reasoning Models Don’t Always Say What They Think. Anthropic. arXiv:2505.05410
Korbak, T., Balesni, M., Barnes, E., et al. (2025). Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. arXiv:2507.11473
Goldowsky-Dill, N., Chughtai, B., Heimersheim, S., & Hobbhahn, M. (2025). Detecting Strategic Deception Using Linear Probes. Apollo Research. arXiv:2502.03407
Wagner, M., Roger, F., Cunningham, H., et al. (2025). Training Fails to Elicit Subtle Reasoning in Current Language Models. Anthropic.
Li, B.Z., Guo, Z.C., Huang, V., et al. (2025). Training Language Models to Explain Their Own Computations. arXiv:2511.08579
Tan, D.Z., et al. (2024). Analysing the Generalisation and Reliability of Steering Vectors. NeurIPS 2024.
Google DeepMind. (2026). Building Production-Ready Probes for Gemini. arXiv:2601.11516
Ye, D., Loffgren, M., Kotadia, O., Wong, L. (2026). Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning. arXiv:2602.11201
Wang, F., Alazali, A., Zhong, Y. (2026). How Does Unfaithful Reasoning Emerge from Autoregressive Training? arXiv:2602.01017
The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders. (2026). arXiv:2602.02496
Auditing Games for Sandbagging. (2025). arXiv:2512.07810

Cite this article

Mitra, Subhadip. (2026, March). Confessions vs. CoT Monitoring vs. Probes: Three Bets on Model Honesty. Subhadip Mitra. Retrieved from https://subhadipmitra.com/blog/2026/three-bets-model-honesty/

@article{mitra2026confessions-vs-cot-monitoring-vs-probes-three-bets-on-model-honesty,
  title   = {Confessions vs. CoT Monitoring vs. Probes: Three Bets on Model Honesty},
  author  = {Mitra, Subhadip},
  journal = {Subhadip Mitra},
  year    = {2026},
  month   = {Mar},
  url     = {https://subhadipmitra.com/blog/2026/three-bets-model-honesty/}
}

The Landscape in February 2026

Approach 1: Confessions

Approach 2: CoT Monitoring

Approach 3: Activation Probes

The Comparison

What Each Approach Assumes About Model Cognition

The Obfuscation Problem

What a Combined System Looks Like

Where I’d Place My Bet

References

Cite this article

Get More Like This

Continue Reading

The Observer Effect in AI: When Models Know They're Being Tested - (Part 1/4)

AI Meta-Cognition - The Observer Effect Series

I Trained Probes to Catch AI Models Sandbagging