Confessions vs. CoT Monitoring vs. Probes: Three Bets on Model Honesty
Three labs. Three different bets on how to catch models misbehaving. Each makes different assumptions about when models 'know' they're lying. Here's what works, what doesn't, and what happens when you combine them.