interpretability
an archive of posts with this tag
| Date | Post |
|---|---|
| Dec 20, 2025 | I Trained Probes to Catch AI Models Sandbagging |
| Dec 18, 2025 | Why Steering Vectors Beat Prompting (And When They Don't) |