binary breakthroughs

Notes on building data platforms, AI systems, and the infrastructure between them

I Trained Probes to Catch AI Models Sandbagging

First empirical demonstration of activation-level sandbagging detection. Linear probes achieve 90-96% accuracy across Mistral, Gemma, and Qwen models. Key finding: sandbagging representations are model-specific, and steering can reduce sandbagging by 20%.

9 min read · December 20, 2025

Why Steering Vectors Beat Prompting (And When They Don't)

I tested activation steering on 4 agent behaviors across 3 models. The results surprised me.

9 min read · December 18, 2025

Why I Built a Spark-Native LLM Evaluation Framework (And What I Learned)

A deep dive into building distributed LLM evaluation infrastructure that actually scales: architectural decisions, trade-offs, and lessons learned.

15 min read · December 15, 2025

The MCP Maturity Model: Evaluating Your Multi-Agent Context Strategy

A practical framework for evaluating your multi-agent context management strategy. From ad-hoc string concatenation to self-evolving context systems: where does your architecture stand?

32 min read · November 19, 2025

From 11% to 88% Peak Bandwidth: Writing Custom Triton Kernels for LLM Inference

A hands-on exploration of writing custom GPU kernels with OpenAI Triton, going from PyTorch's 11% bandwidth utilization to 88% on RMSNorm.

12 min read · June 15, 2025

Making LLMs Faster: My Deep Dive into Speculative Decoding

A deep dive into implementing speculative decoding from scratch, with benchmarks on GPT-2 and extensions to diffusion models.

13 min read · March 20, 2025

All Articles

Reading List

A collected list of research papers, tech blogs, and videos that I follow

6 min read · January 02, 1970

Welcome to my blog

Let's talk tech! I'll post everything from polished pieces to spur-of-the-moment thoughts. And if you've got ideas for posts or want to collaborate, let's connect!

2 min read · January 01, 1970
