Instruments

01 Inference · Making LLMs faster · Memory-bandwidth roofline: Is a decode workload memory-bound or compute-bound? Plug in the model, batch size, precision, and GPU, and watch the operating point cross the ridge.
02 Interpretability · What runtime interpretability actually costs · Activation probe cost: What does running an activation probe at inference actually cost? Compare a linear probe against the forward pass it rides on, by model and probe scope.
03 Architecture · DeepSeek mHC · The manifold dial: Hyper-connections make composite forward gain explode with depth. A few Sinkhorn iterations project the mixing matrix onto the doubly-stochastic manifold and bound it. Drag the dial.
04 Inference · Making LLMs faster · Speculative decoding speedup: A cheap draft model proposes tokens; the target verifies them in one pass. How much speedup that buys depends on the acceptance rate, the draft length, and the cost ratio, and there is an optimal draft length.
05 Adversarial · Cross-generational transfer (extended) · Quality-diversity archive: Optimize one number and search collapses onto the single best exploit. MAP-Elites keeps the best solution in every cell of a behavior space, mapping a diverse archive of vulnerabilities. Run both on the same budget and watch the difference.
06 Agents · Loop Engineering · Convergence is not correctness: An agentic loop retries until a verifier says done, so it converges on whatever passes the check. Whether that is correct is decided by the verifier, not by convergence. Watch refinement and reward-hacking pull against each other.
07 Inference · DSpark in production · Speculation under load: Speculative decoding looks free on an idle GPU and becomes a tax on a busy fleet. Three policies share one capacity budget: plain decoding, fixed-length speculation, and a DSpark-style scheduler that shrinks its verify window as utilization climbs.
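
For a rough feel of the trade-off behind instrument 04, here is a minimal sketch. It assumes the common simplified model in which each drafted token is accepted independently with probability `alpha`, one draft step costs a fraction `c` of one target pass, and verification of the whole draft costs one target pass. The function names `expected_speedup` and `best_draft_length` and the `max_gamma` cap are illustrative and are not taken from the widget itself.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected speedup over plain decoding under the simplified model.

    alpha: per-token acceptance probability (assumed independent per token)
    gamma: draft length, the number of tokens proposed per verify pass
    c:     cost of one draft step relative to one target forward pass
    """
    if alpha >= 1.0:
        # Every drafted token is accepted, plus one token from the verify pass.
        tokens = gamma + 1
    else:
        # Expected tokens produced per iteration: 1 + alpha + ... + alpha^gamma.
        tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost per iteration in units of target passes: gamma drafts + one verify.
    cost = gamma * c + 1
    return tokens / cost


def best_draft_length(alpha: float, c: float, max_gamma: int = 32) -> int:
    """Brute-force the draft length that maximizes expected speedup."""
    return max(range(max_gamma + 1), key=lambda g: expected_speedup(alpha, g, c))


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        g = best_draft_length(alpha, c=0.05)
        print(alpha, g, round(expected_speedup(alpha, g, 0.05), 2))
```

With `gamma = 0` the expression reduces to a speedup of exactly 1, which is plain decoding. Longer drafts add cost linearly while the expected number of accepted tokens saturates, which is why an optimum exists under these assumptions.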

Each instrument is a self-contained widget, also embedded in the post it belongs to. More as I build them.