Results

11% → 88%

of A100 peak bandwidth

Custom Triton RMSNorm kernel

Inference

A hand-written Triton kernel raises RMSNorm, a memory-bound op, from the 11% of peak memory bandwidth that PyTorch achieves to 88%, an 8.1× speedup.

Post Code Try the roofline
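
The utilization figures in this card are a ratio of achieved to peak memory bandwidth. A minimal sketch of that arithmetic follows, assuming an RMSNorm pass reads each input element once, writes each output element once, and reads the weight vector once. The function names, the example shape, the timing, and the peak-bandwidth constant are illustrative assumptions, not measurements from the linked post.

```python
def rmsnorm_min_bytes(rows: int, hidden: int, dtype_bytes: int = 2) -> int:
    """Minimum global-memory traffic for one RMSNorm pass (assumed traffic model):
    read x once, write y once, read the weight vector once."""
    activations = 2 * rows * hidden * dtype_bytes  # x in, y out
    weight = hidden * dtype_bytes                  # per-channel scale
    return activations + weight


def bandwidth_utilization(bytes_moved: float, seconds: float,
                          peak_bytes_per_s: float) -> float:
    """Fraction of peak memory bandwidth achieved by a kernel."""
    return bytes_moved / seconds / peak_bytes_per_s


if __name__ == "__main__":
    PEAK = 1.555e12  # assumption: A100 40GB peak HBM bandwidth in bytes/s
    traffic = rmsnorm_min_bytes(rows=4096, hidden=4096)
    hypothetical_seconds = 50e-6  # placeholder timing, NOT a measured result
    print(f"{bandwidth_utilization(traffic, hypothetical_seconds, PEAK):.1%}")
```

For a memory-bound kernel that moves a fixed amount of data, runtime scales inversely with utilization, which is why going from roughly 11% to roughly 88% corresponds to a speedup of about 8×.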

89–131%

of Megablocks throughput

Fused MoE dispatch (TritonMoE)

Inference

A pure-Triton Mixture-of-Experts kernel matches or beats Stanford's CUDA-optimized Megablocks at inference batch sizes, on both NVIDIA A100 and AMD MI300X, with 35% less global memory traffic and no CUDA.

Paper Post Code

2–3×

end-to-end speedup

Speculative decoding

Inference

Tree speculation, EAGLE, and Medusa combined to speed up memory-bound autoregressive decode, with reference implementations and an acceptance-rate model.

Post Code
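
As a rough illustration of what an acceptance-rate model can look like, the sketch below uses the common single-chain simplification: each of gamma drafted tokens is accepted independently with probability alpha. This is an assumption chosen for illustration. It does not model tree speculation, EAGLE, or Medusa heads, and it may differ from the model in the linked post. The function names and the draft_cost parameter are hypothetical.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass under i.i.d. acceptance:
    1 + alpha + alpha**2 + ... + alpha**gamma."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    draft_cost is (time of one draft pass) / (time of one target pass),
    assumed constant. One step costs gamma draft passes plus one target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost + 1.0)
```

The model makes the trade-off visible: a longer draft raises the expected tokens per step only while alpha is high, and every extra drafted token adds draft_cost to the step.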

90–96%

accuracy

Sandbagging detection probes

Interpretability

Linear probes on hidden states detect deliberate underperformance in Mistral, Gemma, and Qwen before any output is generated. Steering reduced Gemma's sandbagging by 20%.

Post Try the probe cost
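
To make "linear probe on hidden states" concrete, here is a minimal sketch using a difference-of-means probe, one of the simplest linear probes. It is a stand-in chosen for brevity, not necessarily the probe used in the linked post. The data is synthetic Gaussian noise shifted along one axis, standing in for real hidden states, and all function names are hypothetical.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(pos, neg):
    """Return (w, b) for the score w.x + b; a positive score predicts class 'pos'.
    w is the difference of class means; b puts the boundary at their midpoint."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0


def synthetic_hidden_states(n, dim, shift, rng):
    """Toy stand-in for hidden states: unit Gaussian noise, shifted along axis 0."""
    return [[rng.gauss(0, 1) + (shift if j == 0 else 0.0) for j in range(dim)]
            for _ in range(n)]


def demo_accuracy(seed=0):
    rng = random.Random(seed)
    pos = synthetic_hidden_states(200, 8, 4.0, rng)  # e.g. "sandbagging" states
    neg = synthetic_hidden_states(200, 8, 0.0, rng)  # e.g. "honest" states
    # Fit on the first half of each class, evaluate on the held-out second half.
    w, b = fit_mean_difference_probe(pos[:100], neg[:100])
    hits = sum(predict(w, b, x) for x in pos[100:])
    hits += sum(not predict(w, b, x) for x in neg[100:])
    return hits / 200
```

With real models the inputs would be hidden states captured at a chosen layer and token position; the accuracy on this toy data says nothing about the reported results.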

Linear

scaling to cluster size

Spark-LLM-Eval

Evaluation

Distributed, statistically rigorous LLM evaluation: bootstrap confidence intervals, paired significance tests, and content-addressable caching on Delta Lake.

Paper Post
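
A bootstrap confidence interval for a paired comparison can be sketched in a few lines. The version below is a single-machine, plain-Python percentile bootstrap over per-example score differences. It illustrates the statistical idea only and is not the Spark-LLM-Eval implementation; the function name and parameters are hypothetical.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000,
                        confidence=0.95, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b).

    Examples are resampled jointly, so each draw keeps model A's and model B's
    scores on the same example paired. Returns (point_estimate, low, high).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    tail = (1.0 - confidence) / 2.0
    low = means[int(tail * n_resamples)]
    high = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return sum(diffs) / n, low, high
```

If the interval excludes zero, the difference between the two models is unlikely to be resampling noise at the chosen confidence level. Pairing matters because both models are scored on the same examples, which usually tightens the interval relative to treating the two score lists as independent.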

Non-monotonic

safety across generations

Red Queen adversarial evolution

Safety

Quality-diversity (MAP-Elites) red-teaming surfaces diverse, transferable vulnerabilities. A mid-generation model can be more vulnerable than both its predecessor and its successor, a pattern static benchmarks miss.

Paper Workshop Code
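
For readers unfamiliar with MAP-Elites, the core loop is small: keep one best candidate per behavior cell, then repeatedly mutate a stored elite and see whether the child wins its cell. The sketch below runs on a toy numeric domain. In a red-teaming setting the candidate would presumably be an attack prompt, the fitness an attack-success score, and the descriptor an attack category, but that mapping and every name here are assumptions, not the paper's implementation.

```python
import random


def map_elites(fitness, descriptor, mutate, seeds, n_bins, iterations, rng):
    """Minimal MAP-Elites over a 1-D behavior descriptor in [0, 1].

    Returns a dict mapping cell index -> (fitness, candidate), holding the
    best candidate seen so far in each cell. seeds must be non-empty.
    """
    archive = {}

    def consider(candidate):
        cell = min(n_bins - 1, max(0, int(descriptor(candidate) * n_bins)))
        score = fitness(candidate)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)

    for seed_candidate in seeds:
        consider(seed_candidate)
    for _ in range(iterations):
        # Pick a random occupied cell and mutate its elite.
        _, parent = archive[rng.choice(sorted(archive))]
        consider(mutate(parent, rng))
    return archive


def toy_mutate(x, rng):
    """Toy mutation: Gaussian step, clamped to [0, 1]."""
    return min(1.0, max(0.0, x + rng.gauss(0.0, 0.1)))


if __name__ == "__main__":
    archive = map_elites(
        fitness=lambda x: -(x - 0.5) ** 2,  # toy objective
        descriptor=lambda x: x,             # toy behavior descriptor
        mutate=toy_mutate,
        seeds=[0.5],
        n_bins=10,
        iterations=500,
        rng=random.Random(0),
    )
    print(len(archive), "cells filled")
```

Because selection is per cell rather than global, the archive keeps weaker but behaviorally distinct candidates alive, which is what produces a diverse set of elites instead of one optimum.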

If a number here isn't reproducible from a linked repo or paper, it doesn't belong on this page.