Measured

Results

Measured results from the work, each linked to the post, paper, or code that reproduces it. Numbers first, claims second.

11% → 88%
of A100 peak bandwidth

Custom Triton RMSNorm kernel

Inference

A hand-written Triton kernel takes RMSNorm, a memory-bound op, from 11% bandwidth utilization under PyTorch to 88%, an 8.1× speedup.
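For readers unfamiliar with the metric, here is a minimal sketch of how a "percent of peak bandwidth" figure is typically derived for a memory-bound op: count the bytes the op must move, divide by the measured time, and compare against the datasheet peak. The function names, the traffic model (read input, read weight, write output), and the 1.555 TB/s peak (the A100 40GB datasheet figure) are assumptions for illustration; the page does not state which A100 variant or which accounting the linked post uses.

```python
# Sketch only: helper names and the traffic model are illustrative assumptions,
# not the benchmark code behind the numbers on this page.

# Assumed peak: A100 40GB datasheet bandwidth. Swap in the figure for your GPU.
PEAK_BYTES_PER_S = 1.555e12


def rmsnorm_bytes_moved(rows: int, hidden: int, dtype_bytes: int = 2) -> int:
    """Minimum global-memory traffic for one RMSNorm pass.

    Assumes the input is read once, the output is written once,
    and the per-channel weight vector is read once.
    """
    activations = 2 * rows * hidden * dtype_bytes  # read x + write y
    weight = hidden * dtype_bytes                  # read the scale vector
    return activations + weight


def bandwidth_utilization(bytes_moved: float, seconds: float,
                          peak_bytes_per_s: float = PEAK_BYTES_PER_S) -> float:
    """Achieved bandwidth as a fraction of peak (0.88 means 88%)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_moved / seconds / peak_bytes_per_s


if __name__ == "__main__":
    moved = rmsnorm_bytes_moved(rows=8192, hidden=4096, dtype_bytes=2)
    elapsed_s = 1.0e-4  # hypothetical timing, not a measurement from this page
    print(f"{bandwidth_utilization(moved, elapsed_s):.1%} of assumed peak")
```

Because the byte count is fixed by the problem shape, any speedup on a memory-bound kernel shows up directly as a higher utilization fraction.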

89–131%
of Megablocks throughput

Fused MoE dispatch (TritonMoE)

Inference

A pure-Triton Mixture-of-Experts kernel matches or beats Stanford's CUDA-optimized Megablocks at inference batch sizes, on both NVIDIA A100 and AMD MI300X, with 35% less global memory traffic and no CUDA.

2–3×
end-to-end speedup

Speculative decoding

Inference

Tree speculation, EAGLE, and Medusa combined on memory-bound autoregressive decode, with reference implementations and an acceptance-rate model.
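As a rough illustration of what an acceptance-rate model looks like, the sketch below uses the standard closed form for chain (non-tree) speculative decoding: with a per-token acceptance probability `alpha` assumed independent across positions and `gamma` drafted tokens per step, the expected number of tokens emitted per target-model pass is (1 - alpha^(gamma+1)) / (1 - alpha). This is the textbook single-chain model, not necessarily the model used in the linked work; tree speculation and Medusa-style heads change the accounting. The function names and the `draft_cost_ratio` parameter are illustrative.

```python
# Sketch of the standard chain speculative-decoding model.
# Assumes i.i.d. per-token acceptance; names here are illustrative.


def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    alpha: probability that a drafted token is accepted (0..1).
    gamma: number of tokens the draft model proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus one bonus token from the target model.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decode.

    draft_cost_ratio: cost of one draft-model pass relative to one
    target-model pass (e.g. 0.05 for a draft 20x cheaper).
    """
    step_cost = gamma * draft_cost_ratio + 1.0  # gamma drafts + one verify pass
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    for alpha in (0.6, 0.7, 0.8, 0.9):
        print(alpha, round(expected_speedup(alpha, gamma=4, draft_cost_ratio=0.05), 2))
```

The model makes the trade-off explicit: a longer draft only pays off when the acceptance rate is high enough to cover the extra draft passes.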

90–96%
accuracy

Sandbagging detection probes

Interpretability

Linear probes on hidden states detect deliberate underperformance across Mistral, Gemma, and Qwen, before any output is generated. Steering cut Gemma sandbagging by 20%.

Linear
scaling to cluster size

Spark-LLM-Eval

Evaluation

Distributed, statistically rigorous LLM evaluation: bootstrap confidence intervals, paired significance tests, and content-addressable caching on Delta Lake.
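To make "bootstrap confidence intervals" and "paired" comparison concrete, here is a minimal single-machine sketch of a percentile bootstrap over per-example score differences between two models evaluated on the same items. It uses only the Python standard library; the function name, defaults, and sample scores are hypothetical, and it does not reflect Spark-LLM-Eval's actual API or its distributed implementation.

```python
# Sketch only: a single-process percentile bootstrap on paired differences.
# Function name and defaults are illustrative, not Spark-LLM-Eval's API.
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000,
                        confidence=0.95, seed=0):
    """Return (mean_diff, ci_low, ci_high) for the mean of a - b.

    scores_a and scores_b must be aligned: index i is the same eval item
    scored under model A and model B.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)

    # Resample items with replacement and record the mean difference each time.
    boot_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )

    tail = (1.0 - confidence) / 2.0
    lo_idx = int(tail * (n_resamples - 1))
    hi_idx = int((1.0 - tail) * (n_resamples - 1))
    return mean(diffs), boot_means[lo_idx], boot_means[hi_idx]


if __name__ == "__main__":
    # Hypothetical per-item scores, for illustration only.
    model_a = [0.90, 0.80, 0.85, 0.70, 0.95, 0.60, 0.75]
    model_b = [0.85, 0.82, 0.80, 0.65, 0.90, 0.62, 0.70]
    print(paired_bootstrap_ci(model_a, model_b))
```

Resampling the paired differences, rather than each model's scores independently, keeps item difficulty from inflating the interval; if the interval excludes zero, the gap is unlikely to be resampling noise.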

Non-monotonic
safety across generations

Red Queen adversarial evolution

Safety

Quality-diversity (MAP-Elites) red-teaming surfaces diverse, transferable vulnerabilities. A mid-generation model can be more vulnerable than both its predecessor and its successor, a pattern static benchmarks miss.

If a number here isn't reproducible from a linked repo or paper, it doesn't belong on this page.

Subhadip Mitra