Beating CUDA with Triton: A Fused MoE Dispatch Kernel for Mixtral and DeepSeek
I wrote a fused Mixture-of-Experts (MoE) dispatch kernel in pure Triton that beats Stanford's CUDA-optimized MegaBlocks at inference batch sizes and runs on both NVIDIA and AMD GPUs without a single line of CUDA.