Plain language
What this result means
This is the most directly practical result in the ledger. It connects research output to actual runtime: when a model is bottlenecked by small batched matrices, fine-grained MoE, depthwise convolution, or skinny int8 decode, a kernel that fills the GPU or removes a memory pass can matter immediately.
- The same report also records losses. Large square GEMM remains cuBLAS territory, and a generic W4A16 kernel is much slower than tinygemm.
- The kernel speedups are measured as ratios under paired, interleaved timing, because the GPU was shared and absolute timings were noisy.
- The op-count records and the kernel records are separate. Fewer arithmetic operations did not automatically make a faster GPU kernel.
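The paired, interleaved ratio measurement mentioned above can be sketched as follows. This is a minimal illustration, not the report's harness: `paired_speedup`, its parameters, and the choice of the median are assumptions made for the example. On a GPU, each callable would also need to block until its work has finished (for example by calling `torch.cuda.synchronize()` before returning), because kernel launches are asynchronous.

```python
import statistics
import time


def paired_speedup(baseline, candidate, pairs=50, timer=time.perf_counter):
    """Return the median of per-pair ratios baseline_time / candidate_time.

    Hypothetical helper: each pair runs both callables back-to-back, and the
    order alternates between pairs so slow drift on a shared machine does not
    systematically favor one side.
    """
    ratios = []
    for i in range(pairs):
        # Even pairs run the baseline first, odd pairs run the candidate first.
        order = (baseline, candidate) if i % 2 == 0 else (candidate, baseline)
        elapsed = {}
        for fn in order:
            start = timer()
            fn()  # must not return until the work is actually done
            elapsed[fn] = timer() - start
        # A ratio above 1.0 means the candidate was faster in this pair.
        ratios.append(elapsed[baseline] / elapsed[candidate])
    # The median is less sensitive than the mean to a few disturbed pairs.
    return statistics.median(ratios)
```

Reporting the ratio per pair, rather than dividing two separately averaged timings, means background load that slows both runs in a pair by a similar factor largely cancels out.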
Visual notes
How to read the result
Result table
Measured L40S speedups for small batched GEMM, fused depthwise conv, int8 decode, MoE, and GQA decode.
| Cell | Baseline | Numaro speed vs. baseline | Outcome | Note |
|---|---|---|---|---|
| Batched matmul | torch.bmm | 2-4.5x | faster | small/medium matrices, batch >= 512 |
| Fused depthwise 3x3 | cuDNN + activation | 1.55x fp16 / 1.31x fp32 | faster | bit-exact |
| int8 W8A8 | torch._int_mm | up to 3.7x | faster | skinny-M decode |
| MoE grouped GEMM | per-expert loop | 2.5-3.6x end-to-end | faster | fine-grained MoE |
| W4A16 | tinygemm | ~12x slower | loss | kept as a boundary condition |
Method
How it was found
Each kernel targets a specific production gap: not enough occupancy, an avoidable memory pass, a missing primitive, or a launch-heavy loop.
- Mapped where cuBLAS/cuDNN/PyTorch under-filled the GPU or forced extra memory traffic.
- Wrote narrow Triton kernels for those exact regimes instead of trying to beat vendor libraries everywhere.
- Timed stock and custom kernels back-to-back under a GPU timing lock.
- Kept losses in the report so the boundary of the method is visible.
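The report does not describe how the GPU timing lock is implemented. One common way to serialize timing runs on a shared machine is an advisory file lock, sketched below under that assumption. The name `timing_lock` and the lock-file path are hypothetical, and `fcntl.flock` is available on Unix-like systems only.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def timing_lock(path):
    """Hold an exclusive advisory lock on `path` while the block runs.

    Hypothetical helper: every process that times kernels must use the same
    path, otherwise the lock protects nothing.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Blocks until no other process holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        # Closing the descriptor also releases the lock.
        os.close(fd)
```

A timing run would then wrap both the stock and the custom measurement in a single `with timing_lock("/tmp/gpu0-timing.lock"):` block (the path is illustrative), so the two sides of a comparison never straddle another job's timed region.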
Verification
How it was checked
Each subdirectory has a verifier that checks correctness before speed. Some kernels are bit-identical; others are compared against a higher-precision or stock reference with an explicit tolerance where reduction order differs.
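The two kinds of check described above can be illustrated with a small, CPU-only sketch. It is not the report's verifier: the function names and the default tolerances are assumptions chosen for the example. It shows why a changed reduction order calls for a tolerance rather than a bit-for-bit comparison.

```python
import math
import struct


def bit_identical(out, ref):
    """True when both sequences hold floats with identical IEEE-754 bits.

    Comparing packed bytes distinguishes 0.0 from -0.0 and treats identical
    NaN payloads as equal, which `==` on floats does not.
    """
    return len(out) == len(ref) and all(
        struct.pack("<d", x) == struct.pack("<d", y) for x, y in zip(out, ref)
    )


def within_tolerance(out, ref, rtol=1e-6, atol=1e-9):
    """True when every element matches the reference within rtol/atol."""
    return len(out) == len(ref) and all(
        math.isclose(x, y, rel_tol=rtol, abs_tol=atol) for x, y in zip(out, ref)
    )


def ordered_sum(values):
    """Plain left-to-right accumulation, so the reduction order is explicit."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


# Same inputs, different reduction order: the rounding differs in the last bit.
forward = [ordered_sum([0.1, 0.2, 0.3])]
backward = [ordered_sum([0.3, 0.2, 0.1])]

exact_match = bit_identical(forward, backward)      # False: bits differ
close_match = within_tolerance(forward, backward)   # True: within tolerance
```

A kernel that keeps the reference's reduction order can be held to `bit_identical`; one that reorders the reduction can only be held to `within_tolerance`, with the tolerance stated explicitly.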
Scope
What is not being claimed
All numbers are on one L40S under the stated environment. They are not claims for every GPU, every shape, or cross-library SOTA. Large square GEMM and 4-bit weight-only decode are explicitly not beaten.
Citation
How to cite
Numaro Autoresearch Team. "Bit-exact GPU kernels in regimes vendor libraries leave open." Numaro Research Report NUMARO-2026-003, 2026.
@techreport{numaro2026FasterMlKernels,
title = {Bit-exact GPU kernels in regimes vendor libraries leave open},
author = {Numaro Autoresearch Team},
institution = {Numaro},
number = {NUMARO-2026-003},
year = {2026},
url = {https://numaro.tech/research/faster-ml-kernels-2026/}
}