Bit-exact GPU kernels in regimes vendor libraries leave open

Numaro Autoresearch Team

Plain language

What this result means

This is the most directly practical result in the ledger, because it ties research output to actual runtime. When a model is bottlenecked by small batched matrix multiplies, fine-grained MoE, depthwise convolution, or skinny int8 decode, a kernel that fills the GPU or removes a memory pass speeds the model up directly.

The same report also records losses. Large square GEMM remains cuBLAS territory, and a generic W4A16 kernel is much slower than tinygemm.
The kernel speedups are measured as ratios under paired, interleaved timing, because the GPU was shared and absolute timings were noisy.
The op-count records and the kernel records are separate. Fewer arithmetic operations did not automatically make a faster GPU kernel.

Visual notes

How to read the result

**Regime map** (figure: horizontal bar chart of measured kernel speedups by regime, including an honest W4A16 loss). The big numbers live in specific gaps. The chart also includes the W4A16 loss to show where the method does not apply.

**Amdahl check** (figure: horizontal bar chart showing operator fraction of layer time and resulting end-to-end speedups). End-to-end speedup tracks how much of the layer the swapped operator occupies. Fine-grained MoE benefits most because the grouped GEMM is almost the whole layer.
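The Amdahl check can be reproduced with a few lines of arithmetic. The sketch below is not taken from the report: the function name `end_to_end_speedup` and the input values are illustrative assumptions, chosen only to show how the operator's share of layer time caps the end-to-end gain.

```python
def end_to_end_speedup(operator_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: layer-level speedup when only one operator gets faster.

    operator_fraction: share of layer time spent in the swapped operator (0..1).
    kernel_speedup:    how many times faster the replacement kernel runs.
    """
    if not 0.0 <= operator_fraction <= 1.0:
        raise ValueError("operator_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    untouched = 1.0 - operator_fraction            # time the swap does not change
    accelerated = operator_fraction / kernel_speedup  # time left in the new kernel
    return 1.0 / (untouched + accelerated)


if __name__ == "__main__":
    # Hypothetical inputs, not measurements from the report.
    for fraction in (0.25, 0.50, 0.95):
        print(f"fraction={fraction:.2f} -> {end_to_end_speedup(fraction, 4.0):.2f}x")
```

With the kernel speedup held fixed, the end-to-end figure rises with the operator fraction, which is the pattern the chart describes for fine-grained MoE.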

Result table

Measured L40S speedups for small batched GEMM, fused depthwise conv, int8 decode, MoE, and GQA decode.

Cell	Baseline	Numaro vs. baseline	Outcome	Note
Batched matmul	torch.bmm	2-4.5x	faster	small/medium matrices, batch >= 512
Fused depthwise 3x3	cuDNN + activation	1.55x fp16 / 1.31x fp32	faster	bit-exact
int8 W8A8	torch._int_mm	up to 3.7x	faster	skinny-M decode
MoE grouped GEMM	per-expert loop	2.5-3.6x end-to-end	faster	fine-grained MoE
W4A16	tinygemm	~12x slower	loss	kept as a boundary condition

Method

How it was found

Each kernel targets a specific production gap: not enough occupancy, an avoidable memory pass, a missing primitive, or a launch-heavy loop.

Mapped where cuBLAS/cuDNN/PyTorch under-filled the GPU or forced extra memory traffic.
Wrote narrow Triton kernels for those exact regimes instead of trying to beat vendor libraries everywhere.
Timed stock and custom kernels back-to-back under a GPU timing lock.
Kept losses in the report so the boundary of the method is visible.
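The back-to-back timing step can be sketched as follows. This is a minimal illustration rather than the report's harness: `paired_speedup`, the `sync` hook, and the optional `lock` argument are hypothetical names. On a GPU, `sync` would typically be something like `torch.cuda.synchronize`, and `lock` would be whatever cross-process lock guards the device; both are assumptions here.

```python
import contextlib
import statistics
import time


def paired_speedup(baseline, candidate, pairs=50, warmup=5,
                   sync=lambda: None, lock=None):
    """Median of per-pair time ratios (baseline / candidate) from interleaved runs."""
    lock = lock if lock is not None else contextlib.nullcontext()

    def timed(fn):
        sync()                         # drain pending work before starting the clock
        start = time.perf_counter()
        fn()
        sync()                         # wait until the timed work has finished
        return time.perf_counter() - start

    with lock:                         # hold the (assumed) timing lock for the whole run
        for _ in range(warmup):        # warm caches and any lazy compilation
            baseline()
            candidate()
        ratios = []
        for i in range(pairs):
            # Alternate which kernel goes first so slow drift hits both equally.
            if i % 2 == 0:
                t_base, t_cand = timed(baseline), timed(candidate)
            else:
                t_cand, t_base = timed(candidate), timed(baseline)
            ratios.append(t_base / max(t_cand, 1e-12))  # guard against a zero reading
    return statistics.median(ratios)


if __name__ == "__main__":
    # CPU stand-ins for a stock and a custom kernel.
    slow = lambda: sum(range(200_000))
    fast = lambda: sum(range(20_000))
    print(f"speedup ~ {paired_speedup(slow, fast):.1f}x")
```

Taking the ratio within each pair, and only then the median, means a noisy neighbour that slows both runs in a pair largely cancels out, which is why ratios are more stable than absolute timings on a shared GPU.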

Verification

How it was checked

Each subdirectory has a verifier that checks correctness before speed. Some kernels are bit-identical; others are compared against a higher-precision or stock reference with an explicit tolerance where reduction order differs.
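The two kinds of check can be sketched as below. This is not the report's verifier: the helper names and default tolerances are assumptions, and plain Python floats stand in for kernel outputs. It shows why a kernel that changes reduction order needs a tolerance, while a bit-identical kernel can be held to exact equality of bit patterns.

```python
import math
import struct


def bit_identical(out, ref):
    """True when both sequences hold exactly the same float64 bit patterns."""
    if len(out) != len(ref):
        return False
    pack = struct.Struct(f"<{len(out)}d").pack
    # Comparing bytes (not values) distinguishes 0.0 from -0.0.
    return pack(*out) == pack(*ref)


def within_tolerance(out, ref, rtol=1e-3, atol=1e-5):
    """True when every element matches the reference within rtol/atol."""
    return len(out) == len(ref) and all(
        math.isclose(o, r, rel_tol=rtol, abs_tol=atol) for o, r in zip(out, ref)
    )


if __name__ == "__main__":
    # Same three addends, two reduction orders.
    left_to_right = (0.1 + 0.2) + 0.3
    right_to_left = 0.1 + (0.2 + 0.3)
    print(bit_identical([left_to_right], [right_to_left]))     # bit patterns differ
    print(within_tolerance([left_to_right], [right_to_left]))  # still within tolerance
```

A verifier along these lines would run the correctness check first and refuse to report a speedup for any kernel that fails it.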

Scope

What is not being claimed

All numbers are on one L40S under the stated environment. They are not claims for every GPU, every shape, or cross-library SOTA. Large square GEMM and 4-bit weight-only decode are explicitly not beaten.

Citation

How to cite

Numaro Autoresearch Team. "Bit-exact GPU kernels in regimes vendor libraries leave open." Numaro Research Report NUMARO-2026-003, 2026.

@techreport{numaro2026FasterMlKernels,
  title = {Bit-exact GPU kernels in regimes vendor libraries leave open},
  author = {Numaro Autoresearch Team},
  institution = {Numaro},
  number = {NUMARO-2026-003},
  year = {2026},
  url = {https://numaro.tech/research/faster-ml-kernels-2026/}
}