TL;DR

We have some very fast AI-generated kernels written in pure CUDA-C, without using libraries or DSLs such as CUTLASS and Triton. They perform close to, and in some cases even beat, the standard expert-optimized production kernels shipped in PyTorch. Some of our highlighted results:

- Matmul (FP32): 101.3% of the performance of FP32 torch.matmul; problem size: 4096x4096 square matrices
- Conv2D: 179.9% of the performance of FP32 torch.nn.Conv2d; problem size: (100, 3, 224, 224) input tensor, conv(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2)
- Softmax: 111.8% of the performance of FP32 torch.softmax; problem size: (4096, 65536) input tensor
- LayerNorm: 484.4% of the performance of FP32 torch.nn.LayerNorm; problem size: (16, 64, 256, 256) input tensor
- Conv2D + ReLU + MaxPool: 290.1% of the performance of the FP32 torch reference and 189.0% of the FP32 torch.compile() reference; problem size: (100, 3, 224, 224) input tensor, conv(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2), maxpool(kernel_size=3, stride=2)

(Our results are benchmarked on an NVIDIA L40S GPU, and % performance is defined as reference time divided by generated kernel time.)

"Untiled" by DALL·E (2025). Digital pigment on virtual canvas. From the MMA collection.

Intro

We started with the goal of generating synthetic data to train better kernel-generation models. Somewhere along the way, the unexpected happened: the test-time-only synthetic data generation itself started producing really good kernels, beating or performing close to human-expert-optimized PyTorch baselines while using advanced optimizations and hardware features that were previously thought to be challenging to achieve. As a result, we decided to write this blog post early and share our findings. The point of this post isn't a novel methodology; in fact, our synthetic data generation design is simple. What's surprising is that it is already showing promise.
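To make the metric concrete, here is a minimal, hypothetical sketch of how "% performance" (reference time divided by generated-kernel time) could be computed from wall-clock timings. The helper names (bench, pct_performance) and the timing scheme are our own illustration, not the benchmark harness actually used for the numbers above (which were measured on GPU).

```python
import time

def bench(fn, *args, warmup=3, iters=10):
    """Return the best wall-clock time in seconds over `iters` runs,
    after a few warmup calls (a common, simple CPU-timing scheme;
    GPU kernels would instead need device synchronization or CUDA events)."""
    for _ in range(warmup):
        fn(*args)
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def pct_performance(ref_time, kernel_time):
    """% performance = reference time / generated kernel time * 100.
    Above 100% means the generated kernel is faster than the reference."""
    return 100.0 * ref_time / kernel_time

# Example: a generated kernel taking half the reference time scores 200%.
print(pct_performance(2.0, 1.0))  # -> 200.0
```

Under this definition, the 484.4% LayerNorm result means the generated kernel ran in roughly 1/4.8 of the reference kernel's time on the same problem size.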
In this post, we’re sharing the method, five optimized kernels (4 foundati...