Context

Computer vision is now powered by two workhorse architectures: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). CNNs slide a feature extractor (a stack of convolutions) over the image to produce the final, usually lower-resolution, feature map on which the task is performed. ViTs, on the other hand, cut the image into patches from the start and run stacks of self-attention over all the patches, again leading to a final feature map of lower resolution.

It is often stated that, because of the quadratic cost of self-attention, ViTs are not practical at higher resolutions. As the most prominent example, Yann LeCun, Godfather of CNNs, has stated exactly this. However, I believe this criticism is a misguided knee-jerk reaction: in practice, ViTs scale perfectly fine up to at least 1024x1024 px, which is enough for the vast majority of usage scenarios for image encoders.

In this article, I make two points:

- ViTs scale just fine up to at least 1024x1024 px.
- For the vast majority of uses, that resolution is more than enough.

ViTs scale just fine with resolution

First, I set out to quantify the inference speed of plain ViTs and CNNs on a range of current GPUs. To give this benchmark as wide an appeal as possible, I step away from my usual JAX+TPU toolbox and benchmark with PyTorch on a few common GPUs. I use models from timm, the de-facto standard vision model repository, and follow PyTorch best practices for benchmarking and performance by using torch.compile. I further sweep over dtype (float32, float16, bfloat16), attention implementation (sdpa_kernel), and matmul precision (set_float32_matmul_precision), and take the best setting among all of these for each measurement. Since I am quite rusty in PyTorch, here is my full benchmarking code, and I'll be glad to take feedback.
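To make the setup concrete, here is a minimal sketch of the kind of timing loop described above, not my actual benchmark script: it builds a timm ViT, compiles it, and times forward passes with CUDA events. The model name, resolution, batch size, and the particular dtype/matmul-precision settings shown are placeholders standing in for a single point of the sweep, and a CUDA GPU is assumed.

```python
import torch
import timm

def bench(model_name="vit_base_patch16_224", res=1024, batch=1,
          dtype=torch.bfloat16, warmup=10, steps=50):
    """Time forward passes of a timm model at a given resolution (assumes a CUDA GPU)."""
    # One point of the sweep described above: matmul precision, dtype, torch.compile.
    torch.set_float32_matmul_precision("high")
    # img_size is passed through to the timm ViT constructor (assumed supported
    # by the chosen model family).
    model = timm.create_model(model_name, pretrained=False, img_size=res)
    model = model.cuda().to(dtype).eval()
    model = torch.compile(model)

    x = torch.randn(batch, 3, res, res, device="cuda", dtype=dtype)

    with torch.inference_mode():
        # Warmup triggers compilation/autotuning, so it is excluded from the timing.
        for _ in range(warmup):
            model(x)
        torch.cuda.synchronize()

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(steps):
            model(x)
        end.record()
        torch.cuda.synchronize()

    ms_per_img = start.elapsed_time(end) / (steps * batch)
    print(f"{model_name} @ {res}x{res}: {ms_per_img:.2f} ms/img")

if __name__ == "__main__":
    bench()
```

The same loop, repeated over dtypes, attention backends, and matmul precisions, with the best result kept per model and resolution, is the measurement protocol the numbers below refer to.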