'I paid for the whole GPU, I am going to use the whole GPU'

Typical attire of a GPU utilization maximizer.

Graphics Processing Units, or GPUs, are the hottest mathematical co-processor since the FM synthesis chips that shaped the sounds of the 1990s. Like all co-processors, they are chosen when the performance of more flexible commodity hardware, like an x86 Central Processing Unit (CPU), is insufficient. GPUs are designed in particular for problems where CPUs cannot achieve the desired throughput of mathematical operations (in particular, matrix multiplications). But GPUs are not cheap: high performance can command a high price.

Combined, the high price, performance sensitivity, and throughput orientation of GPU applications mean that a large number of engineers and technical leaders find themselves concerned with GPU utilization in some form or another: "we're paying a lot, so we'd better be using what we're paying for."

At Modal, we have our own GPU utilization challenges to solve, and we help our users solve theirs. We've noticed that the term "GPU utilization" gets used to mean very different things by people solving problems at different parts of the stack. So we put together this article to share our framework for thinking about GPU utilization across the stack, along with the tips and tricks we've learned along the way. In particular, we'll talk about three very different things that all get called "GPU utilization."

We'll focus specifically on neural network inference workloads: neural networks because they are the workload generating the most demand right now, and inference because, unlike training, inference is a revenue center, not a cost center. We're betting on the revenue center.

What is utilization?

Utilization = Output achieved ÷ Capacity paid for

Utilization relates the available capacity of a system to that system's output. In throughput-oriented systems like GPU applications, the capacity paid for is often a bandwidth (e.g. the arithmetic bandwidth) and the output achieved is then a throughput (e.g. floating point operations per second).
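As a minimal sketch of that formula, the snippet below divides an achieved throughput by the capacity paid for. The specific numbers are illustrative assumptions rather than measurements from this article: 312 TFLOP/s is roughly an A100's dense BF16 peak, and 95 TFLOP/s is a made-up sustained throughput.

# A minimal sketch of the utilization formula above: output achieved
# divided by capacity paid for. The peak and achieved numbers below are
# illustrative assumptions, not measurements.

def utilization(output_achieved: float, capacity_paid_for: float) -> float:
    """Return utilization as a fraction of the capacity paid for."""
    return output_achieved / capacity_paid_for

achieved_tflops = 95.0   # assumed sustained throughput of the workload
peak_tflops = 312.0      # approximate A100 dense BF16 peak, used as capacity

print(f"Utilization: {utilization(achieved_tflops, peak_tflops):.1%}")  # ~30.4%

The same ratio applies whichever capacity you care about: swap in memory bandwidth, allocated GPU-hours, or peak FLOP/s for the denominator and the corresponding achieved figure for the numerator.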
