How to Scale Your Model: How to Think About GPUs

What Is a GPU?

A modern ML GPU (e.g. an H100 or B200) is basically a bunch of compute cores that specialize in matrix multiplication (called Streaming Multiprocessors, or SMs) connected to a stick of fast memory (called HBM). Here’s a diagram:

Figure: a diagram showing the abstract layout of an H100 or B200 GPU. An H100 has 132 SMs, while a B200 has 148. We use the term “Warp Scheduler” somewhat broadly to describe a set of 32 CUDA SIMD cores and the scheduler that dispatches work to them.

Note how much this looks like a TPU! Each SM, like a TPU’s Tensor Core, has a dedicated matrix multiplication core (unfortunately also called a Tensor Core), a vector arithmetic unit (which we call the Warp Scheduler), and a fast on-chip cache (called SMEM). The naming is confusing on both counts: the GPU Tensor Core is the matrix multiplication sub-unit of the SM, while the TPU TensorCore is the umbrella unit that contains the MXU, VPU, and other components; and NVIDIA doesn’t have a good name for the vector unit, so we use “Warp Scheduler” as the best of several bad options, meaning both the control unit that dispatches work to a set of CUDA cores and the set of cores it controls.

Unlike a TPU, which has at most 2 independent “Tensor Cores”, a modern GPU has more than 100 SMs (132 on an H100). Each of these SMs is much less powerful than a TPU Tensor Core, but the system overall is more flexible. Each SM is more or less totally independent, so a GPU can do hundreds of separate tasks at once (although SMs are independent, they are often forced to coordinate for peak performance, because they all share a capacity-limited L2 cache).

Let’s take a more detailed view of an H100 SM:

Figure: a diagram of an H100 SM (source) showing the 4 subpartitions, each containing a Tensor Core, Warp Scheduler, Register File, and sets of CUDA Cores of different precisions. The “L1 Data Cache” near the bottom is the 256kB SMEM unit. A B200 looks similar, but adds a substantial amount of Tensor Memory (TMEM) for feeding the Tensor Cores.
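To make that mapping concrete: in CUDA, each thread block is scheduled onto a single SM, the threads in a block execute in warps of 32 dispatched by the warp schedulers, and `__shared__` arrays live in the SM’s SMEM. Here is a minimal sketch (the kernel name, block size, and the reduction it performs are our own illustrative choices, not from any particular library):

```c
// Hypothetical kernel: sums 256 floats per block, purely to illustrate
// how the CUDA programming model maps onto the hardware above.
// Launch as, e.g.: blockSum<<<numBlocks, 256>>>(in, out);
__global__ void blockSum(const float* in, float* out) {
    __shared__ float tile[256];  // one copy per block, allocated in the SM's SMEM
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[gid];
    __syncthreads();  // barrier across all warps (groups of 32 threads) in the block

    // Tree reduction within the block; each step is issued warp-by-warp
    // by the SM's warp schedulers.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}
```

Because blocks share nothing but HBM and the L2 cache, the runtime is free to place them on whichever SMs are idle: launch a few thousand blocks and an H100’s 132 SMs will work through them independently, which is exactly the flexibility described above.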
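The hardware parameters above (SM count, warp size, SMEM capacity) can also be read off the device at runtime with the standard CUDA runtime API; a small sketch:

```c
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Expect 132 SMs on an H100 and 148 on a B200; warp size is 32.
    // sharedMemPerMultiprocessor reports the shared-memory-usable portion
    // of the SM's 256kB L1/SMEM (somewhat less than the full 256kB).
    printf("%s: %d SMs, warp size %d, %zu bytes of SMEM per SM\n",
           prop.name, prop.multiProcessorCount, prop.warpSize,
           prop.sharedMemPerMultiprocessor);
    return 0;
}
```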
