Thinking Machines – Modular Manifolds

https://news.ycombinator.com/rss Hits: 21
Summary

When we train large neural networks, we need to keep them healthy. We do not want the tensors in the network, whether weights, activations, or gradients, to grow too large or too small. Very small and very large tensors cause a variety of problems, not limited to numerical underflow and overflow. For example, weight matrices that change size during training make it harder to design training algorithms, since the relative size of updates to weights has a significant impact on the speed of learning.

The gold standard for keeping tensors healthy is to normalize them. Normalization is commonplace for activation vectors, where we use techniques like layer norm to put the activations on a good scale before passing them to the next layer. It is also commonplace to normalize gradient updates, where we can interpret fast training algorithms like the Muon optimizer as spectrally normalizing the updates. Normalization provides us with certainty about the sizes of tensors (without needing to check Wandb!), and when training large neural networks with many interacting components, having certainty about the network internals is valuable.

Normalization is less commonly applied to weight matrices, although it is not unheard of. For example, the EDM2 diffusion model codebase uses weight constraints, and the authors report benefits in their paper. Various other techniques have been proposed but are not common practice in modern large-scale training. For some examples, see Salimans et al., 2016, Miyato et al., 2018, and our paper Liu et al., 2021.

Normalizing the weight matrices might be a good idea for a few reasons. Weight constraints make it easier to understand the relative size of optimization updates. They remove the problem of weight norms exploding. They allow us to focus hyperparameter tuning effort on the tensors whose size matters most. They can force matrices to have a small condition number, making their behaviour more predictable. And relatedly, weight constraints facilitate Lipschitz ...
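To make the activation-side normalization mentioned above concrete, here is a minimal layer-norm sketch in PyTorch; the tensor shapes and epsilon are illustrative assumptions, not details from the post:

```python
import torch

def layer_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Normalize each activation vector to zero mean and unit variance
    # over its last dimension, so the next layer always sees inputs
    # on a known scale regardless of what the previous layer produced.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(4, 512)  # a batch of activation vectors
y = layer_norm(x)
print(y.mean(dim=-1))                   # ~0 per row
print(y.var(dim=-1, unbiased=False))    # ~1 per row
```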
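The reading of Muon as "spectrally normalizing the updates" can likewise be sketched by forcing every singular value of an update matrix to one. The explicit SVD below is a simplified stand-in: Muon itself approximates this with a Newton-Schulz iteration rather than computing an SVD:

```python
import torch

def spectrally_normalize(update: torch.Tensor) -> torch.Tensor:
    # Replace the update's singular values with 1, so the update has the
    # same known size in every direction. Muon reaches a similar result
    # with a cheap Newton-Schulz iteration instead of an explicit SVD.
    u, _, vh = torch.linalg.svd(update, full_matrices=False)
    return u @ vh

g = torch.randn(256, 128)  # raw gradient for a weight matrix
step = spectrally_normalize(g)
print(torch.linalg.svdvals(step)[:3])  # all singular values are ~1
```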
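And for the weight-side constraints the summary ends on, one simple scheme, a sketch of the general idea rather than the EDM2 or Modular Manifolds recipe, is to project each weight matrix back to a fixed norm after every optimizer step; the target norm and the Frobenius-norm choice here are assumptions for illustration:

```python
import torch

@torch.no_grad()
def renormalize_weights(model: torch.nn.Module, target_norm: float = 1.0) -> None:
    # After each optimizer step, rescale every weight matrix so its
    # Frobenius norm stays fixed. Weight norms can then never explode,
    # and the relative size of each update stays easy to reason about.
    for p in model.parameters():
        if p.ndim == 2:  # weight matrices only; skip biases and gains
            p.mul_(target_norm / (p.norm() + 1e-12))

model = torch.nn.Linear(128, 64)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss = model(torch.randn(8, 128)).pow(2).mean()
loss.backward()
opt.step()
renormalize_weights(model)   # project weights back onto the constraint
print(model.weight.norm())   # ~1.0
```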

First seen: 2025-09-26 17:19

Last seen: 2025-09-27 13:22