PyTorch Monarch

https://news.ycombinator.com/rss Hits: 28
Summary

We now live in a world where ML workflows (pre-training, post-training, etc.) are heterogeneous, must contend with hardware failures, and are increasingly asynchronous and highly dynamic. Traditionally, PyTorch has relied on an HPC-style multi-controller model, where multiple copies of the same script are launched across different machines, each running its own instance of the application (often referred to as SPMD).

ML workflows are becoming more complex: pre-training might combine advanced parallelism with asynchrony and partial failure, while the RL models used in post-training require a high degree of dynamism with complex feedback loops. While the logic of these workflows may be relatively straightforward, they are notoriously difficult to implement well in a multi-controller system, where each node must decide how to act based on only a local view of the workflow’s state.

We believe that the long-term sustainable way to address this is a single-controller programming model, in which a single script orchestrates all distributed resources, making them feel almost local. This architectural shift simplifies distributed programming: your code looks and feels like a single-machine Python program, but can scale across thousands of GPUs. You can directly use Pythonic constructs (classes, functions, loops, tasks, futures) to express complex distributed algorithms.

We’re excited to introduce Monarch, a distributed programming framework that brings the simplicity of single-machine PyTorch to entire clusters. Monarch lets you program distributed systems the way you’d program a single machine, hiding the complexity of distributed computing.

Program clusters like arrays. Monarch organizes hosts, processes, and actors into scalable meshes that you can manipulate directly. You can operate on entire meshes (or slices of them) with simple APIs; Monarch handles the distribution and vectorization automatically, so you can think in terms of what you want to compute, not where the code ...
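To make the single-controller idea concrete, here is a minimal sketch using only the Python standard library, not Monarch’s actual API: a single driver script owns a pool of worker processes, broadcasts work to all of them (or to a slice, in the spirit of addressing a mesh or a slice of it), and collects results through futures. The names train_step and world_size are illustrative, and the "loss" is a stand-in value.

# Hypothetical sketch of the single-controller pattern using the standard
# library only; this is NOT Monarch code, just the control-flow idea.
from concurrent.futures import ProcessPoolExecutor

def train_step(rank: int, step: int) -> float:
    # Stand-in for per-worker work (e.g., a forward/backward pass on one GPU).
    return 0.1 * rank + 0.01 * step

def main() -> None:
    world_size = 8
    with ProcessPoolExecutor(max_workers=world_size) as pool:
        for step in range(3):
            # The controller decides what every worker does; workers never
            # need their own local view of the global workflow state.
            futures = [pool.submit(train_step, rank, step) for rank in range(world_size)]
            losses = [f.result() for f in futures]
            print(f"step {step}: mean loss {sum(losses) / world_size:.3f}")

        # "Slicing": address only a subset of workers, here ranks 0-3.
        subset = [pool.submit(train_step, rank, 99) for rank in range(4)]
        print([f.result() for f in subset])

if __name__ == "__main__":
    main()

In Monarch itself the workers would be actors spread over process meshes spanning many hosts and GPUs rather than a local process pool, but the control flow reads the same way: ordinary Python loops and futures in one orchestrating script.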

First seen: 2025-10-23 11:30

Last seen: 2025-10-24 14:37