Open software capabilities for training and inference

The real value of hardware is unlocked by co-designed software. AI Hypercomputer's software layer helps AI practitioners and engineers move faster with open and popular ML frameworks and libraries such as PyTorch, JAX, vLLM, and Keras. For infrastructure teams, that translates to faster delivery times and more cost-efficient resource utilization. We've made significant advances in software for both AI training and inference.

Pathways on Cloud: Pathways, developed by Google DeepMind, is the distributed runtime that powers Google's internal large-scale training and inference infrastructure, and it is now available for the first time on Google Cloud. For inference, it includes features like disaggregated serving, which dynamically scales the prefill and decode stages of an inference workload on separate compute units, each scaling independently to deliver ultra-low latency and high throughput. This capability is available to customers through JetStream, our high-throughput, low-latency inference library. Pathways also enables elastic training, allowing training workloads to automatically scale down on failure and scale back up on recovery while preserving continuity (illustrative sketches of both patterns appear at the end of this section). To learn more about Pathways on Cloud, including additional use cases for the Pathways architecture, read the documentation.

Train models with high performance and reliability

Training workloads are highly synchronized jobs that run across thousands of nodes. A single degraded node can disrupt an entire job, resulting in longer time-to-market and higher costs. To provision a cluster quickly, you need VMs that are tuned for specific model architectures and located in close proximity to one another. You also need to predict and troubleshoot node failures quickly and to ensure workload continuity when a failure does occur.

Cluster Director for GKE and Cluster Director for Slurm: Cluster Director (formerly Hypercompute Cluster) lets you deploy and manage a group of accelerators...
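
To make the disaggregated-serving idea concrete, here is a minimal Python sketch that separates the prefill and decode stages onto independently sized worker pools and hands a per-request cache between them. It is purely illustrative: the model, cache, and pool names are hypothetical and are not the Pathways or JetStream APIs.

```python
# Illustrative sketch of disaggregated serving: prefill and decode run on
# separately sized worker pools so each stage can scale independently.
# All names here (KVCache, prefill_pool, decode_pool) are hypothetical and
# do not correspond to the Pathways or JetStream APIs.

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import random

@dataclass
class KVCache:
    """Stands in for the per-request attention cache handed off between stages."""
    prompt: str
    state: list

def prefill(prompt: str) -> KVCache:
    # Compute-heavy pass over the whole prompt; produces the cache that the
    # decode stage needs. Simulated here with a toy "state".
    return KVCache(prompt=prompt, state=[len(tok) for tok in prompt.split()])

def decode(cache: KVCache, max_new_tokens: int = 4) -> str:
    # Latency-sensitive loop that emits one token at a time from the cache.
    tokens = []
    for _ in range(max_new_tokens):
        tokens.append(str(sum(cache.state) % 97))
        cache.state.append(random.randint(0, 9))
    return cache.prompt + " -> " + " ".join(tokens)

# Separate pools: prefill is compute-bound, decode is memory-bandwidth-bound,
# so the two are sized independently, which is the core idea of disaggregation.
prefill_pool = ThreadPoolExecutor(max_workers=2)   # sized for prompt throughput
decode_pool = ThreadPoolExecutor(max_workers=8)    # sized for token latency

def serve(prompt: str) -> str:
    cache = prefill_pool.submit(prefill, prompt).result()   # stage 1
    return decode_pool.submit(decode, cache).result()       # stage 2

if __name__ == "__main__":
    for result in map(serve, ["explain disaggregated serving", "what is a kv cache"]):
        print(result)
```

In a real deployment the two pools would be separate accelerator groups and the cache handoff would be a device-to-device transfer, but the scaling decision stays the same: size each stage for its own bottleneck.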
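Similarly, the sketch below illustrates the elastic-training pattern in plain Python: the loop keeps training on however many workers are currently healthy and checkpoints periodically so a restart loses only the work since the last checkpoint. The failure signal, checkpoint store, and train_step are simulated stand-ins, not the actual Pathways runtime.

```python
# Toy sketch of elastic training: the job keeps making progress when the
# healthy worker count drops and scales back up when capacity returns.
# healthy_workers, train_step, and the checkpoint store are simulated
# placeholders, not the Pathways elastic-training API.

def train_step(params: float, batch: int, num_workers: int) -> float:
    # Pretend gradient update; more workers process more data per step.
    return params + 0.01 * num_workers * (batch % 3)

def save_checkpoint(step: int, params: float, store: dict) -> None:
    store["step"], store["params"] = step, params

def load_checkpoint(store: dict) -> tuple[int, float]:
    return store.get("step", 0), store.get("params", 0.0)

def healthy_workers(step: int, max_workers: int) -> int:
    # Simulated capacity signal: a failure window between steps 30 and 60.
    return max_workers // 2 if 30 <= step < 60 else max_workers

def elastic_train(total_steps: int = 100, max_workers: int = 8) -> float:
    store: dict = {}
    step, params = load_checkpoint(store)
    while step < total_steps:
        workers = healthy_workers(step, max_workers)  # scale to what is healthy
        params = train_step(params, batch=step, num_workers=workers)
        step += 1
        if step % 10 == 0:
            save_checkpoint(step, params, store)  # a restart can resume from here
            print(f"step {step:3d}: {workers} workers, params={params:.3f}")
    return params

if __name__ == "__main__":
    elastic_train()
```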