Building a 30 PB storage cluster in the heart of SF

https://news.ycombinator.com/rss Hits: 27
Summary

We built a storage cluster in downtown SF to store 90 million hours worth of video data. Why? We’re pretraining models to solve computer use. Compared to text LLMs like LLaMa-405B, which require ~60 TB of text data to train, videos are sufficiently large that we need 500 times more storage. Instead of paying the $12 million / yr it would cost to store all of this on AWS, we rented space from a colocation center in San Francisco to bring that cost down ~40x to $354k per year, including depreciation. Why Our use case for data is unique. Most cloud providers care highly about redundancy, availability, and data integrity, which tends to be unnecessary for ML training data. Since pretraining data is a commodity—we can lose any individual 5% with minimal impact—we can handle relatively large amounts of data corruption compared to enterprises who need guarantees that their user data isn’t going anywhere. In other words, we don’t need AWS’s 13 nines of reliability; 2 is more than enough. Additionally, storage tends to be priced substantially above cost. Most companies use relatively small amounts of storage (even ones like Discord still use under a petabyte for messages), and the companies that use petabytes are so large that storage remains a tiny fraction of their total compute spend. Data is one of our biggest contraints, and would be prohibitively expensive otherwise. As long as the cost predictions work out in favor of a local datacenter, and it would not consume too much of the core team’s time, it would make sense to stack hard drives ourselves. [1] 1. We talked to some engineers at the Internet Archive, which had basically the same problem as us; even after massive friends & family discounts on AWS, it was still 10 times more cost-effective to buy racks and store the data themselves! The Cost Breakdown: Cloud Alternatives vs In-House Internet and electricity total $17.5k as our only recurring expenses (the price of colocation space, cooling, etc were bundled into el...

First seen: 2025-10-01 15:43

Last seen: 2025-10-02 17:49