Starting in May, we had a series of feature launches with agentic AI partners that gained far more momentum than we predicted. In two short timespans, the rate of new database creation increased more than 5x, and the rate of branch creation increased more than 50x. While we were humbled by the uptick, the burst in operational load put significant strain on the Neon platform, manifesting as more incidents over those two months than in the entire year before.

We understand that databases are some of the most critical operational infrastructure for our customers, and stability is paramount. This should never have happened. I am embarrassed by it and sorry for the pain it caused our customers and our team. In this blog post, we explain the underlying causes of these incidents and what we are doing to avoid these categories of incidents in the future.

The May incidents were caused by us hitting a scaling limit on the number of active databases in US regions before our solution (Cells) was ready. Every active database on Neon is a running pod in a Kubernetes cluster. Our testing of Kubernetes showed service degradation beyond 10,000 concurrent databases. Among multiple issues discovered in testing, we approached the EKS etcd memory limit of 8GB, and pod start times missed our targets. In addition, in our us-east-1 cluster, our network configuration limited us to ~12,000 concurrently active databases.

In January 2025, we forecast that we would hit these limits by the end of the year. While we believe that we can iterate on our Kubernetes configuration to scale vertically beyond these limits, there are many reliability and resiliency advantages to a horizontally-scaled architecture. As a result, we started working on a horizontally-scalable architecture called “Cells” where each region can have multiple Neon deployments. This was a major project requiring substantial changes to the Terraform code that provisions regions.
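To make the capacity arithmetic concrete, here is a minimal sketch of how the limits above combine and how a cell count might be estimated. It is an illustration only, not Neon's actual tooling: the function names, the 20% headroom, and the projected load of 25,000 databases are all assumptions. Only the 10,000 and ~12,000 figures come from the text.

```python
# Figures quoted in the post.
DEGRADATION_THRESHOLD = 10_000    # concurrent databases where testing showed degradation
US_EAST_1_NETWORK_LIMIT = 12_000  # approximate network-imposed ceiling in us-east-1


def effective_limit(*limits: int) -> int:
    """The binding constraint on a cluster is the smallest individual limit."""
    return min(limits)


def cells_needed(projected_active: int, per_cell_limit: int, headroom_pct: int = 20) -> int:
    """Estimate how many cells keep each one below its limit minus headroom.

    headroom_pct is a hypothetical safety margin (assumption, not from the post).
    """
    if not 0 <= headroom_pct < 100:
        raise ValueError("headroom_pct must be in [0, 100)")
    usable = per_cell_limit * (100 - headroom_pct) // 100
    # Ceiling division using integers only, to avoid floating-point rounding.
    return max(1, -(-projected_active // usable))


limit = effective_limit(DEGRADATION_THRESHOLD, US_EAST_1_NETWORK_LIMIT)
# Hypothetical projected load of 25,000 concurrently active databases.
print(limit, cells_needed(25_000, limit))
```

Under these assumptions the degradation threshold, not the network limit, is the binding constraint, which is why adding cells rather than raising a single limit addresses the problem.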
We planned this project,...