We Uncovered a Race Condition in Aurora RDS

Summary

Much of the developer world is familiar with the AWS outage in us-east-1 that occurred on October 20th due to a race condition bug inside a DNS management service. The backlog of events we needed to process from that outage stretched our system to its limits, so we decided to increase our headroom for event handling throughput. When we attempted that infrastructure upgrade on October 23rd, we ran into yet another race condition bug, this time in Aurora RDS. This is the story of how we figured out it was an AWS bug (later confirmed by AWS) and what we learned.

Background

The Hightouch Events product enables organizations to gather and centralize user behavioral data such as page views, clicks, and purchases. Customers can set up syncs to load events into a cloud data warehouse for analytics or stream them directly to marketing, operational, and analytics tools to support real-time personalization use cases.

Here is the portion of Hightouch's architecture dedicated to our events system:

[Figure: Hightouch events system architecture]

Our system scales on three levers: Kubernetes clusters that contain event collectors and batch workers, Kafka for event processing, and Postgres as our virtual queue metadata store.

When our pagers went off during the AWS outage on the 20th, we observed:

- Services were unable to connect to Kafka brokers managed by AWS MSK.
- Services struggled to autoscale because we couldn't provision new EC2 nodes.
- Customer functions for real-time data transformation were unavailable due to AWS STS errors, which caused our retry queues to balloon in size.

Kafka's durability meant that no events were dropped once they were accepted by the collectors, but there was a massive backlog to process. Syncs with consistently high traffic, or with enrichments that needed to call slower third-party services, took longer to catch up and were testing the limits of our (small) Postgres instance's ability to act as a queue for the batch metadata.

As an aside, at Hightouch, we start wi...

First seen: 2025-11-14 18:52

Last seen: 2025-11-15 13:55