Gandi March 9, 2025 incident postmortem

Summary

On Sunday 2025-03-09, Gandi experienced a major incident on its platform caused by a filer storage system outage affecting multiple services, including mailboxes.

– Multiple services were severely disrupted from Sunday March 9th 00:31:10 until 16:49:15, including 39% of all mailboxes.
– Some mailboxes (~15%) remained unavailable until Monday March 10th 10:29. However, all users had recovered all of their emails by Wednesday March 12th 17:00:00.
– Importantly, this incident did not result in the loss or corruption of any data.

What was the root cause of the incident?

The main cause was the failure of an SSD storage filer. However, several additional factors contributed to the severity of the impact:

– Some systems, including internal monitoring, lacked effective redundancy measures to cope with the storage disruption.
– Some systems which did have redundancy at the VM level were incorrectly architected so that all the VMs relied on the single impacted filer.
– Some systems that were redundant at both the VM and storage level were not provisioned with enough capacity to handle the increased load when one of the instances failed.
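The last two contributing factors (VM-level redundancy that still shares one filer, and redundant instances without enough spare capacity) are the kind of thing a simple audit can flag. The following is a minimal sketch only, not Gandi's tooling: the function names, the VM-to-filer mapping, and all capacity figures are hypothetical and chosen purely for illustration.

```python
def survives_instance_loss(per_instance_capacity, peak_load, instances, failures=1):
    """Return True if the instances left after `failures` losses can carry peak_load.

    All inputs are in the same arbitrary unit (e.g. requests per second).
    """
    remaining = instances - failures
    if remaining <= 0:
        return False
    return remaining * per_instance_capacity >= peak_load


def single_filer_dependency(vm_to_filer):
    """Return the filer name if every VM of a service sits on the same filer, else None.

    `vm_to_filer` maps a VM name to the storage filer backing it (hypothetical names).
    """
    filers = set(vm_to_filer.values())
    if len(filers) == 1:
        return next(iter(filers))
    return None


# Hypothetical service: two instances of capacity 100 each, peak load 150.
# Losing one instance leaves capacity 100, which is below the peak load.
print(survives_instance_loss(100, 150, instances=2))

# Hypothetical service: redundant at the VM level, but both VMs share one filer.
print(single_filer_dependency({"vm-1": "filer-a", "vm-2": "filer-a"}))
```

A service passes this kind of check only if it survives the loss of an instance and its VMs do not all resolve to the same filer; either condition failing corresponds to one of the factors listed above.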
Full timeline:

Time stamps (UTC)    | Event
2025-03-09 00:31:10  | Incident started, and on-call responders began investigating over 1500 alerts; difficult to know what was the root cause, and the monitoring bot was unavailable
2025-03-09 01:11:19  | Incident was escalated and CTO responded
2025-03-09 01:21:51  | Public status published on status.gandi.net with the first impacted services identified
2025-03-09 01:23:31  | Attempt to declare incident via ChatOps tooling
2025-03-09 01:25:15  | VPN outage identified for non Ops team employees
2025-03-09 01:33:03  | Problem identified: a filer has crashed
2025-03-09 01:34:46  | Filer restart attempted
2025-03-09 01:47:09  | Filer restart failed
2025-03-09 02:16:21  | Responder dispatched to datacenter
2025-03-09 03:31:11  | First report from datacenter – filer restarted manually after power disconnection
2025-03-09 04:03:05  | Attempted restart failed to resol...
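To read the timeline as elapsed time rather than wall-clock time, the timestamps can be subtracted from the incident start. The sketch below uses only a few of the entries listed above; the event labels are shortened, and the helper name `elapsed_since_start` is an illustrative choice, not part of the original postmortem.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# A subset of the timeline entries (UTC), copied from the table above.
timeline = [
    ("2025-03-09 00:31:10", "Incident started"),
    ("2025-03-09 01:11:19", "Incident escalated, CTO responded"),
    ("2025-03-09 01:21:51", "Public status published"),
    ("2025-03-09 01:33:03", "Problem identified: a filer has crashed"),
]


def elapsed_since_start(events):
    """Return (label, timedelta since the first event) for each event."""
    start = datetime.strptime(events[0][0], FMT)
    return [(label, datetime.strptime(ts, FMT) - start) for ts, label in events]


for label, delta in elapsed_since_start(timeline):
    print(f"+{delta}  {label}")
```

With these entries, escalation comes about 40 minutes after the start, and the crashed filer is identified a little over an hour in.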
