What Now? Handling Errors in Large Systems

Summary

More options means more choices.

Cloudflare’s deep postmortem for their November 18 outage triggered a ton of online chatter about error handling, caused by a single line in the postmortem:

.unwrap()

If you’re not familiar with Rust, you need to know about Result, a type (an enum, strictly speaking) that can contain either a successful result or an error. unwrap says basically “return the successful result if there is one, otherwise crash the program” [1]. You can think of it like an assert.

There’s a ton of debate about whether asserts are good in production [2], but most of it is missing the point. Quite simply, this isn’t a question about a single program. It’s not a local property. Whether asserts are appropriate for a given component is a global property of the system, and the way it handles data.

Let’s play a little error handling game. For each scenario, decide whether crashing the process or server is appropriate.

- One of ten web servers behind a load balancer encounters uncorrectable memory errors, and takes itself out of service.
- One of ten multi-threaded application servers behind a load balancer encounters a null pointer in business logic while processing a customer request.
- One database replica receives a logical replication record from the primary that it doesn’t know how to process.
- One web server receives a global configuration file from the control plane that appears malformed.
- One web server fails to write its log file because of a full disk.

There are three unifying principles behind my answers here.

Are failures correlated? If the decision is a local one that’s highly likely to be uncorrelated between machines, then crashing is the cleanest thing to do. Crashing has the advantage of reducing the complexity of the system, by removing the “working in degraded mode” state.
On the other hand, if failure...
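
For readers who don’t write Rust, here is a minimal sketch of the difference between calling unwrap on a Result and handling the error explicitly. It is not Cloudflare’s code. The function names (`parse_feature_count`, `feature_count_or_last_good`) and the “last known good value” fallback are hypothetical, chosen only for illustration. Whether falling back or crashing is the right call depends on the system-level questions above, not on anything visible in this snippet.

```rust
use std::num::ParseIntError;

/// Hypothetical parser: reads a single numeric setting from a config line.
/// Returns Ok(value) on success, or Err(..) if the text is not a valid u32.
fn parse_feature_count(raw: &str) -> Result<u32, ParseIntError> {
    raw.trim().parse::<u32>()
}

/// Explicit handling: on a parse error, log it and keep the previous value.
/// (The fallback policy is an assumption made for this sketch.)
fn feature_count_or_last_good(raw: &str, last_good: u32) -> u32 {
    match parse_feature_count(raw) {
        Ok(n) => n,
        Err(err) => {
            eprintln!("rejecting malformed value {raw:?}: {err}");
            last_good
        }
    }
}

fn main() {
    // Success case: unwrap simply returns the value inside Ok.
    let count = parse_feature_count("42").unwrap();
    println!("parsed {count}");

    // Failure case, handled: the process keeps running with the old value.
    let count = feature_count_or_last_good("not a number", count);
    println!("kept {count}");

    // Failure case, unhandled: uncommenting the next line would panic,
    // which behaves much like a failed assert.
    // parse_feature_count("not a number").unwrap();
}
```

Both versions see exactly the same error. The only difference is the decision about what to do next, and that decision is the one the scenarios above ask you to make.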

First seen: 2025-11-26 01:27

Last seen: 2025-11-26 02:27