A year ago, Mihail Eric wrote a blog post detailing his experiences working on AI inside Amazon: How Alexa Dropped the Ball on Being the Top Conversational System on the Planet. It’s a great first-person account, with lots of detail about the issues that kept Amazon from keeping up with its peers in the LLM space. From my perspective, Eric’s post makes a great case study in what resilience engineering researchers call brittleness, a term they use for a kind of opposite of resilience.

In the paper Basic Patterns in How Adaptive Systems Fail, the researchers David Woods and Matthieu Branlat note that brittle systems tend to suffer from the following three patterns:

Decompensation: exhausting capacity to adapt as challenges cascade
Working at cross-purposes: behavior that is locally adaptive but globally maladaptive
Getting stuck in outdated behaviors: the world changes but the system remains stuck in what were previously adaptive strategies (over-relying on past successes)

Eric’s post demonstrates how all three of these patterns were evident within Amazon.

Decompensation

“It would take weeks to get access to any internal data for analysis or experiments… Experiments had to be run in resource-limited compute environments. Imagine trying to train a transformer model when all you can get a hold of is CPUs. Unacceptable for a company sitting on one of the largest collections of accelerated hardware in the world.”

If you’ve ever seen a service fall over after receiving a spike in external requests, you’ve seen a decompensation failure. This happens when a system isn’t able to keep up with the demands that are placed upon it.

In organizations, you can see the decompensation failure pattern emerge when decision-making is very hierarchical: you end up having to wait for the decision request to make its way up to someone who has the authority to make the decision, and then make its way down again.
In the meantime, the world isn’t...
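To make the service analogy concrete, here is a toy sketch of a queue with a fixed capacity per tick. It is not from Eric’s post or from the Woods and Branlat paper; the function name simulate_backlog and all of the numbers are made up for illustration. The point is only that once demand stays above capacity, the backlog grows without bound, which is roughly what decompensation looks like in a service.

```python
def simulate_backlog(arrivals, capacity):
    """Return the queue backlog after each tick.

    arrivals: requests arriving per tick (hypothetical numbers).
    capacity: maximum requests the service can finish per tick.
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        backlog += arriving                # new demand joins the queue
        backlog -= min(backlog, capacity)  # serve as much as capacity allows
        history.append(backlog)
    return history


# A short spike: the backlog jumps, then drains slowly once demand drops.
spike = [8, 8, 30, 8, 8, 8, 8, 8]
# Sustained overload: demand stays above capacity, so the backlog keeps growing.
overload = [8, 8, 12, 12, 12, 12, 12, 12]

print(simulate_backlog(spike, capacity=10))     # [0, 0, 20, 18, 16, 14, 12, 10]
print(simulate_backlog(overload, capacity=10))  # [0, 0, 2, 4, 6, 8, 10, 12]
```

In the spike case the system has spare capacity to work off the backlog; in the overload case it has none, so every tick leaves it further behind. A slow approval chain behaves like the second case whenever requests arrive faster than decisions come back.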