ARC338: How AWS Minimizes the Blast Radius of Failures

Published: Dec. 1, 2018, midnight

At AWS, we obsess over operational excellence. We have a deep understanding of system availability, informed by over a decade of experience operating the cloud and our roots in operating Amazon.com for nearly a quarter-century. One thing we've learned is that failures come in many forms, some expected and some unexpected. It's vital to embrace failure and design for it from the ground up. A core consideration is how to minimize the "blast radius" of any failure. In this talk, we discuss a range of blast radius reduction design techniques that we employ, including cell-based architecture, shuffle-sharding, Availability Zone independence, and Region isolation. We also discuss how blast radius reduction infuses our operational practices.
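
To make one of the techniques named above concrete, here is a minimal sketch of the general idea behind shuffle-sharding: each customer is deterministically assigned a small subset of a shared worker pool, so a problem caused by one customer touches only that subset. This is an illustration under stated assumptions, not how AWS implements it. The names `shuffle_shard`, `WORKERS`, and `SHARD_SIZE`, the pool size of 8, and the use of a SHA-256 hash to seed the selection are all hypothetical choices made for the example.

```python
import hashlib
import math
import random
from itertools import combinations

# Hypothetical pool: 8 workers, each customer gets a shard of 2.
WORKERS = [f"worker-{i}" for i in range(8)]
SHARD_SIZE = 2


def shuffle_shard(customer_id: str, workers: list[str], shard_size: int) -> frozenset[str]:
    """Deterministically pick `shard_size` workers for a customer.

    The customer ID is hashed to seed a private random generator, so the
    same customer always lands on the same workers (an assumption of this
    sketch, not a documented AWS mechanism).
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return frozenset(rng.sample(workers, shard_size))


def fully_overlapping_pairs(customers: list[str]) -> list[tuple[str, str]]:
    """Return pairs of customers whose shards are identical.

    Only such pairs share their entire blast radius: if one of them takes
    down all of its workers, the other loses all of its workers too.
    """
    shards = {c: shuffle_shard(c, WORKERS, SHARD_SIZE) for c in customers}
    return [(a, b) for a, b in combinations(customers, 2) if shards[a] == shards[b]]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(10)]
    for c in customers:
        print(c, sorted(shuffle_shard(c, WORKERS, SHARD_SIZE)))
    # Number of distinct shards available: "8 choose 2".
    print("distinct shards:", math.comb(len(WORKERS), SHARD_SIZE))
    print("fully overlapping pairs:", fully_overlapping_pairs(customers))
```

With 8 workers and shards of size 2 there are 28 possible shards, compared with 4 if the pool were simply split into fixed pairs. Two customers picked at random are therefore less likely to share every worker, which is the property that shrinks the blast radius, provided clients retry against the other worker in their shard.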