"Until you attempt to internalize some of this, you forfeit credibility in commenting on large-scale outages: programming.oreilly.com/2013/05/what-i… #devops" (John Allspaw, @allspaw, May 31, 2013)
…the risk of Amazon going down is mitigated by building a system of redundant barriers (several server centers, backup, active fire extinguishing, etc.). This might seem like a tidy solution, but here we run into two problems with this probabilistic approach to risk: the view of the human operating the system and the increased complexity that comes as a result of introducing more and more barriers.
In short, every barrier added to protect against failure also adds complexity to the system, and that complexity can in turn cause new failures. Building resilient systems isn’t about finding every risk and protecting against every possible scenario; it’s about understanding your risks and the trade-offs involved in protecting against them.
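To make that trade-off concrete, here is a toy calculation in Python. It is only a sketch under invented assumptions, not a model of any real system: each barrier is assumed to fail independently with probability `p_barrier_fails`, and each barrier is also assumed to carry a small chance, `p_complexity_fault`, of causing an outage by itself simply because it is one more thing that can go wrong. The function names and all the numbers are hypothetical.

```python
def outage_probability(barriers: int,
                       p_barrier_fails: float,
                       p_complexity_fault: float) -> float:
    """Toy model (assumed, not measured): an outage happens if every
    barrier fails at once, or if the complexity added by any single
    barrier triggers a fault of its own."""
    all_barriers_fail = p_barrier_fails ** barriers
    no_complexity_fault = (1.0 - p_complexity_fault) ** barriers
    return 1.0 - (1.0 - all_barriers_fail) * no_complexity_fault


def best_barrier_count(max_barriers: int,
                       p_barrier_fails: float,
                       p_complexity_fault: float) -> int:
    """Return the number of barriers (1..max_barriers) with the lowest
    outage probability under the toy model above."""
    return min(
        range(1, max_barriers + 1),
        key=lambda n: outage_probability(n, p_barrier_fails, p_complexity_fault),
    )


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    for n in range(1, 7):
        print(n, round(outage_probability(n, 0.1, 0.005), 4))
```

With these made-up numbers the outage probability drops sharply for the first few barriers, bottoms out at three, and then climbs again as the complexity term starts to dominate. Real systems are messier, since failures are rarely independent and the cost of complexity is hard to quantify, but the shape of the curve is the point: more protection does not always mean less risk.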