Complexity and Resiliency
Until you attempt to internalize some of this, you forfeit credibility in commenting on large-scale outages: programming.oreilly.com/2013/05/what-i… #devops
— John Allspaw (@allspaw) May 31, 2013
I saw this tweet from @allspaw, who always shares good stuff about Ops. The linked article makes some great points. For me, the most important one is:
…the risk of Amazon going down is mitigated by building a system of redundant barriers (several server centers, backup, active fire extinguishing, etc.). This might seem like a tidy solution, but here we run into two problems with this probabilistic approach to risk: the view of the human operating the system and the increased complexity that comes as a result of introducing more and more barriers.
Basically, protecting against failure increases system complexity, which can in turn lead to more failures. Building resilient systems isn’t about finding every risk and protecting against every possible scenario; it’s about understanding your risks and the trade-offs involved in protecting against them.