Complexity and Resiliency

Published May 31, 2013 in Software, Technology, Ops

I saw a tweet from @allspaw, who always shares good stuff about Ops. The linked article makes some great points. For me, the most important one is:

…the risk of Amazon going down is mitigated by building a system of redundant barriers (several server centers, backup, active fire extinguishing, etc.). This might seem like a tidy solution, but here we run into two problems with this probabilistic approach to risk: the view of the human operating the system and the increased complexity that comes as a result of introducing more and more barriers.

Basically, protecting against failure increases system complexity, which can in turn lead to more failures. Building resilient systems isn’t about finding every risk and protecting against every possible scenario; it’s about understanding your risks and the trade-offs involved in protecting against them.

How to Speak to a Technical Person

Published May 30, 2013 in Misc