Categories
MIL Guide to Technology Product Management

The Mother-in-Law’s Guide to Chaos Engineering

In this post, I’m trying to take something technical and make it (mostly) readable for my mother-in-law. Enjoy!

One big trend, especially for internet companies like Facebook, Google and Netflix, is not to have one massive computer anymore. This is an oversimplification but computers used to be one big expensive box. The faster the computer you needed, the more money you spent. But eventually, the computers became too expensive to possibly meet the needs of today’s internet companies. So Netflix (and others) started stitching together these large supercomputers out of many smaller and cheaper computers by connecting them in these clever ways.

The benefits of doing this are pretty amazing because they allow you to get this supercomputer that can do incredible things that are very low cost. The problem is with each of the smaller computers. Because they’re so cheap, they can fail at any time. This means that Netflix has computers failing constantly. But customers don’t see this happening. So how does Netflix get this to work?

Netflix needs to make sure that of all its computers and systems are resilient. Using a car metaphor, Netflix is always able to swap out a spare tire if one gets a flat. On their blog, Netflix explains how they test this tire changing/computer failing problem:

Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.

This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.

In addition to Chaos Monkey, Netflix has a number of other members of the Simian Army. The Netflix descriptions of these fellows is a bit technical:

Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.