The Mother-in-Law’s Guide to Software Testing

This is part of my Mother-in-Law’s Guide to Technology. My Mother-in-Law is a very smart woman even if she isn’t a “computer person.” The goal of this series is to take some big and treacherous sounding ideas and bring them down to earth.

Dearest Mother-in-Law,

Remember when you had kids and you told them to do stuff? And remember how they did what you told them, but it wasn’t always what you intended them to do? Well, that’s the way computer programs work.

Just like kids, computer programs will do what you tell them, but beyond that, all bets are off. They don’t do anything that directly contradicts what you said but that doesn’t mean they’ll do what you want them to do.

The Mother-in-Law’s Guide to Cloud Computing

This is part of my “Mother-in-Law’s Guide to Technology.” My Mother-in-Law is a very smart woman even if she isn’t a “computer person.” The goal of this post is to take a very big and treacherous sounding idea and bring it down to earth. I tried this before in a post which I’ve now renamed The Mother-In-Law’s Guide to Chaos Engineering.

Dearest Mother-in-Law,

You know when we visit a Target or a Wal-Mart in the suburbs and they have 30 checkout lanes and only 3 are open at any time? I always wondered why that happens. It even sparked someone to write a funny blog post about the phenomenon: Target Store Opens More than Three Checkout Lanes; Shoppers Confused.

The Mother-in-Law’s Guide to Chaos Engineering

In this post, I’m trying to take something technical and make it (mostly) readable for my mother-in-law. Enjoy!

One big trend, especially for internet companies like Facebook, Google and Netflix, is not to have one massive computer anymore. This is an oversimplification, but a computer used to be one big expensive box. The faster the computer you needed, the more money you spent. But eventually, a single box big enough to meet the needs of today’s internet companies became impossibly expensive. So Netflix (and others) started stitching together large supercomputers out of many smaller, cheaper computers connected in clever ways.

The benefits of doing this are pretty amazing: you get a supercomputer that can do incredible things at very low cost. The problem is with each of the smaller computers. Because they’re so cheap, they can fail at any time. This means that Netflix has computers failing constantly. But customers don’t see this happening. So how does Netflix get this to work?

Netflix needs to make sure that all of its computers and systems are resilient. Using a car metaphor, Netflix must always be able to swap in a spare tire if one goes flat. On their blog, Netflix explains how they test this tire changing/computer failing problem:

Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.

This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.
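For the more technical readers, here is a toy sketch of the idea in Python. It is not Netflix’s actual tool: the `Instance` class, the `chaos_monkey` and `serve` functions, and the server names are all made up for illustration. It simply knocks out one random "computer" and checks that a request still gets answered by one of the others.

```python
import random


class Instance:
    """Hypothetical stand-in for one of the small, cheap computers."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def chaos_monkey(instances, rng):
    """Disable one randomly chosen running instance and return it."""
    running = [i for i in instances if i.alive]
    victim = rng.choice(running)
    victim.alive = False
    return victim


def serve(instances, request):
    """Try each instance in turn, skipping the ones that have failed."""
    for instance in instances:
        try:
            return instance.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("no healthy instances left")


fleet = [Instance(f"server-{n}") for n in range(5)]
rng = random.Random(42)  # fixed seed so the run is repeatable
victim = chaos_monkey(fleet, rng)
response = serve(fleet, "play movie")
print(f"{victim.name} was disabled, yet: {response}")
```

The point of the exercise is the last two lines: one machine is gone, and the customer still gets their movie.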

In addition to Chaos Monkey, Netflix has a number of other members of the Simian Army. The Netflix descriptions of these fellows are a bit technical:

Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.
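As a rough illustration of the delay idea (again a sketch, not Netflix’s code), the Python below wraps a pretend service so that every call is artificially slowed down, then checks that the caller gives up and falls back to something sensible. The `recommendations` service, the `home_page` caller, the timeout, and the fallback list are all assumptions invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def recommendations(user):
    """Hypothetical dependency whose slowness we want to simulate."""
    return [f"top pick for {user}"]


def with_latency(func, delay_seconds):
    """Wrap func so that every call is artificially delayed."""
    def delayed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return delayed


def home_page(user, fetch, timeout_seconds=0.2):
    """Caller under test: fall back to a generic list if fetch is too slow."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch, user)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return ["popular titles"]
    finally:
        # Do not block on the slow call; let it finish in the background.
        pool.shutdown(wait=False)


fast = home_page("ann", recommendations)
slow = home_page("ann", with_latency(recommendations, 0.6))
print(fast, slow)
```

Nothing was actually switched off here. The dependency was only made slow, which is exactly the trick the description above relies on.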

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.
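The re-balancing part can be pictured with a very small Python sketch. The zone names, the traffic figure, and the even-split rule are assumptions made up for illustration; real traffic routing is far more involved.

```python
def rebalance(zones, total_traffic):
    """Split traffic evenly across the zones that are still healthy."""
    healthy = [name for name, up in zones.items() if up]
    if not healthy:
        raise RuntimeError("every zone is down")
    share = total_traffic / len(healthy)
    return {name: share for name in healthy}


zones = {"zone-a": True, "zone-b": True, "zone-c": True}
before = rebalance(zones, 900)

zones["zone-b"] = False  # the "gorilla" takes out an entire zone
after = rebalance(zones, 900)

print(before)
print(after)
```

Before the outage each of the three zones carries a third of the load. Afterwards the two survivors split it between them, and the total being served stays the same.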