The Mother-in-Law’s Guide to Software Testing

This is part of my Mother-in-Law’s Guide to Technology. My Mother-in-Law is a very smart woman even if she isn’t a “computer person.” The goal of this series is to take some big and treacherous sounding ideas and bring them down to earth.

Dearest Mother-in-Law,

Remember when you had kids and you told them to do stuff? And remember how they did what you told them, but that wasn’t always what you intended them to do? Well, that’s the way computer programs work.

Just like kids, computer programs will do what you tell them, but beyond that, all bets are off. They don’t do anything that directly contradicts what you said but that doesn’t mean they’ll do what you want them to do.

That’s why you need to test kids (or computer programs). Once you give them instructions, you need to make sure that the instructions are understood and that the kids will carry out the spirit of the instructions, not just the letter of them. You also need to make sure that the kids know what to do in ambiguous situations and are well positioned to defend against bad actors. You test the kids by watching them perform the task once or twice in a variety of circumstances. It’s much better to find any problems or ambiguity in your instructions before the kids are sent out into the wild.

Let’s look at two kinds of kids that are representative of common software errors. There are many others, but these are two of the big ones. Testing allows you to find these errors early in the process, preventing these issues from causing big problems later on.

The Happy Path Kid

This kind of kid is naively optimistic. He also may not be very bright. When you tell him something he always says, “Of course. Sure. I can do that!” But he almost certainly doesn’t understand all the nuances of what you want him to do. If the instructions don’t exactly match the situation he’s in, trouble quickly ensues. For example, imagine this situation:

You yell at him, “That was so dangerous. I told you never to cross the street when the light is red. What were you thinking! You almost killed yourself!”

Only to hear him say back, “But there was no red light. In fact, there was no light at all. So I crossed the street.”

When thinking about the Happy Path kid, make sure you examine all of the possible ambiguous situations so that he (or the program) doesn’t run into any confusing situations.
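For the programmers reading over your shoulder, here is roughly what the red-light instruction looks like as code. This is only a sketch: the function names and the light values ("red", "green", and None for "no light at all") are made up for illustration. The first function follows the letter of the rule, and the second tries to follow its spirit.

```python
from typing import Optional


def happy_path_should_cross(light: Optional[str]) -> bool:
    """Follows the letter of the rule: never cross when the light is red."""
    return light != "red"


def careful_should_cross(light: Optional[str], street_is_clear: bool = False) -> bool:
    """Follows the spirit of the rule: only cross when it is known to be safe."""
    if light == "green":
        return True
    if light is None:
        # No light at all: the original instruction never covered this case,
        # so fall back to looking both ways.
        return street_is_clear
    # Red, or any signal we do not recognize: stay put.
    return False
```

A test that tries the "no light" case catches the problem in the driveway: the Happy Path version crosses, while the careful version waits until the street is clear.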

The Adversary

Then there’s the kid who intentionally tries to skirt the rules. He’s far more clever than you are. He doesn’t break any rules but you’re continually surprised how he can turn a little imprecision into a HUGE opportunity for misadventure. He’s the kind of kid who always wins at Scrabble because he’s looking up all the words on his phone. You assumed he wouldn’t do that but you never SAID he couldn’t.

While the Happy Path Kid may encounter ambiguous situations and be led astray, the Adversary will look for weaknesses in your system and figure out how to exploit them.

Here are two stories from my time at Yale where I ran into real-life examples of these “problem children.”

The Happy Path Problem: Heads

This situation came up during my freshman year. It was a beautiful day and we were sitting outside in a seminar called “Perspectives on Science.” As we were sitting on the lawn under a tree, one of my classmates was demonstrating how probability works: if you flip a fair coin a large number of times, about half of the flips will be heads and the other half tails.

“Can someone give me a quarter?” she asked the group. My friend Christine excitedly reached into her pocket and grabbed a quarter. The first person flipped the coin.

“Heads,” they said.

“Heads,” the second said.







Flipping a Coin from Rosencrantz and Guildenstern are Dead

By this time the quarter had come back to the leader, who examined the coin. “Who walks around with a two-headed quarter?!” she blurted out in surprise. As it turns out, Christine did. At the time, we were obsessed with Tom Stoppard’s Rosencrantz and Guildenstern are Dead, which includes a scene where the laws of probability are broken by a coin that continuously lands on heads. So Christine got one.

The Adversary Problem: How to Cheat at Hangman

When I was at Yale, I took our most famous computer science class, CS223, with Professor Stanley Eisenstat. This is where I first learned about the adversary.

Professor Eisenstat Teaching How to Cheat at Hangman (pic via Twitter)


Professor Eisenstat wrote out a game of hangman on the board with 3 letters filled in:

_ I L L

We had 8 guesses to get this right. The class started shouting out different possibilities, very confidently at first.

This went on for quite a while as we gradually lost that confidence. Then Professor Eisenstat told us that there were many different words that this could be — far more than 8. Bill, Dill, Fill, Gill, Hill, Jill, Kill, Mill, Pill, Sill, Till, and Will. That’s 12 words if you’re counting. Here’s where the adversary comes in. Because Professor Eisenstat hadn’t committed to an answer beforehand, there was no way that we could win the game. When we chose a letter, he removed that word from the set of possible winners. He always had an option that we hadn’t chosen.
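Here is a small sketch of how that trick could be written down as a program. It is my own simplification, not Professor Eisenstat’s code: the function name is made up, and each guess is treated as a guess at the missing first letter. The adversary never picks a word. It just says “no” and crosses off whatever word the guess would have matched.

```python
WORDS = ["bill", "dill", "fill", "gill", "hill", "jill",
         "kill", "mill", "pill", "sill", "till", "will"]


def play_adversary(guesses, candidates=WORDS, max_guesses=8):
    """Return the words still possible after the class's guesses.

    The adversary never commits to a word. After each guessed first letter
    it answers "no" and keeps only the words that do not start with it.
    An empty list means the adversary ran out of words and the class won.
    """
    remaining = list(candidates)
    for letter in guesses[:max_guesses]:
        survivors = [w for w in remaining if not w.startswith(letter)]
        if not survivors:
            # Every remaining word starts with this letter: the adversary must concede.
            return []
        remaining = survivors
    return remaining
```

With 12 candidate words and only 8 guesses, calling `play_adversary(list("bdfghjkm"))` still leaves four words standing, so the class can never pin him down.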

Summing Up

We are in an age when software is part of everything we do. We don’t have finance anymore but finance + computers. We don’t have cars anymore but cars + computers. With software being such an integral part of everything we do, it’s even more important to ensure that software does what we intend it to do.

Let’s take cars for example. We are on the cusp of self-driving cars. The happy path for self-driving cars is pretty easy: driving on a highway on a sunny day. However, there are many unhappy surprises that the car will encounter, like people walking across a street carrying a giant piece of plywood or human drivers falling asleep at the wheel. The adversary in the self-driving car example is even more interesting. You have people trying to take advantage of the polite self-driving car, and street signs can be modified so that cars will incorrectly identify them.

Given how much we rely on computer code, let’s make sure that our software does exactly what we want it to do!

Note: A previous version of this post was titled Be Careful with Your Assumptions OR Who Would Have Thought that Would Happen?

The Mother-in-Law’s Guide to Cloud Computing

This is part of my “Mother-in-Law’s Guide to Technology.” My Mother-in-Law is a very smart woman even if she isn’t a “computer person.” The goal of this post is to take a very big and treacherous sounding idea and bring it down to earth. I tried this before in a post which I’ve now renamed The Mother-In-Law’s Guide to Chaos Engineering.

Dearest Mother-in-Law,

You know when we visit a Target or a Wal-Mart in the suburbs and they have 30 checkout lanes and only 3 are open at any time? I always wondered why that happens. It even sparked someone to write a funny blog post about the phenomenon: Target Store Opens More than Three Checkout Lanes; Shoppers Confused.

On a Normal Day, the Store Has Full Time Cashiers to Manage the Base Volume. When More People Come In, Part-Time Cashiers Will Be Engaged.

How many checkout lanes should Target build? At first, I thought they would take the number of customers Target has on an average day and build enough cash registers for that. If they have 300 customers in a day they would need enough cashiers to serve 300 people. The problem is that the flow of people into the store isn’t constant. For example, if the peak time of day is at 4 PM and there are 10 people in line, Target can’t tell those people to come back at a less busy time. So on a daily basis, they need to plan for this by making sure they have enough checkout counters (and cashiers) available to keep the lines down to a reasonable level even when it gets busy. The way most retailers do this is to have only a few full-time employees dedicated as cashiers for the slow times and some part-time cashiers who mainly do another job but can jump in when the store gets busy.

But that doesn’t answer the question of how many checkout lanes they need to build. Target needs to have enough checkout lanes so that even on the busiest days, they can bring in enough part-time cashiers to keep lines relatively short. This means that Target needs to build the number of checkout lanes that they need for the busiest time of the year, not for the peak time of an ordinary day. At Target, this is the Christmas shopping season starting with Black Friday. On Black Friday the store is filled with shoppers struggling to check out. This is the day that Target opens up all their checkout lanes. So even though the lanes sit unused for a good portion of the year, Target still needs to build the number of checkout lanes they need for Black Friday.

The Number of People on Black Friday is Much Greater Than That of a Normal Day. This Drives the Total Number of Checkout Lanes.

So what does this have to do with cloud computing? Cloud computing is like Target having these checkout lanes only when they’re needed, like on Black Friday. They wouldn’t have to pay for these checkout lanes at less busy times of the year. They would be able to create new ones during the Christmas season and get rid of them at other times of the year. How does this work? Instead of buying checkout lanes (or in the case of cloud computing, computers), they just rent what they need. This means that Target can increase or decrease their capacity based on the actual need from their customers.

Now let’s make the jump from Target to Cloud Computing by defining a few things:

  • Servers: These are computers that “serve up” the information you need. Just like the cashiers at Target, if a server isn’t available you’re going to have to wait in line.
  • Server Capacity: This is the total number of servers that can be available to provide information. Just like the number of checkout lanes at Target, once you’re out of checkout lanes, you can’t have any more cashiers.
  • Peak Request times: This is your Black Friday, the time when you get the most requests.

Now you can understand one of the key benefits of cloud computing:

Cloud computing provides flexible server capacity to meet demand during peak request times and releases that capacity during other periods.

So there you have it. In the real world, Target needs to build enough capacity (checkout lanes) to meet demand during peak request times (Black Friday). But the cloud computing model allows companies to greatly reduce their capacity during non-peak times because they can easily turn on or turn off this capacity.
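To put some numbers on the difference, here is a back-of-the-envelope sketch. Every figure in it is made up for illustration (100 requests per server per hour, a quiet day with a four-hour rush), and the function names are my own. It compares paying for peak capacity all day with renting only what each hour needs.

```python
import math
from typing import List

# Hypothetical figure, chosen only for illustration.
REQUESTS_PER_SERVER_PER_HOUR = 100


def servers_needed(requests_per_hour: int) -> int:
    """Smallest number of servers that can handle one hour's demand."""
    return math.ceil(requests_per_hour / REQUESTS_PER_SERVER_PER_HOUR)


def fixed_server_hours(hourly_demand: List[int]) -> int:
    """Build for the peak: pay for peak capacity during every hour."""
    peak = max(servers_needed(d) for d in hourly_demand)
    return peak * len(hourly_demand)


def elastic_server_hours(hourly_demand: List[int]) -> int:
    """Rent as needed: pay only for the servers each hour requires."""
    return sum(servers_needed(d) for d in hourly_demand)


# One made-up day: 20 quiet hours and a 4-hour rush.
demand = [250] * 20 + [3000] * 4
print(fixed_server_hours(demand), elastic_server_hours(demand))
```

In this made-up day the rush needs 30 servers and the quiet hours need 3, which is the "30 lanes built, 3 lanes open" picture from the store. Building for the peak costs 720 server-hours, while renting hour by hour costs 180.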

Note: You can actually see a checkout model like this (sans the physical checkout lanes) at Apple stores. They can easily increase or decrease capacity because they don’t have any physical checkout aisles. This allows for flexibility by just adding or removing salespeople to the store with their mobile checkout devices. 

Additional Resources: For more information on managing lines check out this quick overview from FiveThirtyEight. For more on Cloud Computing, take a look at Google Cloud Platform training or Amazon Web Services training. You can audit classes for free.



The Mother-in-Law’s Guide to Chaos Engineering

In this post, I’m trying to take something technical and make it (mostly) readable for my mother-in-law. Enjoy!

One big trend, especially for internet companies like Facebook, Google and Netflix, is not to have one massive computer anymore. This is an oversimplification, but computers used to be one big expensive box. The faster the computer you needed, the more money you spent. But eventually, a single computer big enough to meet the needs of today’s internet companies became too expensive. So Netflix (and others) started stitching together large supercomputers out of many smaller and cheaper computers by connecting them in clever ways.

The benefits of doing this are pretty amazing because you get a supercomputer that can do incredible things at very low cost. The problem is with each of the smaller computers. Because they’re so cheap, they can fail at any time. This means that Netflix has computers failing constantly. But customers don’t see this happening. So how does Netflix get this to work?

Netflix needs to make sure that all of its computers and systems are resilient. Using a car metaphor, Netflix is always able to swap in a spare tire if one gets a flat. On their blog, Netflix explains how they test this tire changing/computer failing problem:

Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.

This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.
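To make the idea concrete, here is a toy sketch of the "poke a hole in your own tire" drill. It is not Netflix’s actual Chaos Monkey code. The `Service` class, the `chaos_monkey` function and the instance names are all made up for illustration. The point is only that a service with spare instances keeps answering after one of them is knocked out at random.

```python
import random


class Service:
    """A toy service backed by several interchangeable instances."""

    def __init__(self, instance_names):
        self.healthy = set(instance_names)

    def handle_request(self):
        """Serve from any healthy instance; fail only if none are left."""
        if not self.healthy:
            raise RuntimeError("no healthy instances left")
        return f"served by {sorted(self.healthy)[0]}"


def chaos_monkey(service, rng):
    """Disable one randomly chosen healthy instance and return its name."""
    victim = rng.choice(sorted(service.healthy))
    service.healthy.remove(victim)
    return victim
```

A drill would create something like `Service(["a", "b", "c"])`, call `chaos_monkey(service, random.Random())`, and then check that `handle_request()` still returns an answer. If it raises an error instead, you have found a weakness on a Sunday afternoon in the driveway rather than on the freeway at night.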

In addition to Chaos Monkey, Netflix has a number of other members of the Simian Army. The Netflix descriptions of these fellows are a bit technical:

Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.
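The re-balancing that Chaos Gorilla checks for can be pictured with another toy sketch. Again, this is my own illustration and not how Netflix or Amazon implement it: the `rebalance` function, the zone names and the instance names are hypothetical. When a whole zone goes dark, replacements for its instances are spread evenly across the zones that are still working.

```python
def rebalance(instances_by_zone, failed_zone):
    """Spread replacements for the failed zone's instances across surviving zones."""
    survivors = sorted(z for z in instances_by_zone if z != failed_zone)
    if not survivors:
        raise RuntimeError("no surviving zones")
    # Copy the surviving zones so the original layout is left untouched.
    new_layout = {z: list(instances_by_zone[z]) for z in survivors}
    # Hand out the lost instances round-robin, one surviving zone at a time.
    for i, instance in enumerate(instances_by_zone.get(failed_zone, [])):
        new_layout[survivors[i % len(survivors)]].append(instance)
    return new_layout
```

A Chaos Gorilla style test would take a layout, declare one zone failed, and check that every instance still has a home afterwards.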