The Mother-in-Law’s Guide to Chaos Engineering

In this post, I’m trying to take something technical and make it (mostly) readable for my mother-in-law. Enjoy!

One big trend, especially for internet companies like Facebook, Google and Netflix, is not to have one massive computer anymore. This is an oversimplification but computers used to be one big expensive box. The faster the computer you needed, the more money you spent. But eventually, the computers became too expensive to possibly meet the needs of today’s internet companies. So Netflix (and others) started stitching together these large supercomputers out of many smaller and cheaper computers by connecting them in these clever ways.

The benefits of doing this are pretty amazing because you get a supercomputer that can do incredible things at very low cost. The problem is with each of the smaller computers. Because they’re so cheap, they can fail at any time. This means that Netflix has computers failing constantly. But customers don’t see this happening. So how does Netflix get this to work?

Netflix needs to make sure that all of its computers and systems are resilient. Using a car metaphor, Netflix is always able to swap in a spare tire if one gets a flat. On their blog, Netflix explains how they test this tire changing/computer failing problem:

Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.

This was our philosophy when we built Chaos Monkey, a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption. By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.
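For the programmers reading along, here is a minimal sketch of the idea in Python. It is a toy simulation under my own assumptions, not Netflix’s actual tool: the names `chaos_monkey`, `serve_request` and the `server-N` labels are all made up for illustration. It knocks out random servers and checks that requests can still be served.

```python
import random


def chaos_monkey(healthy_servers, rng):
    """Disable one randomly chosen server and return its name."""
    victim = rng.choice(sorted(healthy_servers))
    healthy_servers.remove(victim)
    return victim


def serve_request(healthy_servers):
    """In this toy model a request succeeds if any server is still healthy."""
    return len(healthy_servers) > 0


rng = random.Random(0)  # fixed seed so the run is repeatable
servers = {f"server-{i}" for i in range(5)}  # hypothetical fleet of 5 cheap machines

results = []
for _ in range(3):
    victim = chaos_monkey(servers, rng)
    results.append((victim, serve_request(servers)))

for victim, still_serving in results:
    print(f"disabled {victim}, still serving customers: {still_serving}")
```

The point of the sketch is that three of the five machines disappear and the customer never notices, because any remaining machine can pick up the work.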

In addition to Chaos Monkey, Netflix has a number of other members of the Simian Army. The Netflix descriptions of these fellows are a bit technical:

Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.

Iatrogenics OR When Doing Nothing Might Be the Best Alternative

i·at·ro·gen·ic /īˌatrəˈjenik/
Relating to illness caused by medical
examination or treatment.
— Google Definitions

I learned about the word iatrogenic when reading the book Writing to Learn by William Zinsser. The book, written in 1984, used the following passage as an example of medical writing. It talks about the link between medical prescriptions and opium addiction:

The medical profession has a long record of treating patients with useless or harmful remedies, often in clinical settings of complete mutual confidence. Iatrogenic diseases, complications and injury have been, in fact, common in the history of medicine. Only look upon addiction to certain dispensed drugs as one variation among the occasional effects of drug therapy.

I thought, “What an interesting new word!” as did Zinsser, who also had to look it up. Then I came across Nassim Nicholas Taleb’s book Antifragile and found that he also fell in love with the word and expanded the idea into a class of issues that he called iatrogenics that went beyond medicine.

Iatrogenics is different from malpractice. Malpractice is doing an operation wrong. Iatrogenics is about doing a treatment correctly and still causing harmful side effects. When doctors ignore these side effects, they are far more likely to use all the tools at their disposal, like drugs or surgery, whether or not it’s a good idea in the long term.

Let’s look at a recent example. The New York Times recently published Heart Stents Are Useless for Most Stable Patients. They’re Still Widely Used. While they have no medical benefit, putting in a stent makes both doctors and patients feel like they are doing something — that they are in control. And, from both points of view, “they seem to work,” even though they don’t work any better than a placebo.

So what’s the harm in that? Everyone’s happy, aren’t they? Well no, they’re not. Doctors are performing an operation that does no better than a placebo, so there’s no upside. However, there’s a significant downside in the complications from the operation.

Or take another example from a cruise I went on. Cruises offer Wi-Fi on the ship with tiny data limits (50MB for the whole trip). This is so small that just opening my phone would go over the limit. So a cruise director offered, “Give me your phone and I’ll make it work on the boat.” So I gave him the phone and he started turning off data-hogging applications. A few months later I realized that one of the things he turned off was my iCloud backup. So the decision that the cruise director made, without telling me, was to give me very limited internet functionality on the boat while turning off my critical backup capability.

Another way of looking at iatrogenics is as overvaluing short-term gains relative to long-term risks. Take the example of thalidomide, the poster child for drug overuse. Thalidomide was a sedative that was prescribed around 1960. While it helped women with morning sickness (a relatively minor problem), it caused more than 10,000 serious birth defects.

Indulge me with one more example. After George Washington left the presidency, he fell ill. His treatment was the standard for the day: bleeding. However, taking 5 to 7 pounds of blood from Washington’s body is now widely believed to have accelerated his death. Bleeding stayed around for a while after that. It was still recommended by leading doctors as late as 1909.

Taleb tells one story of how this problem goes beyond medicine and into finance:

One day in 2003, Alex Berenson, a New York Times journalist, came into my office with the secret risk reports of Fannie Mae, given to him by a defector. It was the kind of report getting into the guts of the methodology for risk calculation that only an insider can see—Fannie Mae made its own risk calculations and disclosed what it wanted to whomever it wanted, the public or someone else. But only a defector could show us the guts to see how the risk was calculated.

We looked at the report: simply, a move upward in an economic variable led to massive losses, a move downward (in the opposite direction), to small profits. Further moves upward led to even larger additional losses and further moves downward to even smaller profits.

At its core, this was what caused the financial crisis. It was people adding more and more risk for smaller and smaller gains. They failed to look at the downside risks which kept growing larger and larger because they couldn’t imagine that they would occur.
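To make the shape of that risk concrete, here is a small Python sketch. The formula is purely my own stand-in, not anything from the Fannie Mae report: I assume a hypothetical payoff of `1 - exp(x)`, which happens to have the lopsided shape Taleb describes, where `x` is the move in the economic variable.

```python
import math


def payoff(move):
    """Hypothetical payoff curve: small capped profits on the way down,
    accelerating losses on the way up. Not real Fannie Mae data."""
    return 1 - math.exp(move)


for move in (-2, -1, 0, 1, 2):
    print(f"move {move:+d}: payoff {payoff(move):+.2f}")

# How much each extra step adds in each direction.
first_loss = payoff(1) - payoff(0)
second_loss = payoff(2) - payoff(1)
first_profit = payoff(-1) - payoff(0)
second_profit = payoff(-2) - payoff(-1)
```

In this sketch each extra step upward costs more than the one before, while each extra step downward earns less. That is the pattern of adding more and more risk for smaller and smaller gains.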

Oddly enough, people don’t get in trouble for doing this. There’s a general sense that the people causing the problems were doing the best they could. The idea of “this is the best modern medicine (or modern finance) has, even if it doesn’t work” is well accepted. This is true even when the procedure is successful but the patient dies or the economy collapses.

A lot of this happens because the people making the decisions don’t have skin in the game. They get the upside benefits without being exposed to the downside risk. Taleb mentions that when Roman engineers built a bridge, they were required to sleep under it. Then, if the bridge fell down, the engineers would feel the pain (or death in this case) of the people who were hurt by the bridge.

So what can you do about all this? Try to get your doctor to put a little skin in the game. The next time you have an important medical decision to make, don’t ask your doctor for her medical opinion, ask her what she would do if she were in your place. This changes her mindset from a “disinterested professional” to someone with a personal stake in the game. You might get a very different answer.

Read this along with my story on back pain.

Prospect Theory in Real Life OR How Losing Feels Bad More than Winning Feels Good

I’m going to do a magic trick with a number. I’m going to take the number 1700 and, by doing nothing more than raising and lowering it, show how the interpretation of the number can dramatically change. Let’s see how that can happen and then I’ll explain how it works.

When my wife was pregnant with our second son, we had a test for Downs Syndrome. This test had three parts:

  1. A “nuchal” sonogram that measured some key ratios. This was the most important test and set the baseline.
  2. A blood test that measured blood proteins in the mother.
  3. A test of “soft markers” that refined the initial estimates based on other sonogram features.

So we had the initial test. The chance of an issue was 1 in 1700.

“Is that good?” we asked the doctor. “It sounds good to us.”

“Well, in order to be certain, you’d need to have an amniocentesis which has a 1 in 400 chance of serious problems,” said the doctor.

So 1 in 1700 is pretty darn good. Then we got the blood test back. The numbers were even better. Our chances now were 1 in 6800. That was 4 times better than we’d had before!

So we’d finished 2 of the 3 tests. Then, things got tough. We went in for a sonogram and the technician stopped at one point and said, “I need to get the doctor.” That’s never a good sign.

When the doctor came back he said, “Well, your child had 2 soft markers for Downs.”

“What does that mean?” we asked.

“Well, it means that your child has a higher chance of having Downs Syndrome. Maybe you should see a genetic counselor,” he said.

“Before we go down that route, how does this really alter our chances?” we asked.

“Well, we’re not really sure. One soft marker could double the chance of having Downs Syndrome. So 2 soft markers might increase the chance by as much as 4 times but it’s probably less than that,” he said.

“So you’re saying our chances are back to 1 in 1700.”

“Yes.”

See. Magic.

How did this happen? Behavioral Economics has an answer. In contrast to typical economic theory, Behavioral Economics looks at situations and sees how people really react — not how they would react in theory. The situation above is an example of Prospect Theory — the finding that losing something causes about twice as much pain as the pleasure you get from gaining something. So gaining and then losing the same amount still feels like a net loss.
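Here is the trick worked out as a short Python sketch. The odds come straight from the story. The “feeling” part rests on my own simplifying assumptions: I score each piece of news by how far it moved the “1 in N” number, and I use a hypothetical `felt_value` function where losses count exactly twice as much as gains.

```python
from fractions import Fraction

# The odds round trip from the story.
first_test = Fraction(1, 1700)
after_blood_test = first_test / 4          # "4 times better": 1 in 6800
after_soft_markers = after_blood_test * 4  # doctor's worst case: 4 times the risk


def felt_value(change, loss_aversion=2):
    """Gains count once; losses count `loss_aversion` times (assumed to be 2)."""
    return change if change >= 0 else loss_aversion * change


# Score each piece of news by the change in the "1 in N" number.
good_news = 6800 - 1700  # +5100
bad_news = 1700 - 6800   # -5100
net_feeling = felt_value(good_news) + felt_value(bad_news)

print(after_blood_test)                  # 1/6800
print(after_soft_markers == first_test)  # True
print(net_feeling)                       # -5100
```

The odds end up exactly where they started, but the score of how it felt comes out negative. Same number, very different experience.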

“Saving Money” by Paying More for Netflix

In an earlier piece, I talked about Netflix’s move to a subscription model. Why was this subscription model so important? It made me think of some research that I did about 10 years ago on how people think about money.

One of the big findings was that people have good and bad ways to spend money. For instance, spending money on the house and paying it off every month is a good thing. Having credit card debt every month is a bad thing. But you can get people to turn credit card bills into good payments when you show them how many airline miles they have “earned.” Notice how credit card companies use the word “earned” as opposed to “purchased.”

Getting back to Netflix, I remember talking to a middle class couple about their finances about 10 years ago. We were in suburban Chicago, sitting on their back porch on a warm spring day. We talked about how they were saving money. They started with normal things like eating out less and spending less money on clothes. But then they said, “We have Netflix so we’re saving money that way too.”

So I asked, “Do you watch a lot of movies?” figuring that they had calculated the cost of renting from Blockbuster vs. their Netflix subscription. This was before streaming. You got one DVD at a time but you could exchange it as many times as you wanted over a month.

“No. Not really,” they said. “But having Netflix means that we don’t have to rent DVDs anymore so that’s saving money.”

That always stuck with me because I bet they paid a lot more for their Netflix subscription than they ever paid renting videos at Blockbuster. But to them, this was saving money.

The Liar’s Paradox OR Today Is Not Opposite Day

Today my son Blake started telling me that “Today is Opposite Day!” and then said things like “I love doing my homework. Just kidding. It’s opposite day!”

I told him that he couldn’t possibly be telling me that today is opposite day. If it were opposite day and he was telling me it was opposite day, then it wouldn’t be opposite day. And if it’s not opposite day and he told me that it was opposite day, it would be opposite day. It’s a cycle that never ends. Formally the sentence, “This is opposite day” is neither true nor false and therefore is undefined.

The Infinite Circle of Opposite Day

This, of course, prompted his friend Gabe to try to explain it all to me. “It’s complicated,” he said, “you see, if we say it’s opposite day then we would say that it’s not opposite day to mean that it really is opposite day.” But I didn’t find this line of argument compelling.

Blake tried a different tack, “We can say that Wednesday is opposite day.”

“Yes,” I said, “but you can’t say that on Wednesday.”

This is an ancient logical paradox called the Liar’s Paradox, which often takes the form of “I am a liar” or “This sentence is false.” The paradox arises because the sentence is both self-referential and negative.
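For the logically inclined, the argument can be checked by brute force in a few lines of Python. This is a sketch under my own framing: the function name `utterance_holds` is made up, and I assume that on opposite day every statement means its opposite. It tries both possible worlds and looks for one where saying “Today is opposite day” comes out true.

```python
def utterance_holds(is_opposite_day):
    """Check whether saying "Today is opposite day" is true in a given world."""
    # The words themselves claim that today IS opposite day.
    literal_claim = True
    # Assumed rule: on opposite day, every statement means its opposite.
    meant_claim = (not literal_claim) if is_opposite_day else literal_claim
    # The statement holds only if what is meant matches the actual world.
    return meant_claim == is_opposite_day


consistent_worlds = [world for world in (True, False) if utterance_holds(world)]
print(consistent_worlds)  # []
```

The list comes back empty: there is no world, opposite day or not, in which the sentence works.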

I figure it’s never too early to teach the kids about logic and paradox. It also makes Blake be more specific about opposite day. The inherent problem with opposite day is that kids randomly choose which items are opposite and which are not (e.g., the sentence “It’s opposite day” is not negated). Now he needs to say “If it were opposite day, I’d say that I love doing my homework.”

Game Theory for Parents

When I was in business school I had a wonderful teacher, Adam Brandenburger, who wrote a book called Co-Opetition. The book is chock full of lessons on how to apply mathematical game theory to business.

In the book, I learned how to fairly divide things between two companies. But it also works for dividing things between my two kids without them getting upset (formally called Envy Free division). If you have kids, you know that this is a non-trivial problem. Let’s use the example of a cupcake. The most obvious thing is to split the cupcake in half and distribute the two equal pieces to the children. Of course, because you can never cut the cake directly in half, one of the kids is going to complain of unfairness.

The book explains a better way called “I cut, you choose.” One child (normally the older one who’s better with a knife) makes the cut and the other child gets to choose the piece he prefers. This forces the cutter to create two pieces that are as close to equal as possible because he knows that he’s going to get the piece that’s second best.
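Here is a minimal Python sketch of the procedure. The setup is my own simplification: the cupcake is modeled as a row of small slices, each child has a list of how much he values each slice, and the function name `cut_and_choose` and the example numbers are hypothetical.

```python
def cut_and_choose(cutter_values, chooser_values):
    """Divide a row of slices between two children.

    The cutter picks the cut that makes both sides as equal as possible
    in his own eyes; the chooser then takes the side he prefers.
    """
    total = sum(cutter_values)
    # Try every cut position and keep the one with the smallest imbalance.
    best_cut = min(
        range(1, len(cutter_values)),
        key=lambda i: abs(2 * sum(cutter_values[:i]) - total),
    )
    left_for_chooser = sum(chooser_values[:best_cut])
    right_for_chooser = sum(chooser_values[best_cut:])
    chooser_takes = "left" if left_for_chooser >= right_for_chooser else "right"
    cutter_takes = "right" if chooser_takes == "left" else "left"
    return best_cut, chooser_takes, cutter_takes


# Hypothetical tastes: the cutter likes every slice equally,
# while the chooser loves the frosting-heavy right side.
cutter = [1, 1, 1, 1]
chooser = [1, 1, 2, 4]
cut, chooser_side, cutter_side = cut_and_choose(cutter, chooser)
print(cut, chooser_side, cutter_side)  # 2 right left
```

The cutter ends up with exactly half by his own measure, and the chooser walks away with the side he likes better, so neither has grounds to complain.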

This worked well and inspired me to try other systemic solutions to child problems. Here’s the way I solve the problem of two kids sharing an iPhone (or iPad) when watching a movie. Normally the child who’s holding the iPhone will slowly and unconsciously move the phone closer to him, ignoring his sibling. Eventually, the phone gets so far away that I hear, “Hey, I can’t see the phone!”

I’ve been able to solve this problem by having each child keep one hand on the phone. Instead of one child controlling the phone, they are sharing control of the phone. This imperceptible pulling between the two children tends to leave the phone nicely spaced between them. You’d think you’d have constant fighting between the two kids, and you do! But the fights are so small that neither kid notices.

It Works in Practice but Does it Work in Theory OR The Fairy Tale of John Sarno and The Miracle Cure

Once upon a time, there lived two brothers, John and Steve. John was a television reporter for ABC’s 20/20. Steve was on the faculty of Harvard Medical School. They both had horrible back pain.

They’d searched far and wide for a magical solution to cure their back pain. They tried every contraption and theory supplied by doctors. Unfortunately, nothing helped.

One Contraption That Steve Is Using to Ease His Neck Pain 

One day John met a shaman (doctor) named John Sarno. John Sarno had a magical cure for back pain. If you just said the magic words and believed them, your back pain would be cured. You had to say:

  1. There is nothing wrong with my back.
  2. The pain is all being generated by my head. It’s my brain trying to distract me from the emotional rage that I’m feeling based on repressed Freudian memories.

When John woke up the next morning his back pain was cured.

“Steve!” said John, “I have the most amazing news! I found the miracle cure. You just have to say the magic words.”

“John, you know I can’t do that,” said Steve. “I’m a doctor and don’t believe in magic. Besides, Freud’s theories on repressed emotions were discredited long ago.”

So John lived happily ever after while Steve stayed in back pain.

(End of fairy tale)

This story is a close adaptation of reporter John Stossel’s segment about John Sarno on 20/20 from 1999. His brother Steve was teaching at Harvard Medical School at the time.

When I first heard the story, I remember thinking that Steve was right. John Sarno was obviously blowing smoke. In the years since then, I’ve realized that it’s not quite so simple.

The basic problem here was that what Sarno said seemed to work even though his theory was tragically flawed. Sarno’s theory involved Freudian repressed rage, a theory that was discredited a century ago. He was clearly grasping at straws.

But the part that made sense was that the pain wasn’t anatomical. It was coming from your head. This article from Vox does a good job summarizing Sarno. In the article, Cathryn Jakobson Ramin, author of the book Crooked: Outwitting the Back Pain Industry and Getting on the Road to Recovery says, “What he recommended as treatment was essentially cognitive behavioral therapy — elimination of fear avoidant behavior and catastrophizing — before anyone had ever heard of it and it’s exactly what is being used now to treat patients with central sensitization.”

My experience with chronic pain started around 2000. I had horrible wrist pain that wouldn’t go away. Luckily, all my doctors told me that I didn’t need surgery, though they didn’t give me any options for what I should do instead. Then I found Lisa Sattler, who is one of the world’s best physical therapists for carpal tunnel. After a year of physical therapy with her, the pain went away.

The pain came back a few years later, this time as back pain. I found the book Back RX to be very helpful. But as the back pain persisted, I got an X-ray that showed a bone spur in my hip. “Aha!” I thought, “I’ll have surgery and the pain will go away.”

“Not so fast,” said the surgeon. “Why don’t we inject some strong painkiller right into your hip? If the pain goes away we’ll do the surgery. If it doesn’t go away, the surgery won’t help.”

So I went into the doctor’s office and lo and behold, the pain didn’t go away with the painkiller. That got me thinking about Doctor Sarno again. I started to realize that the more stressed I became, the more my back hurt. Also, the pain would move around a lot which doesn’t make a lot of sense from an anatomical perspective. At this point in my life, I still don’t think Sarno’s theory makes sense; however, if I sit down and meditate, I can make most of my soft tissue pains go away.

Stories of Great Product Managers

Marty Cagan is one of the great Product Management gurus. His book Inspired is the bible of product managers. Product Management is a field with lots of advice and best practices but light on practical examples. In this talk, Behind Every Great Product (summary here), Marty shows how great product managers deliver great products. He gives 6 examples of product managers who embody the art and science of Product Management.

Product Management is really about two things: defining what you want the product to be and then executing against it. Jeff Bezos says it’s about being “stubborn on vision and flexible on the details.”

Marty tells the story of 6 great product leaders who are stubborn on vision and flexible on details. They all see a huge opportunity, run into unexpected challenges and manage to come out of it with flying colors.

Take, for example, the story of Kate Arnold. In 1999, Kate was a Product Manager at Netflix. At the time, Netflix was renting DVDs by mail. They were being pummeled by Blockbuster, the industry’s 800-pound gorilla. Kate worked to move Netflix to its first subscription model. While the model worked great, it created a problem because now everyone wanted to borrow the newest movies (which were the most expensive for Netflix), putting Netflix in the red. So Kate needed to convince people to watch some older movies. She created classic features like recommendations and queuing that got people to watch a mix of both old and new movies they loved, making them happy and allowing Netflix to stay profitable.