DevOps

Today’s reader question comes from Marc about my article on Chaos Monkeys and the Simian Army.

I can’t speak for your mother-in-law, but I find this fascinating. Do the testing of problems actually slow down the system, much like as if I were changing a flat tire every week?

— Marc

What A good question Marc! It’s a question that many people have but rarely ask. This type of testing does slow down the system a little, but the benefits outweigh the costs. It’s like asking “Doesn’t sleeping 8 hours a day make you less efficient? Wouldn’t you be more efficient if you worked the whole 24 hours?”

The key is that failure is baked into this model. Think about if you have 1000 wheels on the car instead of 4. Now each of these wheels is rated to be replaced every 3 years. So each day you can expect about one tire to go flat a day. But that’s on average.

One wheel breaking every day is a pretty easy thing to recover from. But if you have 1000 tires, some crazy things could happen that are very hard to predict ahead of time. What happens if multiple tires go out? What happens if three tires go out that are next to each other? What happens if the front right tire and the front left tire go out at the same time?

It’s much more complicated in Netflix’s case because you have many different types of systems that are interdependent. That’s why Netflix tests all these different contingencies. Yes, there’s a slight overhead in doing this but it allows you to ensure that the system is robust. Also, Netflix wants to make sure that if any single component fails, the system degrades gracefully. For example, if the recommendations system goes down, Netflix should display generic recommendations like new movies or fan favorites and everything else should work fine.

What we’re really talking about is humility in our ability to design a system perfectly up front. In order to run a system at 100% optimal efficiency, you’d need to be able to predict everything that could go wrong and also what may unexpectedly change in the future. For a long time, people have worked hard at making this planning process better. However, trying to make the planning process perfect starts to take more and more time and cuts into the efficiency of the project. Also, and this may be obvious, it’s impossible to plan perfectly for the future.

This is why most software development is moving from a traditional “waterfall” design to a more “agile” design. People used to think that you should build software like you build a building. This was called “waterfall” because you start at the top with your strategy and that design flows down all the way into execution. You make highly detailed plans and then take years to build it. However, we’ve realized over the years that we can solve most key business problems without building the whole software project — we just build the parts that matter. Also, people can start using the software before it’s done — which lets us revise the plans on a regular basis as we see how it’s used. We’re accepting that we don’t know the total plan. We have an idea of where we want to go in five years, but we only know what we’re building over the next few months.

But why does agile work? Isn’t it more efficient to do all the planning first and then build it? Yes. But not in a way you think. The waterfall method is more efficient for software; however, it’s not what’s best to get the job done. When people start a project, they don’t actually know everything that the product needs to do up front. They wish they could know 100% of the system in the beginning but they never actually do that. Think about trying to design a user interface five years ago. Would you even have considered building a voice interface like Amazon’s Alexa? Of course not. So if you built your 5-year-product roadmap, it wouldn’t have even considered a voice interface.

This idea of breaking projects down into small parts goes beyond software. Bent Flyvbjerg (researcher in project planning with a super awesome name) has found that larger projects are more likely to have cost overruns. However, it’s not necessarily about the size of the project itself, it’s much more related to the size of the segment of the project. Public works projects like building a bridge or a dam, which can only be built in one large chunk, are more likely to have cost overruns than a road, which can be built in small segments.