We’ve had a perfect record with production deployments where I work: we have never had to roll back after deploying an upgrade. Sure, we’ve hit speed bumps, but we were always able to figure them out and correct them before the busy part of the day.

The reaction I’m almost sure you’re having is: “Awesome! Great job!”

Certainly that’s one point of view. Not failing is a good thing.

But it’s also expensive.

If you’re working on life-critical systems, then the cost of failure is very high: people might die. If you’re engineering a plane, you want and need to make sure everything is perfect before you set it loose in the wild. But all of that comes at a cost. (I, for one, am happy to pay for the more expensive but safer plane when I travel!)

Whenever you tackle a problem, you have to take the cost of failure into account. In software, for the most part, that cost is relatively low. Of course there are life-critical systems and systems that run the stock exchange, but those are comparatively rare.

The reaction I’m having now is more along the lines of: “Crap, we need to run faster and take more risks!”

By taking more risks you can deliver value to the business faster. If something blows up in your face, just roll back and regroup. This of course requires institutional change: failure has to be an option. The cost of failure is that you roll back and try again. No hard feelings.

This then shifts a bunch of other things as well:

  • The business needs to buy in that this is a fair trade of risk for reward
  • Rollbacks need to be cheap and easy
  • The cost of deployment needs to be low
  • Everyone needs to accept that you can never be 100% confident you’ve covered 100% of the use cases

If you run fast enough, you’ll trip once in a while. If you only ever took the safest option, you’d never get out of bed in the morning. Accept that you might stumble now and then, and you can run.