Important to: VP Operations, CTO
Definition:When a production failure occurs, how long does it take to recover from the issue?
How to measure: Different between systems. Common metric is production downtime avg over last ten downtimes
Expected outcome: MTTR should become lower and lower as DevOps maturity grows.
MTTR vs Mean time to Failure
To understand MTTR, we have to first understand its evil brother: MTTF or "Mean time to failure" which is used by many organizations' operational IT departments today.
MTTF means "How much time passes between failures of my system in production?".
Why is MTTR more valueble?
There are two arguments in favor of putting MTTR as a higher priority than MTTF (we don't want to ignore that we fail often, but it's not as important as MMTR).
If Amazon.com was down once every three years, but it took them a whole day to recover, consumers won't care that the issue has not happened for three years. All that will be talked about is the long recovery time. But if Amazon.com was down for 3 times a day for less than one second, it would barely be noticeable.
Let's consider the incentive MTTF creates on an operations department: The less often failure happens in the first place, the more stable your system is, the more bonus you get at the end of the year in your paycheck.
What's the best way to keep a system stable? Don't touch it!
Avoid any changes to the system, release as little as possible, in a very controlled, waterfall-ish manner to make sure releasing is so painful that the person on the other end really really wants to release, and would go through all the trouble of doing so.
This "Stability above all else" behavior goes exactly against the common theme of DevOps: release continuously and seize opportunities as soon as you can.
The main culprit for breeding this anti-agile behavior is a systematic influence problem: What we measure influences people into doing the behavior that hurts the organization (You can read more about this in my book "Elastic Leadership" in the chapter about "Influence Forces" (here is a blog post that talks about them in more details).
Developers can't rest on their laurels and claim that operations are the only ones to blame for slowing down the continuous delivery train here. Developers have their own version of "Stable above all else" behavior which often can be seen in their reluctance to merge their changes into the main branch of source control (where the build pipelines get their main input from).
Ask a developer if they'd like to code directly on the main branch and many in enterprise situations will tell you they'd be afraid of doing so in the fear of "breaking the build". Developers are trying to keep the main branch as "stable" as possible so that the version going off to release (which is in itself a long and arduous process as we saw before), has no reason to come back with quality issues.
"Breaking the build" is the developer's version of "mean time to failure", and again, here the incentives from management are usually the culprit. If middle managers tell their developers it's wrong to break a build, then they are driving exactly the same fear as operations have. The realization that "builds are made to be broken" is a bit tough to swallow for developers who fear that they will hold up then entire pipeline and all other teams that depend on them.
Again, the same thought here applies: Failure is going to happen, so focus on the recovery aspect: how long does it take to create a fix for something that stops the build? If you have automated unit tests, acceptance tests and environments, and you're doing test-driven development, fixing an issue that stops the pipeline can usually be a matter of minutes: code the fix, along with tests, see that you didn't break anything, and check it into the main branch.
Both operations and development have the same fear: don't rock the boat. how does that fit in with seizing opportunities as quickly as possible? How does this support "mean time to change" to be as fast as possible?