Always start with a "Why" when transforming to DevOps.

I’ve been part of quite a few DevOps and Agile transformations, big and small. Some just want to raise their test coverage (not recommended!), others want to revamp their team structures, and some want to change their process.

There are other points of view as well: Many internal and external consultants who want to make a difference try changing things but end up failing for many reasons.

My advice, which I’ve learned the hard way (like anything of true value), is to have a good “Why” in your head at all times.

“Why am I doing this?”, “What’s the point of this change?”, “What are we trying to fix by changing this?”, “What is the cost of NOT changing this?”

You have to have a really good answer for all of these, which are just variations of the age-old “What’s your motive?” question.

Not only do you need to have a good answer, you need to fully believe that this answer is what would really drive things for the better. If you’re aiming to change something and you yourself are not sure there’s any point in doing it, or that it could actually succeed; if you can’t even imagine a world where that success has been accomplished, a reality where things have changed (be it a month from now or five years from now), it’s going to be very difficult.

Change always has detractors and naysayers. And even though from their point of view there might be truth to their reasons for not agreeing with you, their truth should not hinder your own belief that this change is needed.

Where this “Why” comes in handy the most, for me, is when someone comes with a really good argument for “Why not”. You can’t know everything, and you can bet on being surprised by some of the good reasons people give for why things shouldn’t be done, or cannot be, or why they think things will not succeed or are doomed to make things worse. It’s really handy because at those dark times, when you’re unsure even yourself that this is the right thing to do, you can come back to your original “Why” and remind yourself why this makes sense to you. You can use that “Why” to measure any fact you encounter, and see whether it still holds true. If it no longer holds true, you might have truly discovered or learned something new and may need to change strategy or goals.

Here’s an example of “Why” that I have used:

“Developers should have fast feedback loops that enable them to move fast with confidence”.


  • Teach the team about trunk-based development (working off a single trunk)

  • Teach about feature flags

  • Help the team design a new branching model that reduces the number of merges

  • Help the team design a pipeline that triggers on each commit

  • Get the pipeline to run fast, in parallel, on multiple environments

  • Teach the team about unit testing, TDD and integration tests
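
The feature-flag item on that list can start as something as small as a dictionary lookup. Here is a toy Python sketch (all names hypothetical) of how unfinished work can live on trunk without being exposed:

```python
# A minimal feature-flag sketch (all names hypothetical). Both code paths
# are merged to trunk, but the new one stays dark until the flag flips.
FLAGS = {"new_checkout_flow": False}

def is_enabled(name: str) -> bool:
    # Unknown flags default to off, so half-finished code stays dark.
    return FLAGS.get(name, False)

def legacy_checkout(cart):
    return f"legacy total: {sum(cart)}"

def new_checkout(cart):
    return f"new total: {sum(cart)}"

def checkout(cart):
    # The flag decides which path runs; no long-lived feature branch needed.
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)

print(checkout([10, 5]))  # legacy total: 15
```

In this toy model, flipping `FLAGS["new_checkout_flow"]` to `True` routes traffic to the new path; in real systems the flag store would be external so it can change without a redeploy.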

That should get a team far enough on that “Why”. And here’s what happens in real life, especially in large organizations:

Every bullet point above will be challenged, cross-examined, denounced as the devil, offered multiple variations that seem close but are not, and in many teams very few people will support it outright.

But this “Why” is what keeps my moral compass steady. No matter what the discussion is, and especially if it becomes confusing or goes deep into alternative-suggestion land, I keep asking myself whether that “Why” is being answered. And I can use it as my main reason for each one of these things, because it is absolutely true in my mind: these points help push that “Why”, and doing them in other ways might seem close but can push us farther away from it.

Always have a “Why” - and you’ll never be truly stumped, even when confusion takes over.

You could have many “Whys” - and not all of them will be in play at all times. But for each action you push for, there should be a “Why” behind it.

It reminds me a bit of TDD: any production code has a reason to exist, namely a failing test that at some point pointed out that this functionality was missing. Tests are the “Why” of code functionality.
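
As a tiny illustration of that analogy, here is a sketch (`slugify` is an invented example, not from the original post): the test states the missing behavior, and the production code exists only to answer it.

```python
import unittest

# The failing test came first; it is the "Why" of the code below.
def slugify(title: str) -> str:
    # Exists only because a test said this behavior was missing.
    return title.strip().lower().replace(" ", "-")

class TestSlugify(unittest.TestCase):
    def test_title_becomes_url_slug(self):
        self.assertEqual(slugify("  Hello World "), "hello-world")
```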

In DevOps and Continuous Delivery, a big “Why” can be “We want to reduce our cycle time.” From that you can derive smaller “Whys”, but they all live up to the big one, and support it.

Ten DevOps & Agility Metrics to Check at the Team Level

When I coach teams that are getting into the DevOps and Continuous Delivery mindset, a common question that comes up is "What should we measure?"

Measuring is a core piece of change - how do you know you're progressing without measuring anything?

Here are ten ideas for things you can measure to see if your team is getting closer to a DevOps and continuous delivery skillset. It's important to realize that what we are measuring are end symptoms, i.e. results. The core behaviors that need to change can vary quite a bit, but at the end of the day we want to see real progress in things that matter to us from a continuous delivery perspective.


  1. Cycle time (you want to see this number going down). If you put a GoPro on a user story, from the moment it enters the mind of a customer or product owner, and track everything it goes through until it is live in production, you get a calendar-time number that represents your core delivery cycle time. It can take weeks, months and sometimes years in large organizations, and it is usually a big surprise. I'll write about this more in a separate blog post. The idea is to see cycle time reduced over time, so you actually deliver faster and become more competitive.
  2. Time from red build to green build (you want to see this number going down). Take the last few instances of a red-to-green build (count from the first red build until the first green build after it) to get how long, on average, it takes to make a red build green. This measures how effective your team is at dealing with a build failure. Build failures are a good thing: they tell us what's really going on, and we should not avoid them. But we should be taking care of them quickly and efficiently (for example, you can set up "build keeper" shifts: every day someone else is in charge of build investigations and pushing the issue to the right people in the team).
  3. Number of open pull requests on a daily basis, and closed pull requests, coupled with the average time a pull request stays open (you want to see closed requests going up, request time going down, and open requests stable or going down). This gives us a measure of team communication and collaboration: how often code gets reviewed, and how often code is stuck waiting for a review. A rising trend of open pull requests could mean the team has a bottleneck in the code review area. The same is true for very long pull request times.
  4. Frequency of merges to trunk (this should be going up or staying stable). If your code gets merged to trunk only every few days or weeks, whatever your build pipeline is building and delivering is days- or weeks-old code. It is also a path to many types of risk: not getting feedback fast enough on how your code integrates with everyone else's, your code not being deployed and available to turn on with a feature flag, and in general it's a pathway for people who are afraid of exposing their work to the world, potentially creating hours and sometimes days of pain down the line.
  5. Test code coverage, coupled with test reviews (you want to see this go up or stay stable at a high level, while watching closely for the quality of the reviews). I always like to say that low code coverage means only one thing: you are missing tests. But high code coverage is meaningless unless the tests are code reviewed, because human nature leads us to fulfill whatever we are measured on; sometimes you can see teams writing tests with no asserts just to get high code coverage. This is where the code reviews come in.
  6. Number of tests (this should obviously be going up as you add new functionality to your product).
  7. Pipeline run time (this should be declining or staying at a low level). The slower your automated build pipeline is, the slower your feedback is. This helps you know whether the steps you are taking also shorten the feedback cycle.
  8. Pipeline visibility in team rooms (you want to see this go up or stay stable at a high level). This metric tells you about commitment to visual indicators, information radiators and the like. It's a small but important part of a team's non-verbal communication, and it increases the team's ability to respond quickly to important events.
  9. Team pairing time (should be going up or staying stable at a medium or high level). We can measure this to see whether knowledge sharing is going on.
  10. Number of feature flags (should be going up as the team learns about feature flags, and then stay stable; if it continues to increase, it means you're not getting rid of feature flags fast enough, which can lead to trouble down the line).
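
Several of these metrics are easy to compute from data your CI server already has. As a sketch, here is metric #2 (red-to-green time) computed in Python from a hypothetical list of build records:

```python
from datetime import datetime, timedelta

def red_to_green_durations(builds):
    """Given (timestamp, status) build records in order, return the duration
    of each red streak: from the first red build to the first green after it."""
    durations, red_start = [], None
    for ts, status in builds:
        if status == "red" and red_start is None:
            red_start = ts                    # streak begins
        elif status == "green" and red_start is not None:
            durations.append(ts - red_start)  # streak ends
            red_start = None
    return durations

# Invented sample data: one red streak lasting one hour.
builds = [
    (datetime(2018, 1, 1, 9, 0), "green"),
    (datetime(2018, 1, 1, 10, 0), "red"),
    (datetime(2018, 1, 1, 10, 30), "red"),
    (datetime(2018, 1, 1, 11, 0), "green"),
]
print(red_to_green_durations(builds))  # [datetime.timedelta(seconds=3600)]
```

Averaging the returned durations over the last few weeks gives the trend line you actually want to watch.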

Two bonus metrics:

  1. Feature size estimate (should be staying stable or going down). Helps track how well the team estimates feature sizes, or check the variance of the feature sizes you estimate.
  2. Bus factor count (should be going down and staying down). How many people are single points of failure whose absence would stop the team's work.

Ephemeral Environments for DevOps: Why, What and How?

In the previous post I discussed the issues with having static environments. In this post I cover one of the solutions for those issues: Ephemeral environments.

Ephemeral environments are also sometimes called “Dynamic environments”, “Temporary environments”, “on-demand environments” or “short-lived environments”.

The idea is that instead of environments “hanging around” waiting for someone to use them, the CI/CD pipeline stages are responsible for instantiating and destroying the environments they will run against.

For example, we might have a pipeline with the following stages:

  1. Build
  2. Test: Integration & Quality
  3. Test: Functional
  4. Test: Load & Security
  5. Approval
  6. Deploy: Prod

(Image courtesy of CloudBees)

In a traditional static environment configuration, each stage (perhaps except the build stage) would be configured to run against a static environment that is already built and waiting for it, or some stages might share the same environment, which causes all the issues I mentioned previously.

In an ephemeral environment configuration, each relevant stage would contain two extra actions: one at the beginning that spins up an environment for the stage to test against, and one at the end that spins it down.

The first stage (1) compiles the code and runs fast unit tests, then puts the binaries in a special binary repository such as Artifactory.

There is also a stage (2) that creates a pre-baked environment as a set of AMIs or VM images (or containers) to be instantiated later:

  1. Build & Unit Test
    1. Build binaries, run unit tests
    2. Save binaries to artifact management
  2. Pre-Bake Staging Environment
    1. Instantiate base AMIs
    2. Provision OS/middleware components
    3. Provision/install the application
    4. Save the AMIs for later instantiation as the STAGING environment (in places such as S3, Artifactory etc.)
  3. Test: Integration & Quality
    1. Spin up staging environment
    2. Run tests
    3. Spin down staging environment
  4. Test: Functional
    1. Spin up staging environment
    2. Run tests
    3. Spin down staging environment
  5. Test: Load & Security
    1. Spin up staging environment
    2. Run tests
    3. Spin down staging environment
  6. Approval
    1. Spin up staging environment
    2. Run approval tests / wait for approval, providing a link so humans can look into the environment
    3. Spin down staging environment
  7. Deploy: Prod
    1. Spin up staging environment
    2. Replicate data
    3. Switch DNS from the old production environment to the new one
    4. Spin down the old production environment (this is a very simplistic solution)
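
The spin-up/spin-down bookends around each stage map naturally onto a context manager. This is a sketch with hypothetical provisioning calls standing in for your real cloud tooling:

```python
from contextlib import contextmanager

# Hypothetical provisioning calls; in real life these would shell out to
# your cloud tooling (CloudFormation, Terraform, pre-baked AMIs etc.).
def provision_from_image(image_id):
    print(f"spinning up environment from {image_id}")
    return {"image": image_id, "url": "https://staging.example.internal"}

def destroy(env):
    print(f"spinning down environment built from {env['image']}")

@contextmanager
def ephemeral_environment(image_id):
    env = provision_from_image(image_id)
    try:
        yield env        # the stage runs its tests against this env
    finally:
        destroy(env)     # always torn down, even if the tests fail

def run_functional_stage():
    with ephemeral_environment("prebaked-staging-ami") as env:
        print(f"running functional tests against {env['url']}")

run_functional_stage()
```

The `finally` clause is the important part: the environment is destroyed even when the tests fail, so nothing hangs around long enough to rot.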

A few notes:

Pre-Baked Environment

Notice that the environment we are spinning up and spinning down is always the same environment, and it is a staging environment with the application pre-loaded on top of it.


Staging environments are designed to look exactly like production (in fact, in this case, we are using staging as a production environment in the final stage).

The reason we always use the same environment template is that:

  • It provides environmental consistency between all tests and stages, and reduces false positives and negatives: if something works or doesn’t work, it is likely to behave the same way in production.
  • Environments are pre-installed with the application, which means we are always testing the exact same artifacts, so we get artifact consistency.
  • Because environments are pre-installed, we are also implicitly testing the installation/deployment scripts of the application.

Only One Install

Also notice that there is no “installation” after the pre-baking stage, which means we also don’t “deploy” into production. We simply instantiate a new production environment in parallel.

We “install once, promote many”, which means we get installation consistency across the stages.

Blue-Green Deployment

Deploying to production just means we instantiate a new pre-baked environment in the production zone (for example, a special VPC if we are dealing with AWS), which runs in parallel with the “real” production. Then we slowly soak up production data, let the two systems run in parallel, and eventually either switch DNS to the new servers or slowly drain the production load balancer into them (there are other approaches to this that are beyond the scope of this article).
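
To make the cutover mechanics concrete, here is a toy Python model (hypothetical names, no real DNS involved) of the blue-green switch described above:

```python
class BlueGreenRouter:
    """Toy model of a blue-green cutover: two identical environments run
    side by side, and "deploying" is just flipping which one is live."""

    def __init__(self, live_env, idle_env):
        self.live, self.idle = live_env, idle_env

    def route(self, request):
        # All traffic goes to whichever environment is currently live.
        return f"{self.live} handled {request}"

    def cut_over(self):
        # In practice this is a DNS update or a load-balancer drain; here
        # we just swap pointers. The old environment stays up, which makes
        # rollback a second cut_over() call.
        self.live, self.idle = self.idle, self.live


router = BlueGreenRouter("blue (v1.0)", "green (v1.1)")
print(router.route("/checkout"))  # blue (v1.0) handled /checkout
router.cut_over()
print(router.route("/checkout"))  # green (v1.1) handled /checkout
```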


Another advantage of this setup is that because each stage can have its own environment, we can run some stages in parallel. In this case we can run all the various test stages in parallel, which will save us valuable time:

  1. Build & Unit Test
    1. Build binaries, run unit tests
    2. Save binaries to artifact management
  2. Pre-Bake Staging Environment
    1. Instantiate base AMIs
    2. Provision OS/middleware components
    3. Provision/install the application
    4. Save the AMIs for later instantiation as the STAGING environment (in places such as S3, Artifactory etc.)
  3. In parallel:
    1. Test: Integration & Quality (spin up staging environment, run tests, spin it down)
    2. Test: Functional (spin up staging environment, run tests, spin it down)
    3. Test: Load & Security (spin up staging environment, run tests, spin it down)
  4. Approval
    1. Spin up staging environment
    2. Run approval tests / wait for approval, providing a link so humans can look into the environment
    3. Spin down staging environment
  5. Deploy: Prod
    1. Spin up staging environment
    2. Replicate data
    3. Switch DNS from the old production environment to the new one
    4. Spin down the old production environment (this is a very simplistic solution)


One tool to look into for managing environments, and also killing them easily later, is Chef Provisioning, which can be invoked from the Jenkins command line but also saves state so the environment can be spun down later. It also follows the toolchain values we discussed before on this blog.

The four biggest issues with having static environments

An environment is a set of one or more servers, configured to host the application we are developing, and with the application already installed on them, available for either manual or automated testing.

What are static environments?

Static environments are environments that are long lived. We do not destroy the environment, but instead keep loading it with the latest and greatest version of the application.

For example, we might have the following static environments:

“DEV”, “STAGE” and “PROD”.

Each one is used for a different purpose and by different crowds. What’s common about them is that they are all long lived (sometimes for months or years), and this creates several issues for the organization:

1. Environment Rot: As time goes by, the application is continuously installed on the environments and configuration is applied to them (manually). This ongoing flux of changes to each environment leads to several problems:

a. Inconsistency between environments (false positives or negatives)

  • Any deployment or test results you get in one environment may not reflect what you actually get in production. For example, tests that pass in “MODEL” could be passing due to a specific configuration in MODEL that does not exist in other environments, meaning we’d get a false positive.
  • Bugs that happen in one of the environments might not happen in production, which is a false negative.

b. Inability to reproduce issues between environments

If a bug manifests in one environment but cannot be reproduced in another because the environments are different “pets”, diagnosing it becomes guesswork.

2. Long and costly maintenance times

Because environments are treated as “pets” (i.e., you name them, you treat them when they are sick, and each one is a unique snowflake with its ups and downs), it takes a lot of time and many manual, error-prone activities to maintain an environment and bring it back up if it crashes.

This causes a delay whenever a team needs to test the product on an environment.

3. Queuing

Because environments are costly to set up, maintain and keep running 24/7, there is only a limited number of them, and queues start to form as teams wait for environments to become available.

This queuing can also happen because multiple teams expect to use the same environment, so each team waits for the others to finish working with it before they can start.

This causes release delays.

4. Waste of money

Static environments usually run 24/7, and in a private or public cloud scenario this might mean paying per hour per machine or VM instance. However, most environments are only used during work hours, which in many organizations means up to 16 hours of idle paid time every day.


In the next post I'll cover ephemeral environments and how they solve the issues mentioned here.

Continuous Delivery Values

I mentioned in my last post about the toolchain needing to respect the continuous delivery values. What are those values? Many of them derive from lean thinking, eXtreme Programming ideas and the book "Continuous Delivery".

Michael Wagner, a colleague and mentor of mine at Dell EMC, has described them as follows.

The core value is:

Our highest priority is to satisfy the customer through early and continuous delivery of valuable software


The principles are:

  • The process for releasing/deploying software MUST be repeatable and reliable
  • Automate Everything down to bare metal
  • If something’s difficult or painful, do it more often
  • Keep EVERYTHING in source control
  • Done means “released”
  • Build quality in
  • Everyone has responsibility for the release process
  • Improve Continuously

The Practices are:

  • Build binaries (or images, containers, or any artifact that will be deployed) only once; promote many
  • Use precisely the same mechanism to deploy to every environment
  • Smoke test your deployment
  • Organizational policy is implemented as automated steps in the pipeline
  • If anything fails, stop the line


Guidelines for selecting tools for continuous delivery toolchains

If the tool you use does not support continuous delivery values, you're going to have a bad time implementing CI/CD with fully automated pipelines.

Here are some rules for the road:

  1. The first rule is: don't select your toolchain until you have designed the pipeline you want to have
  2. Every action or configuration can become code in source control, so you can version things and get an audit trail on changes
  3. Everything that can be invoked or configured has an automation endpoint through a command line or APIs
  4. Every command line returns zero for success and non-zero for failure
  5. If you have to log in to a UI to configure or execute something, you're going down the wrong path: your CI/CD tools (like Jenkins) are your users, not humans
  6. Queues should be avoidable: if the tool can only do one task at a time but multiple builds need to use it, you'll have a queue. The tool should support parallel work by multiple pipelines, or some other way of avoiding such queuing, so pipelines run faster, not slower, while still feeding all the feedback you need into the pipeline
  7. Results should be easily visible in the pipeline, or importable via API or command line: you need to see results easily in the pipeline log, so you can understand pipeline issues without going through different teams and tools
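
Rules 3 and 4 are easy to exercise from a pipeline script. This sketch shows the pattern: invoke a tool from the command line and treat any non-zero exit code as a failed step (the step names and commands are invented for illustration):

```python
import subprocess
import sys

def run_step(name, cmd):
    """Run one pipeline step; a non-zero exit code fails the whole pipeline."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"step '{name}' failed (exit code {result.returncode})")
    return result.stdout

# A step that succeeds (exit code 0), and one that fails (non-zero):
run_step("unit-tests", [sys.executable, "-c", "print('all tests passed')"])
try:
    run_step("lint", [sys.executable, "-c", "import sys; sys.exit(2)"])
except RuntimeError as err:
    print(err)  # step 'lint' failed (exit code 2)
```

A tool that only reports results through a UI, with no exit code or API, cannot participate in a loop like this, which is exactly why rule 5 matters.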

The first rule of continuous delivery toolchains

Continuous delivery transformations are hard enough in many areas already: changing people's behavior, changing policies, architecting things differently - all these things are difficult on their own.  

Having a toolchain that prevents you from building pipelines that fully support CI/CD is something you can avoid from the start.

The simple rule is:

Don't select your toolchain before you've designed the pipeline you wish to have

Don't decide you're going to use a specific tool for the new way of working before you've done at least the following:

  • Choose a project in your organization that exemplifies an "average" project (if there are many types choose two or three).
  • Do a value stream mapping of that project from conception to cash: from the point of an idea, to it getting into a backlog, to that code ending up in production. Then time the value stream and see how long each step takes.
  • Once you have a value stream, design a conceptual pipeline that automates as much of the value stream as possible (some things will not be automatable without changes to organizational policy).
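
Once the value stream is timed, a few lines of code make the waste visible. A sketch with made-up numbers (every figure below is invented for illustration):

```python
# A hypothetical value stream: (step, active_hours, waiting_hours).
value_stream = [
    ("idea to backlog",       2,  80),
    ("backlog to dev start",  1, 160),
    ("development",          40,  16),
    ("code review",           2,  24),
    ("manual test cycle",    16,  72),
    ("release approval",      1, 120),
    ("deploy to production",  4,   8),
]

active = sum(a for _, a, _ in value_stream)
waiting = sum(w for _, _, w in value_stream)
total = active + waiting
print(f"total lead time: {total}h, active: {active}h "
      f"({active / total:.0%}); waiting: {waiting}h")
```

In most value streams the surprise is exactly this: the active work is a small fraction of the calendar time, and the queues between steps dominate, which tells you where the pipeline should attack first.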

Now that you have a conceptual pipeline (or several), you can start deciding which tools best fit the actions you would like to pull off in the pipeline.

What you'll find is that some tools will not lend themselves easily into the pipeline work, and some tools will fit in just right.

Remember: the pipeline drives the tool selection. Not the other way around.

You should not let the tool dictate how your pipeline works, because you're trying to change the way the organization works, not map the organization to a tool that might hinder your progress.



The four most common root causes that slow down enterprise continuous delivery

In my journey as a developer, architect, team leader, CTO, director, coach and consultant for software development, the most common "anti-patterns" I come across in the wild, the ones that generate the most problems, are:

  1. Manual Testing
  2. Static Environments
  3. Manual configuration of those environments
  4. Organizational Policies


Manual Testing

This one seems pretty straightforward: because testing is manual, it takes a really long time, which slows down the rate of releases as everyone waits for testing to be done.

There are other side effects to this approach though:

Because testing is manual:

  • It is hard to reproduce bugs, or even to repeat the same testing steps consistently
  • It is prone to human error
  • It is boring and frustrating for the people who do it (which leads to the common feeling that 'QA is a stepping stone to becoming a developer', and we do not want that! We want testers who love their job and provide great value!)
  • The list of manual tests grows exponentially, and so does the manual work involved. As more features are added, coverage drops, since people can only spend so much time testing the release before the company goes belly up.
  • It creates a knowledge silo in the organization, a "throw it over the fence" mentality for developers who are only interested in hitting a date, not in whether the feature's quality is up to par ("not my job").
  • This knowledge silo also creates psychological "bubbles" around people in different groups that 'protect' them from information about what happens before and after they do their work. Essentially, people stop caring about, or have little knowledge of, how they contribute and where they stand in the long pipeline of delivering software to production ("Once I sign off on it, I have no idea what the next step is; I just check a box in ServiceNow and move on with my life").
  • You never have enough time to automate your tests, because you are too busy testing things manually!

So manual testing is a scaling issue, a consistency issue, and a reason people leave the company.

Static Environments

An environment is usually a collection of one or more servers sitting inside or outside the organization (private or public cloud, and sometimes just some physical machines the developers might have lying around if they get really desperate).

In the traditional software development life cycle that uses pipelines, code is built and then promoted through the environments, which progressively look more and more like production, with "Staging" usually being the final environment standing between the code and production deployment.

How code moves through the environments varies: some organizations force a merge to a special branch for each type of environment (i.e., "branch per promotion"). This is usually not recommended, because it means that for each environment the code has to be built and compiled again, which breaks a cardinal rule of CI/CD: build only once, promote many; that's how you get consistency. But that's a subject for another blog post.

More problematic is the fact that the environments are "static" in the first place. Static means they are long-lived: an environment is (hopefully) instantiated as a set of virtual machines somewhere in a cloud, and then declared "DEV" or "TEST" or "STAGE" or any other name that signifies where it sits in the organizational pipeline that leads to production. Then a specific group of people get access to it and use it to (as we said in the first item) test things manually.

But even if the testing is automated, the fact that the environment is long-lived is an issue, for multiple reasons:

  • Because they are static, there is only a fixed number of such environments (usually a very low number), which can easily cause bottlenecks in the organization: "We'd like to run our tests in this environment, but between 9 and 5 some people are using it", or "We'd like to deploy to this environment, but people are expecting this environment to have an older version".
  • Because they are static, they become "stale" and "dirty" over time: each deploy or configuration change becomes a patch on top of the existing environment, turning it into a unique "snowflake" that cannot be recreated elsewhere. This creates inconsistency between the environments, which leads to problems like "The bug appears in this environment, but not in that environment, and we have no idea why".
  • They cost a lot of money: an environment other than production is usually only utilized while people are at work; otherwise it just sits there, ticking away CPU time on Amazon, or electricity on-prem, wasting money. Imagine a fleet of 100 testing machines in an environment that are utilized in parallel during load testing, which only happens once or three times a day, while 12 hours a day those 100 machines just sit there, costing money.

So static environments are both a scale issue and a consistency issue.


Manual Configuration of environments

In many organizations the static environments are manually maintained by a team of people dedicated to the well-being of those snowflakes: they patch them, mend their wounds, open firewall ports, provide access to folders and DNS, and they know those systems well.


Manual configuration is an issue for multiple reasons:

  • Just like manual testing, it takes a lot of time to do anything useful with an environment: bringing it up and operationalizing it takes painstaking dedication over multiple days, sometimes weeks, of getting everything sorted within the organization. Simple questions come up and have to be dealt with: who pays for this machine? Who gets access? What service accounts are needed? How exposed is it? Will it contain information that might need encryption? Will compliance folks have an issue with this machine being exposed to the public internet? These and more have to be answered, usually by multiple groups of people within a large organization. The same goes for any special change to an existing environment, or, god forbid, debugging a stray application that's not working on this snowflake but does work everywhere else; getting such access is a true nightmare if you've ever been involved in such an effort.
  • Not only is it slow, it also does not scale: onboarding new projects in the organization takes a long time if they need a build server or a few "static" environments; that's days or weeks away. This slows the rate of innovation.
  • There is no consistency between environments, since everything is done manually and people find it hard to repeat things exactly. Without consistency, the feedback you get from deploying onto an environment might not be true, so the time to really know whether your feature works might be very long: maybe you'll find out only in production, a few weeks or months from now. And we all know that the longer you wait for feedback, the more it costs to fix issues.
  • Importantly, there is no record of changes made to environment configuration: all changes are manual, so there is no telling what was changed, and rolling back changes is difficult or impossible.

So manual configuration of environments hurts time to market, consistency, operational cost and the rate of innovation.


Organizational Policies

If your organization requires manual human intervention to approve everything that goes into production, this policy can cause several issues:

  • No matter how much automation you have, the policy that drives the process will force manual intervention, and automation will not be accepted (especially auto-deploy to production)
  • It will slow down time to market
  • It will make small changes take a long time, and large changes take even longer (if not the same)

Here is the guidance I usually offer in these cases:

If you are doing manual testing

you probably want to automate, but either you don't have time to automate because you are too busy doing manual tests, or you have people (not "resources", "people"!) who do not have the automation skills yet:

If you don't have time to automate

you are probably in Survival Mode: you are overcommitted. You will have to have a good, hard talk with your peers and technical leadership about changing your commitments so you get more time to automate; otherwise you are in a vicious cycle that only leads to a worse place (the less time you have to automate, the more you test manually, and so you get even less time to automate).

If your people are missing the skills:

coach them and give them time to learn those new skills (or possibly hire an expert to teach them).



If you are using static environments

- look into moving to Ephemeral environments: environments that are spun up and torn down automatically before and after a specific pipeline stage, such as testing.



If you are configuring your static environments manually:

Look into tools such as Chef, Terraform or Puppet to automate the creation of those environments in a consistent, fully automated way that also keeps configuration current. This also solves the problem of having no record of configuration changes to environments, since every change takes the form of a change to a file in source control, with the tool taking care of provisioning and configuration for you. Less human error, auditing, versioning: who wouldn't want that? It's the essence of infrastructure as code: treating our infrastructure the same way we treat our software - through code! It also gives you the ability to test your environments and configuration with tools such as Chef Compliance and InSpec.
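To make the desired-state idea concrete, here is a toy Python sketch - purely illustrative, not how Terraform, Chef or Puppet are actually implemented, and the resource names are invented. The configuration lives in a versioned file; the tool computes only the changes needed to converge the environment, so re-running it is safe:

```python
def plan(desired, current):
    """Return the changes needed to converge an environment's current
    state to the desired configuration (kept in source control).
    An empty plan means the environment already matches: idempotent runs."""
    return {key: value for key, value in desired.items()
            if current.get(key) != value}

# Hypothetical environment description - the keys are invented examples.
desired = {"nginx_version": "1.24", "open_ports": [80, 443]}
current = {"nginx_version": "1.18", "open_ports": [80, 443]}

print(plan(desired, current))  # {'nginx_version': '1.24'}
```

Because the `desired` dict would live in source control, every change to it is recorded, reviewable, and revertible - which is exactly the audit trail manual configuration lacks.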



If you have organizational policies that prevent automated deployment:

  • Look into creating a specialized, cross-functional team of people within the organization that is tasked with creating a new policy allowing some categories of application changes to be "pre-approved".
  • This enables pipelines to deploy changes that fall into this category directly to production, without going through a very complicated approval process that takes a long time.
  • For other types of changes, automate as many of the compliance, security and approval checks as you can (yes, even ServiceNow has APIs you can use - did you know?), and then put them into a special approval pipeline that removes as much of the manual burden from the change committees as possible. Their meetings can then become more frequent and faster: "We want to make this change; it has already passed automated compliance and security tests in a staging environment - are we good to go?"
  • For changes that affect other systems in the organization, you can look into creating a special trigger for those external pipelines, so they get deployed together with your new system, in parallel, to a test environment; then make sure tests are run on the dependent system to see if you broke it. Pipelines that trigger other pipelines can be an ultimate weapon for detecting cross-company changes with many dependencies. I will expand on this in a later post.








    DevOps Metric: Number of Defects and Issues: Definition

    • Important to: VP of QA, Operations
    • Definition: How many bugs or tickets are found in production, how many bugs recur in production, and how many deployment issues are encountered during release.
    • How to measure: tickets from customers and support records. For deployment issues: post-release retrospectives.
    • Expected Outcome: The number of deployment issues, defects and recurring defects should decrease as DevOps maturity grows.
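As a rough sketch of how this could be measured automatically, here is a minimal Python example; the ticket shape and the `issue_id` field are assumptions, and real data would come from your ticketing or support system:

```python
from collections import Counter

def defect_metrics(tickets):
    """Summarize production tickets: total defect reports, and how many
    distinct issues recurred (were reported more than once)."""
    counts = Counter(t["issue_id"] for t in tickets)
    return {
        "total": sum(counts.values()),
        "recurring_issues": sum(1 for c in counts.values() if c > 1),
    }

# Invented sample data standing in for a support-system export:
tickets = [
    {"issue_id": "LOGIN-1"},
    {"issue_id": "LOGIN-1"},   # same bug reported twice -> a recurring defect
    {"issue_id": "PAY-7"},
]
print(defect_metrics(tickets))  # {'total': 3, 'recurring_issues': 1}
```

Tracked per release over time, both numbers should trend down as maturity grows.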

    Many organizations make an almost-conscious choice: go fast, or build with high quality. Usually they lose on both ends.

    Measuring defects and deployment issues is a good gauge of DevOps maturity. You should become faster while not compromising on quality, because quality gates are fully automated in the form of tests at multiple steps, instead of manual, error-prone steps.

    The same applies to error-prone deployments: they almost stop happening, and when they do, the deployment script itself is fixed and can be tested before being used to deploy to production - just like a regular software development life cycle. The ROI is huge.

    DevOps Metric: Frequency of Releases Definition

    • Important to: Release Manager, CIO
    • Definition: How often an official release is deployed to production, into paying customers' hands
    • How to measure: If there is no pipeline, use the release schedule; otherwise, use history data from the pipeline's production deploy step.
    • Expected outcome: The time between releases should become shorter and shorter (in some cases by an order of magnitude) as DevOps maturity grows.
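Once you have pipeline history, the measurement itself is simple. A minimal sketch (the deploy dates are invented):

```python
from datetime import date

def mean_days_between_releases(deploy_dates):
    """Average number of days between consecutive production deploys,
    taken from the pipeline's production-deploy step history."""
    ds = sorted(deploy_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ds, ds[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical deploy history exported from the pipeline:
history = [date(2023, 1, 1), date(2023, 1, 15), date(2023, 2, 1)]
print(mean_days_between_releases(history))  # 15.5
```

Watching this number shrink release after release is the simplest visible sign of growing maturity.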

    It's one thing for the IT of an organization to create pipelines so fast that they can release at will. The other side of that coin is that of the business. 

    If IT is no longer a bottleneck, it is the business' responsibility to decide *when* to release new versions. 

    In traditional organizations IT is usually too slow to supply and deploy releases as fast as the business requires, so the business is forced to operate at the speed of IT, which might mean 4 releases a month, 4 releases a year, or even fewer.

    In a DevOps culture, every code check-in (a 'commit') is a potentially releasable product version, assuming it has passed the whole continuous delivery pipeline all the way up to the staging servers.

    We can use the "Frequency of Releases" metric to decide how many releases we want to do per year, month or week, and then measure our progress.

    As IT becomes faster, we can cut down our scheduled release time. As the pipeline supports more and more automated policy and compliance checks as feedback cycles, more time is removed from manual processes and potential releases become easier to achieve.

    The question "How often should we pull the trigger and deploy the latest potential release that has passed through the pipeline?" shifts from being IT's problem to being owned solely by business stakeholders.

    As release schedules become more frequent, the organization becomes more competitive in the marketplace.

    In some organizations (Netflix and Amazon are good examples), the decision on a release schedule is removed completely: the pipeline's "flow" of releases, deployed automatically, determines the release frequency, which might be as often as every minute or even more often.

    DevOps Metric: Mean time to recovery (MTTR) Definition and reasoning

    • Important to: VP Operations, CTO
    • Definition: When a production failure occurs, how long does it take to recover from the issue?
    • How to measure: This differs between systems. A common metric is the average production downtime over the last ten downtimes.
    • Expected outcome: MTTR should become lower and lower as DevOps maturity grows.
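A minimal sketch of the "average over the last ten downtimes" measurement (the outage durations below are invented):

```python
def mttr_minutes(downtimes, window=10):
    """Average recovery time over the last `window` production downtimes.
    Each entry is one outage's duration in minutes, oldest first."""
    recent = downtimes[-window:]
    return sum(recent) / len(recent)

# Hypothetical outage log, newest last - note the improving trend:
outages = [120, 90, 45, 30, 30, 20, 15, 10, 8, 5, 4]
print(mttr_minutes(outages))  # 25.7
```

A rolling window matters here: it rewards recent improvement instead of letting one ancient day-long outage dominate the average forever.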

    MTTR vs Mean time to Failure
    To understand MTTR, we have to first understand its evil brother: MTTF or "Mean time to failure" which is used by many organizations' operational IT departments today.

    MTTF means "How much time passes between failures of my system in production?".

    Why is MTTR more valuable?

    There are two arguments in favor of making MTTR a higher priority than MTTF (we don't want to ignore that we fail often, but it's not as important as MTTR).

    Consumer Perception.
    If a service was down once every three years, but it took a whole day to recover, consumers won't care that the issue hadn't happened for three years. All anyone will talk about is the long recovery time. But if the service was down three times a day for less than a second each time, it would barely be noticeable.
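Back-of-the-envelope arithmetic makes the point, assuming a one-second blip each time:

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Average seconds of downtime per year for the two hypothetical systems:
rare_but_slow = SECONDS_PER_DAY / 3   # one day-long outage every three years
often_but_fast = 3 * 365 * 1          # three one-second blips per day

print(rare_but_slow)   # 28800.0 -> about 8 hours of downtime a year
print(often_but_fast)  # 1095    -> about 18 minutes of downtime a year
```

The "unreliable" system that fails constantly but recovers instantly accrues a small fraction of the downtime of the "reliable" one - and, unlike the day-long outage, none of it is visible to users.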

    Wrong Incentives.
    Let's consider the incentive MTTF creates for an operations department: the less often failure happens in the first place, the more stable your system is, and the bigger the bonus in your paycheck at the end of the year.

    What's the best way to keep a system stable? Don't touch it!

    Avoid any changes to the system; release as little as possible, in a very controlled, waterfall-ish manner, making releasing so painful that only someone who really, really wants to release will go through all the trouble of doing so.

    Sound familiar?

    This "Stability above all else" behavior goes exactly against the common theme of DevOps: release continuously and seize opportunities as soon as you can.

    The main culprit in breeding this anti-agile behavior is a systematic influence problem: what we measure pushes people into behavior that hurts the organization. (You can read more about this in my book "Elastic Leadership," in the chapter about "Influence Forces"; there is also a blog post that talks about them in more detail.)

    Developers can't rest on their laurels and claim that operations are the only ones to blame for slowing down the continuous delivery train. Developers have their own version of "stability above all else" behavior, which can often be seen in their reluctance to merge their changes into the main branch of source control (where the build pipelines get their main input).

    Ask developers if they'd like to code directly on the main branch, and many in enterprise situations will tell you they'd be afraid to, for fear of "breaking the build". Developers are trying to keep the main branch as "stable" as possible so that the version going off to release (in itself a long and arduous process, as we saw before) has no reason to come back with quality issues.

    "Breaking the build" is the developer's version of "mean time to failure", and again, the incentives from management are usually the culprit. If middle managers tell their developers it's wrong to break a build, they are driving exactly the same fear operations have. The realization that "builds are made to be broken" is a bit tough to swallow for developers who fear they will hold up the entire pipeline and all the other teams that depend on them.

    Again, the same thought here applies: Failure is going to happen, so focus on the recovery aspect: how long does it take to create a fix for something that stops the build? If you have automated unit tests, acceptance tests and environments, and you're doing test-driven development, fixing an issue that stops the pipeline can usually be a matter of minutes: code the fix, along with tests, see that you didn't break anything, and check it into the main branch.

    Both operations and development have the same fear: don't rock the boat. How does that fit with seizing opportunities as quickly as possible? How does it support making "mean time to change" as fast as possible?

    My answer is that measuring and rewarding MTTF above all else absolutely does not support a more agile organization. That fear of "breaking the build" and "keeping the system stable" is one of the reasons many organizations fail at adopting agile processes. They try to force "release often" down the throat of an organization that is measured and rewarded when everything stays stable, instead of being rewarded and measured on how often it can change and how fast it can recover when a failure occurs.

    MTTR is a very interesting case of DevOps culture taking something well established and turning it on its head. If DevOps is about build, measure, learn, and fast feedback cycles, then it should become an undeniable truth that "whatever can go wrong, will go wrong". If you embrace that mantra, you can start measuring "mean time to recovery" instead.

    The MTTR incentive can drive the following behaviors:

    • Build resilience into the operations systems as well as the code.
    • Build faster feedback mechanisms into the pipeline, so that a fix can go through it faster.
    • Create a system and code architecture that minimizes dependencies between teams, systems and products, so failures do not propagate easily and deployments can be faster and partial.
    • Add and start using better logging and monitoring systems.
    • Create a pipeline that drives deployment of an application fix into production as quickly as possible.
    • Make it just as easy and fast to deploy a fix as it is to roll back a version.

    With mean time to recovery, you're incentivizing people based on how much they contribute to the "concept to cash" value stream, or the "mean time to change" number.

    DevOps Metric Definition: Mean time to change (MTTC) (vs change lead time)

    Also called "time from concept to cash".

    Important to: Everyone, especially the CEO, CIO and CTO.

    Definition: How long does it take an average new feature, idea, fix or any other kind of change to get into a paying customer's hands in production, from the moment of its inception in someone's mind? MTTC is how long it takes you from the moment you see an opportunity until you can actually utilize it. The faster MTTC is, the faster you can react to market changes.

    How to measure:
    We start counting from the moment of the change's inception in someone's head (imagine a marketing person coming up with an idea to counter a competitor's product, or a bug being reported by a customer).

    One way to capture and measure mean time to change is a value stream mapping exercise, as we will touch on in a later chapter of this book.

    Expected Outcome:  MTTC should become shorter and shorter as DevOps maturity grows. 
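Once both timestamps are tracked per change, the measurement itself is a simple average. A minimal sketch (the field names and dates are invented):

```python
from datetime import date

def mttc_days(changes):
    """Mean time to change: average days from an idea's inception to its
    production deploy, across a set of tracked changes."""
    spans = [(c["in_production"] - c["conceived"]).days for c in changes]
    return sum(spans) / len(spans)

# Hypothetical tracked changes - one feature idea, one customer-reported bug:
changes = [
    {"conceived": date(2023, 1, 1), "in_production": date(2023, 3, 1)},   # 59 days
    {"conceived": date(2023, 2, 1), "in_production": date(2023, 2, 15)},  # 14 days
]
print(mttc_days(changes))  # 36.5
```

The hard part is not the arithmetic but capturing the "conceived" timestamp, since it predates any ticket in the development tracker - which is exactly why a value stream exercise is needed.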

    Common Misunderstandings: 
    MTTC is not the same as the often cited "Change lead time" as proposed in multiple online publications.

    Change lead time (at least as far as I could see) only counts the time from the start of a feature's development, when real coding begins.

    MTTC measures everything that leads up to the coding as well, which might include design reviews, change committees, budgeting meetings, resource scheduling and everything else that stands in the way of an idea as it makes its way to the development team, and onward through to production and the customer.

    From a CEO, CIO and CTO view , MTTC is one of the most important key metrics to capture. Unfortunately, many organizations today do not measure this.

    In many companies I've worked with, MTTC was anywhere from one month to twelve months.

    What is Enterprise DevOps?

    To understand Enterprise DevOps, we have to first define what "regular" DevOps is. Here is my current definition of DevOps. Let's call it "DevOps 0.1"

    What is DevOps? (Version 0.1)

    DevOps is a culture that drives and enables the following behaviors:

    • Treating Infrastructure and code as the same thing: Software
    • That Software is built and deployed to achieve fast feedback cycles
    • Those feedback cycles are achieved through continuous delivery pipelines
    • Those pipelines contain the automated coded policy of the organization 
    • Those pipelines are as human free as possible
    • Humans share knowledge and work creatively with work centered around pipelines 
    • Knowledge silos are broken in favor of supporting a continuous pipeline flow

    DevOps is the Agile operating model of modern IT, and is centered around the idea of a continuous delivery pipeline as the main building block of an Agile organization.

    DevOps Metrics

    DevOps is based on lean development concepts, as well as the ideas of the lean startup movement. As such, the "build, measure, learn" mantra of the scientific method is very relevant to DevOps.

    That's why metrics play a very important role in a DevOps culture. They help determine gradual success over time in DevOps transformations. Here are the main metrics being used in DevOps based organizations today:

    1. Mean time to change

    2. Mean time to recovery

    3. Frequency of Release

    4. Defect Rates

    What is Enterprise DevOps?

    Enterprise DevOps is the application of DevOps values in an environment that contains any of the following:

    • Many inter-dependent and related systems and sub systems, software and teams that rely on each other
    • Monolithic systems and/or static software/hardware environments
    • Lengthy Approval Gates or change control processes
    • Security, financial, medical or other compliance requirements
    • Lengthy waterfall-based processes among multiple teams and stakeholders
    • Workflows that are mostly manual and error prone across multiple teams or stakeholders

    All this in order to create a pipeline-centered organization that is able to seize opportunities in the marketplace quickly.

    DevOps is...

    Infrastructure & code, treated as software, where...

    that software is created with fast feedback cycles, where...

    those pipelines do not require monkey work by people, where...

    people share knowledge and work creatively, with repetitive work automated.



    BBBCTC - Branching, Binaries, Builds, Coding and Testing Culture

    As part of a major drive toward faster "time to main", you might need to change things in the following five areas, which all affect how long it takes people to get their code checked into "trunk":


    Branching Culture

    Branches mean merges, and merges take time. Branching culture in many places also means "only merge when you feel you won't break things for anyone else", which effectively means many places use branches to "promote" code to the next stage of stability. This in turn means people might spend a lot of time stabilizing code "for others" before moving it to the next branch (really, system-testing the code locally).

    Fewer branches - less time to main!

    Binaries Sharing Culture

    When working with multiple teams that depend on each other's outputs, moving binaries between the teams becomes a time consuming challenge. I've seen teams use shared folders, emails and even Sharepoint pages to share their binaries across development groups.

    Find a simple way for teams to share and consume binaries easily with each other, and you've saved a world of hurt and time for everyone involved (for example: dependency stash).



    Builds Culture

    If you're going to reduce the number of branches, you will probably need gated builds to make sure people don't check in code that breaks compilation on the main branch (trunk). This would be the first level of builds (a gated build per module); the second level would possibly be an hourly (or faster) integration build that also runs full system-wide smoke tests and other types of tests.

    Also, if an integration build does break, you will need to make fixing the red build the highest priority for those who broke it, or those who have to adapt, before they move on to new features.

    One more: you will probably need a cross-group build triage team that notices red builds as they occur, can find out what is causing a red build, and knows where to redirect the request-to-fix.

    Testing Culture

    You will need to automate a bunch of the manual testing many groups do, so it can run in a gated build, and as the smoke and system tests at the integration build level.

    Coding culture

    Developers will need to learn and apply feature toggles (code switches), or learn how to truly work incrementally, to support new features on the main trunk that take longer than a day or so to build, without breaking everyone's code.

    There are two types of breakages that can occur on the main trunk (assuming gated builds took care of compile and module level unit and smoke tests):

    • Unintended breakages: these can usually be fixed easily, directly on the trunk, by the group that broke them or by another group that needs to adapt to the new code.
    • Intended breakages. These can be broken up into:
    • Short-lived intended breakages: group A has a new API change; they check it into main and expect group B to get it from main, adapt to it with a simple code change, and check their stuff back into main. This should usually be about an hour's work.
    • Long-lived intended breakages: you're working on a feature for the next month or so. In that case you either work fully incrementally on that new feature, or you use feature toggles or "branch by abstraction".

    Either way we are talking about teaching developers a new culture of coding.

    Without this change in coding culture, developers will always be afraid of working directly on main, and you'll be back to private branches with everything they entail (a.k.a. "long time to main").


    Time to Main - Measuring Continuous Integration

    A.k.a "Time to Trunk".

    How do you measure your progress on the way to continuous integration (we're not even discussing delivery yet)? How do you know you are progressing toward the goal incrementally?

    Assuming you're in an enterprise situation, showing a KPI over time can help drive a goal forward.

    One of the KPIs I've used on a project was "Time to Main". 

    Let's start with the basic situation we'd like to fix: multiple teams with a bad architecture that causes many dependencies between the teams. See the chart below:

    In the chart above we have four teams and one main branch. Each team also has its own private branch structure. A usual cycle of creating an application version is:

    New NEO binaries are distributed to the other teams. HAMMER is adapted and then sent to the ORCA team. The ORCA team produces binaries for the DRAGON team, which then produces ORCA+HAMMER+DRAGON components for the ADAM team.

    Finally, they all merge to the main branch, in specific order.

    Obviously, not the best situation in the world. There are many things we can do to make things better, but first, we need a way to measure that we ARE making things better. 

    One measure I like to use is "Time to Main": how long does it take for a FULL version of all components that work with each other to get from developers' hands to MAIN?

    In our case the answer starts with:

    1 day HAMMER work + 3 days ORCA work + 3 days DRAGON work + 5 days ADAM work = 12 days to Main.

    But wait , there's more.

    Many of the back-and-forth lines are actually branch-merging actions. For large code bases, add a few hours per merge, which can add up to 2-3 days if not more. The total could be 14-16 days to Main, starting with team NEO publishing a new feature (in this case NEO is not even in the same source control, as they are a different company).

    We are assuming that the timings mentioned here are "friction time" only: time that is not used for code development, but the absolutely necessary time developers need to feel confident merging to MAIN. This might mean running a bunch of manual or automated tests, debugging, installs - anything. It is time that will be spent whether or not major functionality is being introduced, because the integration with the other components has a cost.
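The arithmetic above can be captured in a few lines, so you can replay it with your own numbers (the merge count and hours-per-merge below are invented, standing in for the "2-3 days of merges" estimate):

```python
def time_to_main(friction_days, merge_count, hours_per_merge, hours_per_day=8):
    """Total 'Time to Main' in working days: per-team friction time
    plus branch-merge overhead, converted from hours to days."""
    return sum(friction_days) + merge_count * hours_per_merge / hours_per_day

# HAMMER (1) + ORCA (3) + DRAGON (3) + ADAM (5) friction from the example,
# plus an assumed 8 merges at 2 hours each:
print(time_to_main([1, 3, 3, 5], merge_count=8, hours_per_merge=2))  # 14.0
```

The value of writing it down this way is that every improvement idea (fewer branches, faster merges, less manual testing) maps onto one input, and you can see its effect on the single number you actually care about.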

    There are many things we can choose to do here:

    • Ask all teams to work directly on main
    • Create a "Dependency cache" from the continuous build on "Main" so everyone can access the binaries.
    • Reduce manual testing with automation
    • Etc..

    No matter what we do, we can always ask ourselves "How does this affect time to main?" We can always measure that time, and see if we are actually helping the system, or just implementing local efficiencies while Time to Main remains the same, or even increases.

    From the Theory of Constraints, we can borrow the "Constraint" term. Time to main is our Constraint of actually having a shippable product increment we can demo, install or test at the system integration level. The more teams choose to use internal branches, the more time code has to wait until it sees "Main". Now we have a number that tells us if we are getting better or worse in our quest for continuous integration.

    Pipeline Disintegration (Post Build Pipeline)

    Other Names:

    Post Build Hand off, Time Capsule, Strangler Application


    It is very complicated to add features to the current build process, because of current tooling deficiencies or process bureaucracy, or because the responsible team is reluctant to add those features for their own reasons.


    The current build process blocks you from adding features required for the business to succeed, such as increasing feedback, surfacing more information, or adding actions that ease manual work.

    Continuing to bang on that specific door might be a Sisyphean task, and spending too much time on it would be very unproductive and ineffective.


    You'd like to add more feedback sources to the build, but that might interfere with the owning team's current processes or areas of responsibility - or, in regulated environments, even with traceability, documentation and workload.

    You and other stakeholders believe that adding those features will increase feedback, quality or other benefits, but tacking new features onto the current build seems to step on everyone's toes.


    Create a new pipeline that receives a Time Capsule from the current build process. This Time Capsule contains the end artifacts of the build process (usually the "release" binaries and other supporting files, and possibly all the source code).

    Now that the time capsule is in your possession, you can put it through a new, separate pipeline that adds the steps that you might care about. Examples might include:

    • Static code analysis
    • Running unit tests (if those couldn't be run in the original build)
    • Deployments

    A more explicit example:

    1. Make sure the "old" build copies, or "drops", all related artifacts into a shared folder that is accessible to the new pipeline. This should happen automatically every time the old build runs, which hopefully is once a day or more.
    2. Install an instance of a CI server of some sort (let's say TeamCity for the sake of argument), and create a new project for the new pipeline. This will hopefully be on a separate server, so as not to interfere with the old build's required hardware and software resources.
    3. The first action of the new pipeline, instead of a source-control checkout, is copying the time capsule binaries onto the local build agent invoked by the new CI tool during the run.
    4. Next steps might then include running unit tests, static code analysis or anything else your heart desires, without interfering with the old build structure.
    5. Notify stakeholders if the new pipeline fails.
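Step 3 is the unusual one, so here is a minimal Python sketch of it: the new pipeline's first action copies the dropped artifacts instead of doing a source-control checkout. The folder layout is invented, and the demo uses throwaway temp folders standing in for the real drop share:

```python
import shutil
import tempfile
from pathlib import Path

def fetch_time_capsule(drop, workspace):
    """First step of the new pipeline: copy the old build's dropped
    artifacts (the 'time capsule') into the agent's workspace,
    replacing any previous copy, and list what arrived."""
    workspace = Path(workspace)
    if workspace.exists():
        shutil.rmtree(workspace)
    shutil.copytree(drop, workspace)
    return sorted(p.name for p in workspace.iterdir())

# Demo: fake a drop folder the way the "old" build would populate it.
drop = Path(tempfile.mkdtemp())
(drop / "release.zip").write_text("binaries")
(drop / "symbols.pdb").write_text("debug info")

copied = fetch_time_capsule(drop, Path(tempfile.mkdtemp()) / "capsule")
print(copied)  # ['release.zip', 'symbols.pdb']
```

From here, the later pipeline steps (unit tests, static analysis, deployment) simply operate on the workspace folder, never touching the old build.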

    Split and Parallelize

    Other Names:

    • Split Work


    • The build is taking too long


    • One or more tasks or build steps that comprise many actions are taking too much time. For example: running 1000 regression tests (each taking 1-4 minutes) takes 24 hours, and this slows down the feedback rate from the build.


    Split the step into several parallel-running steps, each running part of the task. Theoretically, the whole step will then take no longer than the longest-running of the split parts.

    The solution requires:

    • multiple build agents, or workers, able to run things in parallel
    • split parts that are somewhat similar in size, to gain a real speed improvement
    • at least as many split parts as there are agents available to run them


    Say we have 1000 regression tests and one build step called "Run Regression Tests".

    Step 1: Split the tests to runnable parts

    We split the regression tests into separate runnable "ranges" or categories that can be executed separately from the command line. There are many ways to do this: split the tests into multiple assemblies with names ending in running numbers, put separate "categories" on different tests so you can tell the test runner to run only a specific category, and more.

    It is important that the sizes of the "chunks" you split into are somewhat similar, or the build speed gains will be suboptimal. In our case, we have 1000 tests and 10 agents the test chunks can run on, so we create 10 test categories named "chunk1", "chunk2", and so on.

    An even better scheme is to split the tests across 10 different logical areas. Say we are testing our product in 10 different languages, with approximately 100 tests per language: it would be perfect to split the tests by language and name each chunk after its language. Later on, this also makes things more readable at the build server level. If you can only find 2 or 3 logical categories for the tests (say "db tests", "ui tests" and "perf tests") and you have more than 3 agents, you might want to split each category into chunks to make things faster - for example "ui-chunk1", "ui-chunk2" - until you reach at least the number of agents you have.
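When there is no natural logical split, a simple round-robin deal gives near-equal chunks. A minimal Python sketch (the chunk and test names are invented):

```python
def split_into_chunks(tests, agent_count):
    """Deal tests round-robin into one chunk per build agent, so the
    chunks come out nearly equal in size even for odd totals."""
    return {f"chunk{i + 1}": tests[i::agent_count] for i in range(agent_count)}

# 1000 hypothetical tests across 10 agents -> 10 chunks of 100 tests each:
tests = [f"test_{i:04d}" for i in range(1000)]
chunks = split_into_chunks(tests, 10)
print(len(chunks), len(chunks["chunk1"]))  # 10 100
```

Each chunk name then maps to one test category (or one command-line filter) that a single build agent runs.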

    Step 2: Create a parallel test run hierarchy in the CI server

    If you're using TeamCity, you would now create the following: a new sub-project called "run regression tests", with 10 steps (assuming you have 10 build agents that can run them), one for each test chunk:

    • Run UI Tests Chunk 1 (A command line step calling the test runner command with the name of the category to be run)
    • Run UI Tests Chunk 2 (same command line, different category)
    • Run UI Tests Chunk 3
    • Run DB Tests Chunk 1
    • ...
    • Trigger all Regression Tests

    Note the last build step. It is merely an empty build step with a snapshot dependency on all the listed "chunk" steps. When triggered, it should (assuming you have multiple agents enabled and ready to run these) run all the chunks in parallel, so the whole run finishes as quickly as the longest-running chunk.

    Irrelevant Build


    • The team doesn't seem to care that one of the builds is failing
    • A build is red, but might not actually be failing. Only one person, from a different team or role, can tell whether it actually failed.


    One possible cause might be that the failing build is simply not relevant to the current team, or that the build result is not understandable (see "binary result" in that case).


    Remove the failing build from the visualization screen visible to the whole team. If you keep the build there, the team will eventually stop caring about "seeing red" on the screen. The red color has to mean something.


    In one of the projects I was involved with, there was a big screen up on the wall that showed the status of all the builds. Some of the builds were green, but a couple of builds related to QA regressions and UI testing were practically always red. Because of a lack of good communication between the QA and dev teams, none of the developers really understood what was wrong with that build, or even cared.

    What's more, the QA lead would later tell me that a red build in QA did not necessarily mean a failed build. Since the UI tests were very fragile, a statistical analysis of sorts was used: if X builds out of a certain number passed, it was considered a success.

    The problem was that the developers were so used to seeing red that their automatic reaction was "not my problem". The first thing I did was remove the always-failing builds from the dev wall, so that all the visualized builds would be relevant to the people watching them. Suddenly there was a green hue in the room, as all the builds were passing.

    Developers would go by and ask me in amusement, "What is that weird color?" and "What's wrong with the build? It's green!" Then, when builds failed due to dev issues, developers had more reason to care.

    I would only show the QA builds again if they were more stable or if there was better cross team communication so that devs could help QA fix the builds. As long as there was nothing they could do about it, they should only see red when it pertains to work they can control.