DevOps Metrics – What are they good for?

October 31, 2022

Why do we need DevOps Metrics and what are they good for?

In the past, there have been several attempts at measuring things such as developer productivity, notably “lines of code per day”, that didn’t really work and/or weren’t backed by research. There was anecdotal evidence that “Agile” and “Lean” methods were yielding better results than “Waterfall”, but no hard data.

This all changed in 2018, when the book “Accelerate” (ISBN 978-1-942788-33-1) was published. Authors Jez Humble and Gene Kim were already known for their work in introducing modern methods to the software lifecycle, and the research of their co-author Nicole Forsgren backed their knowledge up with data.

Process

For now, we will just focus on Software Delivery Performance, because it yields good indicators and is quite easy to measure. In their book, the authors laid out 4 basic metrics that will tell you whether your team is top-performing or lagging behind with regard to Software Delivery. These 4 metrics are:

DORA Metrics

Lead time to change: How long does it take from “Commit” to “Deployed into Production”?

Deployment Frequency: How often is an application deployed into production?

Change failure rate: The percentage of deployments requiring rollbacks and/or fixes.

Mean time to restore: How long does it take to restore after a failure happens in production?

Let’s take a look at these metrics in detail:

Lead time to Change

When you already know how to fix a certain issue or implement a certain feature, this measures the time from the commit into your source code repository until your code is running in production. In a lot of organizations this will differ between new features, regular bug fixes and hotfix patches. Whenever I have worked with customers on these metrics, my primary interest was the shortest possible path, usually the one for hotfix patches. When you optimize this path first, you have probably optimized the other two as well. If there is no difference for your particular organization, all the better. One of the initial questions I ask about this metric is: “How long does it take for you from Commit to Deploy? 5 minutes, 5 days, 5 weeks or 5 months?” Once we have agreed on a category for the existing process, we can start talking about how much effort is required.
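As a minimal sketch of the measurement described above, the lead time of a single change is the difference between its commit timestamp and its production deployment timestamp. The function names and the sample data below are hypothetical; in practice the timestamps would come from your source code repository and your deployment tooling. The median is used here as one reasonable way to summarize several changes, not as the method prescribed by the book.

```python
from datetime import datetime, timedelta
from statistics import median


def lead_time(commit_time: datetime, deploy_time: datetime) -> timedelta:
    """Time from commit until the change is running in production."""
    if deploy_time < commit_time:
        raise ValueError("a deployment cannot precede its commit")
    return deploy_time - commit_time


def median_lead_time(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Median lead time over (commit_time, deploy_time) pairs."""
    seconds = [lead_time(c, d).total_seconds() for c, d in changes]
    return timedelta(seconds=median(seconds))


# Hypothetical sample data: (commit, deployed into production)
changes = [
    (datetime(2022, 10, 3, 9, 0), datetime(2022, 10, 3, 9, 30)),    # 30 minutes
    (datetime(2022, 10, 4, 14, 0), datetime(2022, 10, 6, 14, 0)),   # 2 days
    (datetime(2022, 10, 10, 8, 0), datetime(2022, 10, 10, 12, 0)),  # 4 hours
]

print(median_lead_time(changes))  # 4:00:00
```

If hotfixes, bug fixes and features take different paths in your organization, it can help to compute this value once per path so the shortest one stands out.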

Deployment Frequency

Another important factor is how often you deploy into production, because only code that is running in production generates value. So my question is: “How often do you deploy into production? Several times per day, every other day, every couple of weeks or every couple of months?” If you do 2 deployments per year, usually a lot of manual work is required for the deployment to really fly. This also usually involves work on weekends and/or in the evenings and is a lot of stress for everybody involved. That is no longer feasible when you want to deploy into production several times per day, or at least have the technical ability to do so. Going from twice a year to several times per day involves a great deal of technological and organizational change. A great way to start is to automate the existing process without optimizing it. Once everything is automated, you can start optimizing for speed.

There are industries that require e.g. burn-in tests of a certain minimum duration, so you have to think about running those tests in an interleaved or parallel fashion. When you work with hardware, you might want to run preliminary tests against simulated hardware, which scales more easily and usually yields results faster (Software-in-the-Loop vs. Hardware-in-the-Loop).
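To put a number on “how often”, one simple approach is to count production deployments per calendar week. The sketch below assumes you can export a list of deployment dates from your pipeline or ticketing system; the function names and the sample dates are made up for illustration.

```python
from collections import Counter
from datetime import date


def deployments_per_week(deploy_dates: list[date]) -> dict[tuple[int, int], int]:
    """Count production deployments per ISO (year, week)."""
    counts = Counter()
    for day in deploy_dates:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def average_per_week(deploy_dates: list[date], start: date, end: date) -> float:
    """Average deployments per week within the window start..end (inclusive)."""
    days = (end - start).days + 1
    if days <= 0:
        raise ValueError("end must not be before start")
    return len(deploy_dates) / (days / 7)


# Hypothetical sample data: three production deployments in two weeks
deploys = [date(2022, 10, 3), date(2022, 10, 5), date(2022, 10, 12)]

print(deployments_per_week(deploys))
print(average_per_week(deploys, date(2022, 10, 3), date(2022, 10, 16)))  # 1.5
```

Weeks without any deployment do not appear in the per-week counts, which is why the average is computed over an explicit observation window instead.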

Change Failure Rate

This is simply the percentage of deployments into production where you had to roll back or manually fix things. When your deployment is associated with a ticket in your ticketing system, you can check for error tickets related to your deployment ticket.
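The ticket-based approach described above can be sketched as follows, assuming each deployment ticket carries a list of linked error tickets. The `Deployment` class, its field names and the ticket IDs are hypothetical and do not correspond to any particular ticketing system.

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:
    ticket_id: str
    # IDs of error tickets linked to this deployment (empty if none)
    error_tickets: list[str] = field(default_factory=list)


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Percentage of deployments with at least one linked error ticket."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failed = sum(1 for d in deployments if d.error_tickets)
    return 100.0 * failed / len(deployments)


# Hypothetical sample data: 4 deployments, 1 of them needed a fix
deployments = [
    Deployment("DEP-1"),
    Deployment("DEP-2", error_tickets=["ERR-7"]),
    Deployment("DEP-3"),
    Deployment("DEP-4"),
]

print(change_failure_rate(deployments))  # 25.0
```

A deployment with several linked error tickets still counts as one failed deployment, since the metric is about deployments, not about tickets.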

Mean Time to Restore

This is the time you need to recover from a failure happening in production. In cases where you already create an error ticket for these occurrences, you can just measure the time from ticket creation to ticket completion.
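Assuming error tickets record a creation and a completion timestamp as described above, the mean time to restore is the average of those durations. This is a minimal sketch with hypothetical names and sample data.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average duration over (ticket_created, ticket_completed) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (completed - created for created, completed in incidents),
        timedelta(),  # start value, so that timedeltas can be summed
    )
    return total / len(incidents)


# Hypothetical sample data: one incident took 1 hour, another 3 hours
incidents = [
    (datetime(2022, 10, 3, 10, 0), datetime(2022, 10, 3, 11, 0)),
    (datetime(2022, 10, 17, 22, 0), datetime(2022, 10, 18, 1, 0)),
]

print(mean_time_to_restore(incidents))  # 2:00:00
```

Note that this measures ticket handling time, which only approximates the real outage if tickets are opened and closed promptly.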

Summary

It is important to note that optimizing for just one of those metrics will not move you forward in your journey to success; it is the combination of all 4 metrics that defines top-performing organizations.

Are there other metrics? Of course. Depending on your particular needs or your business model, these might include the likes of Availability, Value Flow or Business Value. These are all valid metrics, but they will suffer in any case if your Software Delivery Performance is not up to speed.

The baseline

It is important to take baseline measurements of the existing process at the beginning, before we implement any changes, because we need them for comparison with later values. Then we enter a cycle where we:

Collect – measure the baseline

Assess – analyze the current state and identify the parts we want to improve

Execute – define a change focusing on a specific outcome

Remeasure – measure to validate whether the change brought the intended result or not

Measurement Loops

If a change worked, we continue down that path; if it didn’t work, we revert the change and try a different approach.
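The remeasure step can be sketched as a comparison of each metric against its baseline, keeping in mind that for three of the four metrics a lower value is better, while for deployment frequency a higher value is better. The metric names, units and numbers below are assumptions chosen for illustration.

```python
# True means a lower value is better. Metric names and units are hypothetical.
LOWER_IS_BETTER = {
    "lead_time_hours": True,
    "deployments_per_week": False,
    "change_failure_rate_pct": True,
    "time_to_restore_hours": True,
}


def delta(baseline: dict[str, float], current: dict[str, float]) -> dict[str, float]:
    """Difference between the remeasured value and the baseline, per metric."""
    return {name: current[name] - baseline[name] for name in baseline}


def improved(baseline: dict[str, float], current: dict[str, float]) -> dict[str, bool]:
    """Whether each metric moved in the desired direction."""
    result = {}
    for name, change in delta(baseline, current).items():
        result[name] = change < 0 if LOWER_IS_BETTER[name] else change > 0
    return result


# Hypothetical sample data: baseline vs. values remeasured after a change
baseline = {
    "lead_time_hours": 120.0,
    "deployments_per_week": 0.5,
    "change_failure_rate_pct": 20.0,
    "time_to_restore_hours": 8.0,
}
remeasured = {
    "lead_time_hours": 48.0,
    "deployments_per_week": 2.0,
    "change_failure_rate_pct": 25.0,
    "time_to_restore_hours": 6.0,
}

for name, ok in improved(baseline, remeasured).items():
    print(f"{name}: {'improved' if ok else 'not improved'}")
```

In this made-up example three metrics improved while the change failure rate got worse, which would be a reason to look at that change again before continuing down the same path.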

Software

Different kinds of software have tried to tackle these and other metrics in the past, e.g. by monitoring state changes in Jira tickets. In the Open Source world, we now have a new contender: Project Pelorus (https://www.konveyor.io/tools/pelorus/). It is part of the Kubernetes-centric set of tools in konveyor.io and, as the project description says, it brings a dashboard, based on Prometheus and Grafana, for visualizing the 4 basic DORA metrics (DORA being Google’s DevOps Research and Assessment team). It is easily extended or modified by adding Prometheus exporters or by creating new Grafana dashboards from the same data set.


The important fact here is that we basically don’t care about the absolute value of those metrics, but about the delta to our current baseline.

The basic architecture of Pelorus looks like this:

Always think about the outcome you would like to achieve first, and then define metrics and visualize them.

Conclusion

So no matter how you measure those metrics, they are a good starting point to put some hard data behind improvements to your Software Deployment Process. Please keep in mind that being able to reliably deploy several times per day will also give you the ability to deploy hotfixes in a timely manner – but that is a story for another day.