In today's software development landscape, effective collaboration among teams and seamless service orchestration are essential. Achieving these goals requires adherence to organizational standards for quality, security, and compliance. Without diligent monitoring, organizations risk losing sight of their delivery workflows, complicating the assessment of impacts on release velocity, stability, developer experience, and overall application performance.
To address these challenges, many organizations have begun tracking DevOps Research and Assessment (DORA) metrics. These metrics provide crucial insights for any team involved in software development, offering a comprehensive view of the Software Development Life Cycle (SDLC). DORA metrics are particularly useful for teams practicing DevOps methodologies such as Continuous Integration/Continuous Deployment (CI/CD) and Site Reliability Engineering (SRE), which focuses on enhancing system reliability.
However, the collection and analysis of these metrics can be complex. Decisions about which data points to track and how to gather them often fall to individual team leaders. Additionally, turning this data into actionable insights for engineering teams and leadership can be challenging.
The DORA research team at Google conducts annual surveys of IT professionals to gather insights into industry-wide software delivery practices. From these surveys, four key metrics have emerged as indicators of software teams' performance, particularly regarding the speed and reliability of software deployment. The four key DORA metrics are:
- Deployment frequency: how often an organization successfully deploys code to production
- Lead time for changes: the time from a code commit to that change running in production
- Change failure rate: the percentage of deployments that cause a failure in production
- Time to restore service: how long it takes to recover from a failure in production
DORA metrics connect production-based metrics with development-based metrics, providing quantitative measures that complement qualitative insights into engineering performance. They focus on two primary aspects: speed and stability. Deployment frequency and lead time for changes relate to throughput, while time to restore services and change failure rate address stability.
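As a rough illustration of how the four metrics map onto raw delivery data, here is a minimal Python sketch. The in-memory records and field names (`deployed_at`, `commit_at`, `failed`, `detected_at`, `resolved_at`) are hypothetical stand-ins for data you would normally pull from your CI/CD and incident-management systems.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical, simplified records; real data would come from your
# CI/CD and incident-management tooling.
deployments = [
    {"deployed_at": datetime(2024, 5, 1, 10), "commit_at": datetime(2024, 4, 30, 15), "failed": False},
    {"deployed_at": datetime(2024, 5, 2, 9),  "commit_at": datetime(2024, 5, 1, 17),  "failed": True},
    {"deployed_at": datetime(2024, 5, 3, 14), "commit_at": datetime(2024, 5, 3, 9),   "failed": False},
]
incidents = [
    {"detected_at": datetime(2024, 5, 2, 9, 30), "resolved_at": datetime(2024, 5, 2, 11)},
]
period_days = 7  # length of the analysis window

# Throughput metrics
deployment_frequency = len(deployments) / period_days  # deployments per day
lead_times = [d["deployed_at"] - d["commit_at"] for d in deployments]
median_lead_time_hours = median(lt.total_seconds() for lt in lead_times) / 3600

# Stability metrics
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)
restore_times = [i["resolved_at"] - i["detected_at"] for i in incidents]
mean_time_to_restore_hours = mean(r.total_seconds() for r in restore_times) / 3600

print(f"Deployment frequency: {deployment_frequency:.2f}/day")
print(f"Median lead time for changes: {median_lead_time_hours:.1f} h")
print(f"Change failure rate: {change_failure_rate:.0%}")
print(f"Time to restore service: {mean_time_to_restore_hours:.1f} h")
```

The exact aggregation (mean vs. median, per day vs. per week) is a team-level choice; what matters is applying it consistently.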
Contrary to the historical view that speed and stability are opposing forces, research from DORA indicates a strong correlation between these metrics in terms of overall performance. Additionally, these metrics often correlate with key indicators of system success, such as availability, thus offering insights that benefit application performance, reliability, delivery workflows, and developer experience.
While DORA DevOps metrics may seem straightforward, measuring them can involve ambiguity, leading teams to make challenging decisions about which data points to use. Below are guidelines and best practices to ensure accurate and actionable DORA metrics.
Establishing a standardized process for monitoring DORA metrics can be complicated due to differing internal procedures and tools across teams. Clearly defining the scope of your analysis—whether for a specific department or a particular aspect of the delivery process—can simplify this effort. It’s essential to consider the type and amount of work involved in different analyses and standardize data points to align with team, departmental, or organizational goals.
For example, platform engineering teams focused on improving delivery workflows may prioritize metrics like deployment frequency and lead time for changes. In contrast, SRE teams focused on application stability might prioritize change failure rate and time to restore service. By scoping metrics to specific repositories, services, and teams, organizations can gain detailed insights that help prioritize impactful changes.
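As a minimal illustration of that kind of scoping, the sketch below groups hypothetical deployment records by team and service and reports a metric per scope. The `team` and `service` tags are assumptions; in practice they would come from repository metadata or a service catalog.

```python
from collections import defaultdict

# Hypothetical deployment events tagged with the scope they belong to.
deployments = [
    {"service": "checkout", "team": "platform", "failed": False},
    {"service": "checkout", "team": "platform", "failed": True},
    {"service": "search",   "team": "sre",      "failed": False},
]

# Group deployments by (team, service) so each scope gets its own metrics.
by_scope = defaultdict(list)
for d in deployments:
    by_scope[(d["team"], d["service"])].append(d)

for (team, service), events in sorted(by_scope.items()):
    failure_rate = sum(e["failed"] for e in events) / len(events)
    print(f"{team}/{service}: {len(events)} deploys, "
          f"{failure_rate:.0%} change failure rate")
```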
Best Practices for Defining Scope:
To maintain consistency in collecting DORA metrics, address the following questions (a sketch that consolidates the answers into code follows the list):
1. What constitutes a successful deployment?
Establish clear criteria for what defines a successful deployment within your organization. Consider the different standards various teams might have regarding deployment stages. For instance, at what point do you consider a progressive release to be "executed"?
2. What defines a failure or response?
Clarify definitions for system failures and incidents to ensure consistency in measuring change failure rates. Differentiate between incidents and failures based on factors such as application performance and service level objectives (SLOs). For example, consider whether to exclude infrastructure-related issues from DORA metrics.
3. When does an incident begin and end?
Determine relevant data points for measuring the start and resolution of incidents, which are critical for calculating time to restore services. Decide whether to measure from when an issue is detected, when an incident is created, or when a fix is deployed.
4. What time spans should be used for analysis?
Select appropriate time frames for analyzing data, taking into account factors like organization size, the age of the technology stack, delivery methodology, and key performance indicators (KPIs). Adjust time spans to align with the frequency of deployments to ensure realistic and comprehensive metrics.
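One practical way to lock in the answers to these four questions is to encode them in a single, shared configuration that every team's reporting reads from. The sketch below is only illustrative; the rollout threshold, field names, and 30-day window are assumptions, not prescriptions.

```python
from datetime import datetime, timedelta

# Hypothetical, team-agreed definitions encoded in one place.
DEFINITIONS = {
    # 1. A deployment counts as "executed" once the rollout reaches 100%
    #    of traffic and its post-deploy health checks pass.
    "successful_deployment": lambda d: d["rollout_pct"] == 100 and d["health_checks_passed"],
    # 2. A deployment counts as a failure only if it breached an SLO in
    #    production; infrastructure-only incidents are excluded here.
    "deployment_failed": lambda d: d["slo_breached"] and not d["infra_only"],
    # 3. Incidents are measured from detection to the fix being deployed.
    "incident_duration": lambda i: i["fix_deployed_at"] - i["detected_at"],
    # 4. Metrics are reported over a rolling 30-day window.
    "analysis_window": timedelta(days=30),
}

# Example usage with a hypothetical deployment and incident record.
deployment = {"rollout_pct": 100, "health_checks_passed": True,
              "slo_breached": False, "infra_only": False}
incident = {"detected_at": datetime(2024, 5, 2, 9, 30),
            "fix_deployed_at": datetime(2024, 5, 2, 11, 45)}

print("Successful deployment:", DEFINITIONS["successful_deployment"](deployment))
print("Counted as failure:", DEFINITIONS["deployment_failed"](deployment))
print("Time to restore:", DEFINITIONS["incident_duration"](incident))
```

Keeping these definitions in one reviewable place makes it much easier to compare metrics across teams and to change a definition deliberately rather than by drift.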
Best Practices for Standardizing Data Collection:
Before diving into improvements, it's crucial to establish a baseline for your current CI/CD performance using DORA metrics. This involves gathering historical data to understand where your organization stands in terms of deployment frequency, lead time for changes, change failure rate, and mean time to restore (MTTR). This baseline will serve as a reference point to measure the impact of any changes you implement.
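As a minimal sketch of what such a baseline comparison might look like, the values below are hypothetical aggregates for a trailing 90-day window; in practice they would be computed from your historical delivery and incident data.

```python
# Hypothetical baseline aggregates for a trailing 90-day window.
baseline = {
    "deployment_frequency_per_week": 4.0,
    "median_lead_time_hours": 36.0,
    "change_failure_rate": 0.18,
    "mttr_hours": 5.5,
}

# Metrics measured for the current reporting period (also hypothetical).
current = {
    "deployment_frequency_per_week": 6.5,
    "median_lead_time_hours": 20.0,
    "change_failure_rate": 0.12,
    "mttr_hours": 4.0,
}

# Report each metric relative to the baseline so changes are visible.
for metric, base_value in baseline.items():
    delta_pct = (current[metric] - base_value) / base_value * 100
    print(f"{metric}: {current[metric]} (baseline {base_value}, {delta_pct:+.0f}%)")
```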
Actionable Insights: If your deployment frequency is low, it may indicate issues with your CI/CD pipeline or development process. Investigate potential causes, such as manual steps in deployment, inefficient testing procedures, or coordination issues among team members.
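A simple starting point for that investigation is to count deployments per week and check whether low frequency is constant or clustered around particular weeks; the dates below are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical deployment dates; in practice these come from CI/CD history.
deploy_dates = [date(2024, 4, 1), date(2024, 4, 3), date(2024, 4, 15),
                date(2024, 4, 16), date(2024, 4, 29)]

# Count deployments per ISO week to see whether low frequency is
# steady or concentrated in specific weeks.
per_week = Counter(d.isocalendar()[:2] for d in deploy_dates)
for (year, week), count in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {count} deployment(s)")
```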
Strategies for Improvement:
Actionable Insights: Long change lead time often points to inefficiencies in the development process. By analyzing your CI/CD pipeline, you can identify delays caused by manual approval processes, inadequate testing, or other obstacles.
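One way to ground that analysis is to break lead time down by pipeline stage and rank the stages by how long changes sit in them; the stage names and durations below are purely illustrative.

```python
from statistics import median

# Hypothetical per-change durations (hours) spent in each pipeline stage.
stage_durations = {
    "code review":       [2.0, 5.5, 26.0, 3.0],
    "ci build and test": [0.5, 0.6, 0.7, 0.5],
    "manual approval":   [12.0, 30.0, 8.0, 20.0],
    "deploy":            [0.2, 0.3, 0.2, 0.2],
}

# Rank stages by median duration to see where lead time is actually spent.
for stage, hours in sorted(stage_durations.items(),
                           key=lambda kv: median(kv[1]), reverse=True):
    print(f"{stage}: median {median(hours):.1f} h")
```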
Strategies for Improvement:
Actionable Insights: A high change failure rate is a clear sign that the quality of code changes needs improvement. This can be due to inadequate testing or rushed deployments.
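Breaking the failure rate down by service (or repository) can show where those quality gaps are concentrated; the records below are hypothetical.

```python
from collections import defaultdict

# Hypothetical deployment outcomes tagged by service.
deployments = [
    {"service": "checkout", "failed": True},
    {"service": "checkout", "failed": False},
    {"service": "search",   "failed": False},
    {"service": "search",   "failed": False},
    {"service": "payments", "failed": True},
]

# Compute change failure rate per service to find hotspots worth
# targeting with extra tests or more gradual rollouts.
totals = defaultdict(lambda: [0, 0])  # service -> [failures, deploys]
for d in deployments:
    totals[d["service"]][0] += d["failed"]
    totals[d["service"]][1] += 1

for service, (failures, deploys) in sorted(totals.items()):
    print(f"{service}: {failures}/{deploys} failed ({failures / deploys:.0%})")
```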
Strategies for Improvement:
Actionable Insights: If your MTTR is high, it suggests challenges in incident management and response capabilities. This can lead to longer downtimes and reduced user trust.
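Splitting restore time into a detection-to-response phase and a response-to-fix phase can show which part of incident handling dominates; the timelines and field names below are hypothetical.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident timelines; real data would come from your
# incident-management tool.
incidents = [
    {"detected_at": datetime(2024, 5, 2, 9, 30),
     "responded_at": datetime(2024, 5, 2, 10, 15),
     "resolved_at": datetime(2024, 5, 2, 13, 0)},
    {"detected_at": datetime(2024, 5, 10, 22, 0),
     "responded_at": datetime(2024, 5, 10, 22, 20),
     "resolved_at": datetime(2024, 5, 11, 1, 30)},
]

def hours(delta):
    return delta.total_seconds() / 3600

# Average time spent waiting for a responder vs. actually fixing the issue.
time_to_respond = mean(hours(i["responded_at"] - i["detected_at"]) for i in incidents)
time_to_fix = mean(hours(i["resolved_at"] - i["responded_at"]) for i in incidents)
mttr = mean(hours(i["resolved_at"] - i["detected_at"]) for i in incidents)

print(f"Mean time to respond: {time_to_respond:.1f} h")
print(f"Mean time to fix after response: {time_to_fix:.1f} h")
print(f"MTTR: {mttr:.1f} h")
```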
Strategies for Improvement:
Utilizing DORA metrics is not a one-time activity but part of an ongoing process of continuous improvement. Establish a regular review cycle where teams assess their DORA metrics and adjust practices accordingly. This creates a culture of accountability and encourages teams to seek out ways to improve their CI/CD workflows continually.
Etsy, an online marketplace, adopted DORA metrics to assess and enhance its CI/CD workflows. By focusing on improving its deployment frequency and lead time for changes, Etsy was able to increase deployment frequency from once a week to multiple times a day, significantly improving responsiveness to customer needs.
Flickr used DORA metrics to track its change failure rate. By implementing rigorous automated testing and post-mortem analysis, Flickr reduced its change failure rate significantly, leading to a more stable production environment.
Google's Site Reliability Engineering (SRE) teams utilize DORA metrics to inform their practices. By focusing on MTTR, Google has established an industry-leading incident response culture, resulting in rapid recovery from outages and high service reliability.
Typo is a powerful tool designed specifically for tracking and analyzing DORA metrics. It provides an efficient solution for development teams seeking precision in their DevOps performance measurement.