Priyasha Dureja

Technical Content Manager

4 Key DevOps Metrics for Improved Performance

Many organizations are prioritizing the adoption and enhancement of their DevOps practices. The aim is to optimize the software development life cycle and increase delivery speed, which enables faster market reach and improved customer service. 

In this article, we’ve shared four key DevOps metrics, why they matter, and other metrics to consider. 

What are DevOps Metrics?

DevOps metrics are the key indicators that showcase the performance of the DevOps software development pipeline. Because DevOps bridges the gap between development and operations, these metrics are essential for measuring and optimizing the efficiency of both the processes and the people involved.

Tracking DevOps metrics allows teams to quickly identify and eliminate bottlenecks, streamline workflows, and ensure alignment with business objectives.

Four Key DevOps Metrics 

Here are four important DevOps metrics to consider:

Deployment Frequency 

Deployment Frequency measures how often code is deployed into production per week, taking into account everything from bug fixes and capability improvements to new features. It is a key indicator of agility and efficiency, and a catalyst for the continuous delivery and iterative development practices that align seamlessly with the principles of DevOps. Getting this first key metric wrong can degrade the other DORA metrics.

Deployment Frequency is measured by dividing the number of deployments made during a given period by the number of weeks (or days) in that period. One deployment per week is a common baseline, though the right cadence also depends on the type of product.
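To make the calculation concrete, here is a minimal Python sketch that computes weekly deployment frequency from a list of deployment timestamps. The timestamps and date range are made-up placeholders; in practice this data would come from your CI/CD or deployment tooling.

```python
from datetime import datetime

# Hypothetical deployment timestamps pulled from a CI/CD tool (assumed data).
deployments = [
    datetime(2024, 6, 3), datetime(2024, 6, 5), datetime(2024, 6, 10),
    datetime(2024, 6, 12), datetime(2024, 6, 14), datetime(2024, 6, 24),
]

def deployment_frequency(deploy_times, period_start, period_end):
    """Average deployments per week over the given period."""
    in_period = [t for t in deploy_times if period_start <= t <= period_end]
    weeks = (period_end - period_start).days / 7
    return len(in_period) / weeks

freq = deployment_frequency(deployments, datetime(2024, 6, 1), datetime(2024, 6, 29))
print(f"Deployment frequency: {freq:.1f} deployments/week")
```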

Importance of High Deployment Frequency

  • High deployment frequency allows new features, improvements, and fixes to reach users more rapidly. It allows companies to quickly respond to market changes, customer feedback, and emerging trends.
  • Frequent deployments usually involve incremental, manageable changes, which are easier to test, debug, and validate. Moreover, it helps to identify and address bugs and issues more quickly, reducing the risk of significant defects in production.
  • High deployment frequency leads to higher satisfaction and loyalty as it allows continuous improvement and timely resolution of issues. Moreover, users get access to new features and enhancements without long waits, which improves their overall experience.
  • Deploying smaller changes reduces the risk associated with each deployment, making rollbacks and fixes simpler. Moreover, continuous integration and deployment provide immediate feedback, allowing teams to address problems before they escalate.
  • Regular, automated deployments reduce the stress and fear often associated with infrequent, large-scale releases. Development teams can iterate on their work more quickly, which leads to faster innovation and problem-solving.

Lead Time for Changes

Lead Time for Changes measures the time it takes for a code change to go through the entire development pipeline and become part of the final product. It is a critical metric for tracking the efficiency and speed of software delivery. The measurement of this metric offers valuable insights into the effectiveness of development processes, deployment pipelines, and release strategies.

To measure this metric, DevOps teams need:

  • The exact time of the commit 
  • The number of commits within a particular period
  • The exact time of the deployment 

Divide the total time elapsed from commit to deployment by the number of commits made.
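As a rough illustration, here is the same calculation in Python, assuming you have already paired each change's commit time with its deployment time (the sample pairs below are invented):

```python
from datetime import datetime, timedelta

# Hypothetical (commit_time, deploy_time) pairs for individual changes (assumed data).
changes = [
    (datetime(2024, 6, 3, 9, 0),  datetime(2024, 6, 3, 17, 0)),
    (datetime(2024, 6, 4, 11, 0), datetime(2024, 6, 6, 10, 0)),
    (datetime(2024, 6, 7, 14, 0), datetime(2024, 6, 10, 9, 0)),
]

def mean_lead_time(pairs):
    """Average time from commit to deployment across all changes."""
    total = sum((deploy - commit for commit, deploy in pairs), timedelta())
    return total / len(pairs)

print(f"Mean lead time for changes: {mean_lead_time(changes)}")
```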

Importance of Reduced Lead Time for Changes

  • Short lead times allow new features and improvements to reach users quickly, delivering immediate value and outpacing competitors by responding promptly to market needs and trends. 
  • Customers see their feedback addressed promptly, which leads to higher satisfaction and loyalty. Bugs and issues can be fixed and deployed rapidly, which improves the user experience. 
  • Developers spend less time waiting for deployments and more time on productive work, which reduces context switching. It also enables continuous improvement and innovation, keeping the development process dynamic and effective.
  • Reduced lead time encourages experimentation. This allows businesses to test new ideas and features rapidly and pivot quickly in response to market changes, regulatory requirements, or new opportunities.
  • Short lead times help in better allocation and utilization of resources. They help avoid prolonged delays and keep operations running smoothly. 

Change Failure Rate

Change Failure Rate refers to the proportion or percentage of deployments that result in failure or errors, indicating the rate at which changes negatively impact the stability or functionality of the system. It reflects the stability and reliability of the entire software development and deployment lifecycle. Tracking CFR helps identify bottlenecks, flaws, or vulnerabilities in processes, tools, or infrastructure that can negatively impact the quality, speed, and cost of software delivery.

To calculate CFR, follow these steps:

  • Identify Failed Changes: Keep track of the number of changes that resulted in failures during a specific timeframe.
  • Determine Total Changes Implemented: Count the total changes or deployments made during the same period.

Apply the formula:

CFR = (Number of Failed Changes / Total Number of Changes) * 100

This gives the Change Failure Rate as a percentage.
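For illustration, here is a small Python sketch of that formula, assuming deployment outcomes are recorded as a simple list of booleans (True meaning the deployment caused a failure); that data representation is an assumption, not a prescribed format:

```python
def change_failure_rate(deployment_outcomes):
    """CFR = failed changes / total changes * 100.

    `deployment_outcomes` is a list of booleans: True means the deployment
    caused a failure in production (assumed representation).
    """
    if not deployment_outcomes:
        return 0.0
    failed = sum(1 for caused_failure in deployment_outcomes if caused_failure)
    return failed / len(deployment_outcomes) * 100

# Example: 2 failed deployments out of 20 -> 10% CFR.
outcomes = [False] * 18 + [True] * 2
print(f"Change Failure Rate: {change_failure_rate(outcomes):.1f}%")
```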

Importance of Low Change Failure Rate

  • Low change failure rates ensure the system remains stable and reliable, which leads to less downtime and fewer disruptions. Moreover, consistent reliability builds trust with users. 
  • Reliable software increases customer satisfaction and loyalty, as users can depend on the product for their needs. This further lowers issues and interruptions, leading to a more seamless and satisfying experience.
  • Reduced change failure rates result in reliable and efficient software which leads to higher customer retention and positive word-of-mouth referrals. It can also provide a competitive edge in the market that attracts and retains customers.
  • Fewer failures translate to lower costs associated with diagnosing and fixing issues in production. This also allows resources to be better allocated to development and innovation rather than maintenance and support.
  • Low failure rates contribute to a more positive and motivated work environment and give teams confidence in their deployment processes and the quality of their code. 

Mean Time to Restore

Mean Time to Restore (MTTR) represents the average time taken to resolve a production failure or incident and restore normal system functionality. Measuring MTTR provides crucial insights into an engineering team's incident response and resolution capabilities. It helps identify areas of improvement, optimize processes, and enhance overall team efficiency. 

To calculate it, add up the total downtime and divide it by the total number of incidents that occurred within a particular period.
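A minimal Python sketch of that calculation, assuming each incident record carries a start and a restoration timestamp (the field names and sample incidents are illustrative):

```python
from datetime import datetime, timedelta

# Hypothetical incident records with detection and restoration times (assumed data).
incidents = [
    {"started": datetime(2024, 6, 2, 10, 0), "restored": datetime(2024, 6, 2, 11, 30)},
    {"started": datetime(2024, 6, 9, 22, 0), "restored": datetime(2024, 6, 10, 0, 15)},
    {"started": datetime(2024, 6, 20, 8, 0), "restored": datetime(2024, 6, 20, 8, 40)},
]

def mean_time_to_restore(records):
    """Total downtime divided by the number of incidents."""
    total_downtime = sum(
        (r["restored"] - r["started"] for r in records), timedelta()
    )
    return total_downtime / len(records)

print(f"MTTR: {mean_time_to_restore(incidents)}")
```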

Importance of Reduced Mean Time to Restore

  • Reduced MTTR minimizes system downtime, meaning higher availability of services and systems, which is critical for maintaining user trust and satisfaction.
  • Faster recovery from incidents means that users experience less disruption. This leads to higher customer satisfaction and loyalty, especially in competitive markets where service reliability can be a key differentiator.
  • Frequent or prolonged downtimes can damage a company’s reputation. Quick restoration times help maintain a good reputation by demonstrating reliability and a strong capacity for issue resolution.
  • Keeping MTTR low helps in meeting service-level agreements (SLAs), avoiding penalties, and maintaining good relationships with clients and stakeholders.
  • Reduced MTTR encourages a proactive culture of monitoring, alerting, and preventive maintenance. This can lead to identifying and addressing potential issues swiftly, which further enhances system reliability.

Other DevOps Metrics to Consider 

Apart from the above-mentioned key metrics, there are other metrics to take into account. These are: 

Cycle Time 

Cycle time measures the total elapsed time taken to complete a specific task or work item from the beginning to the end of the process.

Mean Time to Failure 

Mean Time to Failure (MTTF) is a reliability metric used to measure the average time a non-repairable system or component operates before it fails.

Error Rates

Error Rates measure the number of errors encountered in the platform. They reflect the platform's stability, reliability, and user experience.

Response Time

Response time is the total time from when a user makes a request to when the system completes the action and returns a result to the user.

How Typo Leverages DevOps Metrics 

Typo is a powerful tool designed specifically for tracking and analyzing DORA metrics. It provides an efficient solution for development teams seeking precision in their DevOps performance measurement.

  • With pre-built integrations in the dev tool stack, the DORA metrics dashboard provides all the relevant data within minutes.
  • It helps in deep diving and correlating different metrics to identify real-time bottlenecks, sprint delays, blocked PRs, deployment efficiency, and much more from a single dashboard.
  • The dashboard sets custom improvement goals for each team and tracks their success in real time.
  • It gives real-time visibility into a team’s KPIs and lets them make informed decisions.

Conclusion

Adopting and enhancing DevOps practices is essential for organizations that aim to optimize their software development lifecycle. Tracking these DevOps metrics helps teams identify bottlenecks, improve efficiency, and deliver high-quality products faster. 

The DORA Lab EP #03 | Ben Parisot - Engineering Manager at Planet Argon

In the third episode of ‘The DORA Lab’ - an exclusive podcast by groCTO - host Kovid Batra has an engaging conversation with Ben Parisot, Software Engineering Manager at Planet Argon, who has over 10 years of experience in engineering and engineering management.

The episode starts with Ben offering a glimpse into his personal life. Following that, he delves into engineering metrics, specifically DORA & the SPACE framework. He highlights their significance in improving the overall efficiency of development processes, ultimately benefiting customers & dev teams alike. He discusses the specific metrics he monitors for team satisfaction and the crucial areas that affect engineering efficiency, underscoring the importance of code quality & longevity. Ben also discusses the challenges faced when implementing these metrics, providing effective strategies to tackle them.

Towards the end, Ben provides parting advice for engineering managers leading small teams, emphasizing the importance of identifying & utilizing metrics tailored to their specific needs.

Timestamps

  • 00:09 - Ben’s Introduction
  • 03:05 - Understanding DORA & Engineering Metrics
  • 07:51 - Code Quality, Collaboration & Roadmap Contribution
  • 11:34 - Team Satisfaction & DevEx
  • 16:52 - Focus Areas of Engineering Efficiency
  • 24:39 - Implementing Metrics Challenges
  • 32:11 - Ben’s Parting Advice

Links and Mentions

Episode Transcript

Kovid Batra: Hi, everyone. This is Kovid, back with another episode of Beyond the Code by Typo, and today's episode is a bit special. This is part of The DORA Lab series and this episode becomes even more special with our guest who comes with an amazing experience of 10 plus years in engineering and engineering management. He's currently working as an engineering manager with Planet Argon. We are grateful to have you here, Ben. Welcome to the show. 

Ben Parisot: Thank you, Kovid. It's really great to be here. 

Kovid Batra: Cool, Ben. So today, I think when we talk about The DORA Lab, which is our exclusive series, where we talk only about DORA, engineering metrics beyond DORA, and things related to implementation of these metrics and their impact in the engineering teams. This is going to be a big topic where we will deep dive into the nitty gritties that you have experienced with this framework. But before that, we would love to know about you. Something interesting, uh, about your life, your hobby and your role at your company. So, please go ahead and let us know. 

Ben Parisot: Sure. Um, well, my name is Ben Parisot, uh, as you said, and I am the engineering manager at Planet Argon. We are a Ruby on Rails agency. Uh, we are headquartered in Portland, Oregon in the US but we have a distributed team across the US and, uh, many different countries around the world. We specifically work with, uh, companies that have legacy rails applications that are becoming difficult to maintain, um, either because of outdated versions, um, or just like complicated legacy code. We all know how the older an application gets, the more complex and, um, difficult it can be to work within that code. So we try to come in, uh, help people pull back from the brink of having to do a big rewrite and modernize and update their applications. 

Um, for myself, I am an Engineering Manager. I'm a writer, parts, uh, very, very non-professional musician. Um, I like to read, I really like comic books. I currently am working on a mural for my son, uh, he's turning 11 in about a week, and he requested a giant Godzilla mural painted on his bedroom wall. This is the first time I've ever done a giant mural, so we'll see how it goes. So far, so good. Uh, but he did tell me that, uh, he said, "Dad, even if it's bad, it's just paint." So, I think that.. Uh, still trying to make it look good, but, um, he's, he's got the right mindset, I think about it. 

Kovid Batra: Yeah, I mean, I have to appreciate you for that and honestly, great courage and initiative from your end to take up this for the kid. I am sure you will do a great job there. All the best, man. And thanks a lot for this quick, interesting intro about yourself. 

Let's get going for The DORA Lab. So I think before we deep dive into, uh, what these metrics are all about and what you do, let's have a quick definition of DORA from you, like what is DORA and why is it important and maybe not just DORA, but other metrics, engineering metrics, why they are important. 

Ben Parisot: Sure. So my understanding of DORA is sort of the classical, like it's the DevOps Research and Assessment. It was a think tank type of group just to, I can't remember the company that they started with, but it was essentially to improve productivity specifically around deployments, I believe, and like smoothing out some deployment, uh, and more DevOps-related processes, I think. That's, uh, but, uh, it's essentially evolved to be more about engineering metrics in a broader sense, still pretty focused on deployment. So specifically, like how, how fast can teams deploy code, the frequency of those deployments and changes, uh, to the codebase. Um, and then also metrics around failures and response to failures and how fast people can, uh, or engineering teams respond to incidences. 

Beyond DORA, there's of course the SPACE framework, which is a little bit more broader and looks at some of the more day-to-day processes involved in software engineering, um, and also developer experience. We at Planet Argon, we do a little bit of DORA. We focus mainly on more SPACE-related metrics, um, although there's a bunch of crossover. For us, metrics are very important both for, you know, evaluating the performance of our team, so that we can, you know, show value to our clients and prove, you know, "Hey, this is the value that we are providing beyond just the deliverable." Sometimes because of the nature of our work, we do a lot of work on like the backend or improvements that are not necessarily super-apparent to an end user or even, you know, the stakeholder within the project. So having metrics that we can show to our clients to say, "Hey, this is, um, you know, this is improving our processes and our team's efficiency and therefore, that's getting more value for your budget because we're able to move faster and accomplish more." That's a good thing. Also, it's just very helpful to, you know, keep up good team morale and for longevity sake, like, engineers on our team really like to know where they stand. They like to know how they're doing. Um, they like to have benchmarks on which they can, uh, measure their own growth and know where in sort of the role advancement phase they are based on some, you know, quantifiable metric that is not just, you know, feedback from their coworkers or from clients. 

Kovid Batra: Yeah, I think that's totally making sense to me and while you were talking about the purpose of bringing these metrics in place and going beyond DORA also, that totally relates to the modern software development processes, because you just don't want to like restrict yourself to certain part of engineering efficiency when you measure it, you just don't want to look at the lead time for change, or you just don't want to look at the deployment frequency. There are things beyond these, which are also very important and become, uh, the area of inefficiency or bottlenecks in the team's overall delivery. So, just for example, I mean, this is a question also, whether there is good collaboration between the team or not, right? If there is no good code collaboration, that is a big bottleneck, right? Getting reviews done in a proper way where the quality, the base is intact, that really, really matters. Similarly, if you talk about things like delivery, when you're delivering the software from the planning phase to the actual feature rollout and users using it, so cycle time probably in DORA will cover that, but going beyond that space to understand the project management piece and making sure how much time in total goes into it is again an aspect. Then, there are areas where you would want to understand your team satisfaction and how much teams are contributing towards the roadmap, because that's also how you define that whether you have accumulated too much of technical debt or there are too many bugs coming in and the team is involved right over there. 

And an interesting one which I recently came across was someone was measuring that when new engineers are getting onboarded, uh, how much time does it take to go for the first commit, right? So, these small metrics really matter in defining how the overall efficiency of the engineering or the development process looks like. So, I just wanted to understand from you, just for example, as I mentioned, how do you look at code collaboration or how do you look at, uh, roadmap contribution or how do you look at initial code quality, deliverability, uh, when it comes to your team. And I understand like you are a software agency, maybe a roadmap contribution thing might not be very relevant. So, maybe we can skip that. But otherwise, I think everything else would definitely be relevant to your context. 

Ben Parisot: Sure. Yeah, being an agency, we work with multiple different clients, um, different repos in different locations even, some of them in GitHub, Bitbucket, um, GitLab, like we've got clients with code everywhere. Um, so having consistent metrics across all of like the DORA or SPACE framework is pretty difficult. So we've been able to piecemeal together metrics that make sense for our team. And as you said, like a lot of the metrics, they're for productivity and efficiency sake for sure, but they also then, if you like dig one level deeper, there is a developer experience metric below that. Um, so for instance, PR review, you know, you mentioned, um, like turnaround time on PRs, how quickly are people that are being assigned to review getting to it, how quickly are changes being implemented from after a review has been submitted; um, those are on the surface level, very productivity- driven metrics. We want our team to be moving quickly and reviewing things in a timely manner. But as you mentioned, like a slow PR turnaround time can be a symptom of bad communication and that can lead to a lot of frustration, um, and even like disagreement amongst team members. So that's a really like developer satisfaction metric as well, um, because we want to make sure no one's frustrated with any of their coworkers, uh, or bottlenecked and just stuck not knowing what to do because they have a PR that hasn't been touched. 

We use a couple of different tools. We're luckily a pretty small team, so my job as a manager and collecting all this data from the metrics is doable for now, not necessarily scalable, but doable with the size of our team. I do a lot of manual data collection, and then we also have various third-party integrations and sort of marketplace plugins. So, we work a lot in GitHub, and we use some plugins in GitHub to help us give some insight into, for instance, like you said, like commit time or number of commits within a PR size of those commits you know, we have an engineering handbook that has a lot of our, you know, agreed upon best practices and those are generally in place so that our developers can be more efficient and happy in their work, so, you know, it can feel a little nitpicky to be like, "Oh, you only had two commits in this giant PR." Like, if the work's getting done, the work's getting done. However you know, good commit, best practice. We try to practice atomic commits here at Planet Argon. That is going to, you know, not only like create easier rollbacks if necessary, there's just a lot of reasons for our best practices. So the metrics try to enforce the best practices that we have in mind already, or that we have in place already. And then, yeah, uh, you asked what other tools or? 

Kovid Batra: So, yeah, I mean talking specifically about the team satisfaction piece. I think that's very critical. Like, that's one of the fundamental things that should be there in the team so that you make sure the team is actually productive, right? From what you have explained, uh, the kind of best practices you're following and the kind of things that you are doing within the team reflect that you are concerned about that. So, are there any specific metrics there which you track? Can you like name a few of them for us? 

Ben Parisot: Sure, sure. Um, so for team satisfaction, we track a number of following metrics. We track build processes, code review, deep work, documentation, ease of release, local development, local environment setup, managing technical debt, review turnaround, uh, roadmap and priorities, and test coverage and test efficiency. So these are all sentiment metrics. I use them from a management perspective to not only get a feeling of how the team is doing in terms of where their frustrations lie, but I also use it to direct my work. If I see that some of these metrics or some of these areas of focus are receiving like consistently low sentiment scores, then I can brainstorm with the team, bring it to an all-hands meeting to be like, "Here's some ideas that I have for improving these. What is your input? What's a reasonable timeline look like?" And then, show them that, you know, their continued participation in these, um, these surveys, these developer experience surveys are leading to results that are improving their work experience. 

Kovid Batra: Makes sense. Out of all these metrics that you mentioned, which are those top three or four, maybe? Because it's very difficult to, uh, look at 10, 12 metrics every time, right? So.. 

Ben Parisot: Yes. 

Kovid Batra: There is a go-to metric or there are a few go-to metrics that quickly tell you okay, what's going on, right? So for me, sometimes what I basically do is like if I want to see if the code, initial code quality is coming out good or not I'm mostly looking at how many commits are happening after the PRs are being raised for review and how many comments were there. So when I look at these two, I quickly understand, okay, there is too much to and fro happening and then quality initially is not coming out well. But in the case of team satisfaction, of course, it's a very feeling, qualitative-driven, uh, piece we are talking about. But still, if you have to objectify it with, let's say three or four metrics, what would be those three or four important metrics that you think impact the developer's experience or developer's satisfaction in your team? 

Ben Parisot: Sure. So we actually have like 4 KPIs that we track in addition to those sentiment metrics, and they are also sort of sentiment metrics as well, but they're a little higher level. Um, we track weekly time loss, ease of delivery, engagement, uh, and perceived productivity. So we feel like those touch pretty much all of the different aspects of the software development life cycle or the developer's day-to-day experience. So, ease of delivery, how, you know, how easy is it for you to be, uh, deploying your code, um, that touches on any bottlenecks in the deployment pipelines, any issues with PRs, PR reviews, that sort of thing. Um, engagement speaks to how excited or interested people are about the work that they're doing. So that's the more meat of the software development work. Um, perceived productivity is how, you know, how well you think you are being productive or how productive you feel like you are being. Um, and that's really important because sometimes the hard metrics of productivity and the perceived productivity can be very different, and not only for like, "Oh, you think you're being very productive, but you're not on paper." Um, oftentimes, it's the reverse where someone feels like they aren't being productive at all and I can go, and I know that from their sentiment score. Um, but then I can go and pull up PRs that they've submitted or work that they've been doing in JIRA and just show them a whole list of work that they've done. I feel like sometimes developers are very in the weeds of the work and they don't have a chance to step back and look at all that they've accomplished. So that's an important metric to make sure that people are recognizing and appreciating all of the work and their contributions to a project and not feeling like, "Oh, this one ticket, I haven't been very productive on. So, therefore, I'm not a very good developer." Uh, and then finally, weekly time loss is a big one. This one is more about everything outside of the development work. So this also focuses on like, how often are you in your flow? Are you having too many meetings? Do you feel like, you know, the asynchronous current communication that is just the nature of our distributed team? Is that blocking you? And are you being, you know, held up too much by waiting around for a response from someone? So that's an important metric that we look at as well. 

Kovid Batra: Makes sense. Thanks. Thanks for that detailed overview. I think team satisfaction is of course, something that I also really, really care about. Beyond that, what do you think are those important areas of engineering efficiency that really impact the broader piece of efficiency? So, uh, just to give you an example, is it you're focusing mostly in your teams related to deliverability or are you focusing more related to, uh, the quality of the work or is it something related to maybe sprints? I'm really just throwing out ideas here to understand from you how you, uh, look at which are those important areas of engineering efficiency other than developer satisfaction. 

Ben Parisot: Yeah. I think, right. I think, um, for our company, we're a little bit different even than other agencies. Companies don't come to us often for large new feature development. You know, as I mentioned at the top of the recording, we inherit really old applications. We inherit applications that have, you know, the developers have just given up on. So a lot of our job is modernizing and improving the quality of the code. So, you know, we try to keep our deployment metrics, you know, looking nice and having all of the metrics around deployment and, uh, post-deployment, obviously. Um, but from my standpoint, I really focus on the quality of the code and sort of the longevity of the code that the team is writing. So, you know, we look at coding practices at Planet Argon, we measure, you know, quality in a lot of different ways. Some of them are, like I said earlier, like practicing atomic commits size of PRs. Uh, because we have multiple projects that people are working on, we have different levels of understanding of those projects. So there's like, you know, some people that have a very high domain knowledge of an application and some people that don't. So when we are submitting PRs, we try to have more than one person look at a PR and one person is often coming with a higher domain knowledge and reviewing that code as it, uh, does it satisfy the requirements? Is it high-quality code within the sort of ecosystem of that existing application? And then, another person is looking at more of the, the best practices and the coding standards side of it, and reviewing it just from a more, a little more objective viewpoint and not necessarily as it's related to that.

Let's see, I'm trying to find some specific metrics around code quality. Um, commits after a PR submission is one, you know, if where you are finding that our team is often submitting a PR and then having to go back and work a lot more on it and change a lot more things; that means that those PRs are probably not ready or they're being submitted a little earlier. Sometimes that's a reflection of the developer's understanding of the task or of the code. Sometimes it's a reflection of the clarity of the issue that they've been assigned or the requirements. You know, if the client hasn't very clearly defined what they want, then we submit a PR and they're like, "Oh, that's not what I mean." so that's an important one that we looked at. And then, PR approval time, I would say is another one. Again, that one is both for our clients because we want to be moving as quickly with them as we can, even though we don't often work against hard deadlines. We still like to keep a nice pace and show that our team is active on their projects. And then, it's also important for our team because nobody likes to be waiting around for days and days for their PR to be reviewed. 

Kovid Batra: Makes sense. I think, yeah, uh, these are some critical areas and almost every engineering team struggles with it in terms of efficiency and what I have felt also is not just organization, right, but individual teams have different challenges and for each team, you could be looking at different metrics to solve their problems. So one team would be having a low deployment frequency because of maybe not enough tooling in place and a lot of manual intervention being there, right? That's when their deployments are not coming out well or breaking most of the time. Or it could be for another team, the same deployment frequency could be a reason that the developers are actually not writing or doing enough number of like PRs in a defined period of time. So there is a velocity challenge there, basically. That's why the deployment frequency is low. So most of the times I think for each team, the challenge would be different and the metrics that you pick would be different. So in your case, as you mentioned, like how you do it for your clients and for your teams is a different method. Cool. I think with that, I.. Yeah, you were saying something. 

Ben Parisot: Oh, I was, yeah. I was gonna say, I think that, uh, also, you know, we have sort of across the board, company best practice or benchmarks that we try to reach for a lot of different things. For instance, like test coverage or code coverage, technical debt, and because we inherit codebases in various levels of, um, quality, the metric itself is not necessarily good or bad. The progress towards a goal is where we look. So we have a code coverage metric, uh, or goal across the company of like 80, 85%, um, test coverage, code coverage. And we've inherited apps, big applications, live applications that have zero test coverage. And so, you know, when I'm looking at metrics for tests, uh, you know, it's not necessarily, "Hey, is this application's test coverage meeting our standards? Is it moving towards our standards?" And then it also gets into the individual developers. Like, "Are you writing the tests for the new code that you're writing? And then also, is there any time carved out of the work that you have on that project to backfill tests?" And similarly, with, uh, technical debt, you know, we use a technical debt tagging tool and oftentimes, like every three months or so we have a group session where we spend like an hour, hour and a half with our cameras off on zoom, just going into codebases that we're currently working on and finding as much technical debt as we can. Um, and that's not necessarily like, oh, we're trying to, you know, find who's not writing the best code or what the, uh, you know, trying to find all the problems that previous developers caused. But it's more of a is this, you know, other areas for like, you know, improvements? Right. And also, um, is there any like potential risks in this codebase that we've overlooked just by going through the day-to-day? And so, the goal is not, "Hey, we need to have no technical debt ever." It's, "Are we reducing the backlog of tech debt that we're currently working within?" 

Kovid Batra: Totally, totally. And I think this again brings up that point that for every team, the need of a metric would be very different. In your case, the kind of projects you are getting by default, you have so much of technical debt that's why they're coming to you. People are not managing it and then the project is handed over to you. So, having that test coverage as a goal or a metric is making more sense for your team. So, agreed. I think I am a hundred percent in line with that. But one thing is for sure that there must be some level of, uh, implementation challenges there, right? Uh, it's not like straightforward, like you, you are coming in with a project and you say, "Okay, these are the metrics we'll be tracking to make sure the efficiency is in place or not." There are always implementation challenges that are coming with that. So, I mean, with your examples or with your experience, uh, what do you think most of the teams struggle with while implementing these metrics? And I would be more than happy to hear about some successes or maybe failures also related to your implementation experiences. 

Ben Parisot: Yeah. So I would just say the very first challenge that we face is always, um. I don't want to say team morale, but, um, the, somewhat like overwhelming nature depending on the state of the codebase. Like if you inherit a codebase, it's really large and there's no tests. That's, you know, overwhelming to think about, having to go and write all those tests, but it's also overwhelming and scary to think, "Oh, what if something breaks?" Like, a test is a really good indicator of where things might be breaking and there's none of that, so the guardrails are off. Um, and that's scary. So helping people get used to, especially newer developers who have just joined the team, helping them get used to working within a codebase that might not be as nice and clean as previous ones that they've worked with is a big challenge. In terms of actual implementation, uh, we face a number of challenges being an agency. Like I mentioned earlier, some codebases are in, um, different places like GitHub or Bitbucket. You know, obviously those tools have generally the same features and generally the same, you know, sort of dashboard type of things. Um, but if we are using any sort of like integrated tool to measure metrics around those things, if we get it, um, a repo that's not in the platform, it's not on the platform where we have that integration happening, then we don't get the metrics on that, or we have to spin up a new integration. 

Kovid Batra: Yeah. 

Ben Parisot: Um, for some of our clients, we have access to some of their repos and not others, and so, like we are working in an app ecosystem where the application that we are responsible for is communicating and integrated with another application that we don't, we can't see; and so that's very difficult at times. That can be a challenge for implementing certain metrics, because we need to know, like, especially performance metrics for the application. Something might be happening on this hidden application that we don't have any control over or visibility into. 

Um, and then what else? Just I would say also a challenge that we face is the, um, most of our developers are working on 2 to 3 applications at a time, and depending on the length of the engagement, um, sometimes people will switch on and off. So it can be difficult to track metrics for just a single project when developers are working on it for like maybe a few weeks or a few months and then leaving. Sometimes we have like a dedicated developer who's lead and then, have a support developer come in when necessary. And so that can be challenging if we're trying to parse out, like why there might've been a shift in the metrics or like a spike in one metric or another, or a drop and be like, "Okay, well, let's contextualize that around who was working on this project, try to determine like, okay, is this telling us something important about the project itself, or is it just data that is displaying the adding or subtracting of different developers to the project?" So that can be a challenge. 

Specifically, I can mention an actual sort of case study that we had. Uh, we were using Code Climate, which is a tool that we still use. We use the quality tool for like audits and stuff. Um, but when I first started at Planet Argon, I wanted to implement its velocity tool, which is like the sister tool to quality and it is like very heavily around cycle time. Um, and it was all great, I was very excited about it. Went and signed up, um, went and connected our GitHub accounts, or I guess I did the Bitbucket account at the time cause most of our repos were in Bitbucket. Um, didn't realize at the time at least that you could only integrate with one platform. And so, even though we had, uh, we had accounts and we had clients with applications on GitHub, we were only able to integrate with Bitbucket. So some engineers' work was not being caught in that tool at all because they were working primarily in GitHub application. And again, like I said, sometimes developers would then go to one of the applications in Bitbucket, help out and then drop off. So it was just causing a lot of fluctuations in data and also not giving us metrics for the entire team consistently. So we eventually had to drop it because it was just not a very valuable tool, um, in that it was not capturing all of the activities of all of our engineers everywhere they were working. Um, I wished that it was, but it's the nature of the agency work and also, you know, having people that are, um. 

Kovid Batra: Yeah, I totally agree on that point and the challenges that you're facing are actually very common, but at the same time, uh, having said that, I believe the tooling around metrics observation and metrics monitoring has come way ahead of what you have been using in Code Climate. So, the challenge still remains, like most of the teams try to gather metrics manually, which is time-consuming, or in your case where agencies are working on different projects, it's very difficult or different codebases, very difficult to gather the right metrics for individual developers there also. Somehow, I think the challenges are very valid, but now, the tooling that is available in the market is able to cater to all those challenges. So maybe you want to give it a try and see, uh, your metrics implementation getting in place. But yeah, I think, thanks for highlighting these pointers and I think a lot of people, a lot of engineering managers and engineering leaders struggle with the same challenges while implementing those. So totally, uh, bringing these challenges in front of the audience and talking about it would bring some level of awareness to handle these challenges as well. 

Great. Great, Ben. I think with this, uh, we would like to bring an end to our today's episode. It was really amazing to understand how Planet Argon is working and you are dealing with those challenges of implementing metrics and how well you are actually doing, even though the right tooling or right things are not in place, but the important part is you realize the purpose. You don't probably go ahead and do it for the sake of doing it. You're actually doing it where you have a purpose and you know that this can impact the overall productivity of the team and also bring credibility with your clientele that yes, we are doing something and you have something to show in numbers also. So, I really appreciate that. 

And while, like before we say goodbye, is there parting advice or something that you would like to speak with the audience? Please go ahead. 

Ben Parisot: Oh, wow! Um, yeah, sure. So I think your point about like understanding the purpose of the metrics is important. You know, my team, uh, I am the manager of a small team and a small company. I wear a lot of hats and I do a lot of different things for my team. They show me a lot of grace, I suppose, when I have, you know, incomplete data for them, I suppose. Like you said, there's a lot of tools out there that can provide a more holistic, uh, look. Um, and I think that if you are an agency, uh, if you're a manager on a small team and you sort of struggle to sort of keep up with all of the metrics that you have even promised for your team or that you know that you should be doing, uh, if you really focus on the ones that are impacting their day-to-day experience as well as like the value that they're providing for either, you know, your company's internal stakeholders or external clients, you're going to quickly see the metrics that are most important and your team is going to appreciate that you're focusing on those, and then, the rest of it is going to fall into place when it does. And when it doesn't, um, you know, your team's not going to really be too upset because they know, they see you focusing on the stuff that matters most to them. 

Kovid Batra: Great. Thanks a lot, Ben. Thank you so much for such great, insightful experiences that you have shared with us. And, uh, we wish you all the best, uh, and your kid a very happy birthday in advance. 

Ben Parisot: Thank you. 

Kovid Batra: All right, Ben. Thank you so much for your time. Have a great day. 

Ben Parisot: Yes. Thanks.

Top Platform Engineering KPIs You Need to Monitor

Platform Engineering is becoming increasingly crucial. According to the 2024 State of DevOps Report: The Evolution of Platform Engineering, 43% of organizations have had platform teams for 3-5 years. The field offers numerous benefits, such as faster time-to-market, enhanced developer happiness, and the elimination of team silos.

However, there is one critical piece of advice that Platform Engineers often overlook: treat your platform as an internal product and consider your wider teams as your customers.

So, how can they do this effectively? It's important to measure what’s working and what isn’t, using consistent indicators of success.

In this blog, we’ve curated the top platform engineering KPIs that software teams must monitor:

What is Platform Engineering?

Platform Engineering is an emerging technology approach that equips the software engineering team with all the resources required to automate the software development lifecycle end to end. The goal is to reduce overall cognitive load, enhance operational efficiency, and remove process bottlenecks by providing a reliable and scalable platform for building, deploying, and managing applications. 

Importance of Tracking Platform Engineering KPIs

Helps in Performance Monitoring and Optimization

Platform Engineering KPIs offer insights into how well the platform performs under various conditions. They also help to identify loopholes and areas that need optimization to ensure the platform runs efficiently.

Ensures Scalability and Capacity Planning

These metrics guide decisions on how to scale resources. They also support capacity planning, ensuring the platform can handle growth and increased load without performance degradation. 

Quality Assurance

Tracking KPIs ensures that the platform remains robust and maintainable. This further helps to reduce technical debt and improve the platform’s overall quality. 

Increases Productivity and Collaboration

They provide in-depth insights into how effectively the engineering team operates and help to identify areas for improvement in team dynamics and processes.

Fosters a Culture of Continuous Improvement

Regularly tracking and analyzing KPIs fosters a culture of continuous improvement, encouraging proactive problem-solving and innovation among platform engineers. 

Top Platform Engineering KPIs to Track 

Deployment Frequency 

Deployment Frequency measures how often code is deployed into production per week. It takes into account everything from bug fixes and capability improvements to new features. It is a key metric for understanding the agility and efficiency of development and operational processes and highlights the team’s ability to deliver updates and new features.

A higher frequency with minimal issues reflects mature CI/CD processes and shows how quickly platform engineering teams can adapt to changes. Regularly tracking and acting on Deployment Frequency supports continuous improvement, as frequent releases reduce the risk of large, disruptive changes and deliver value to end-users effectively. 

Lead Time for Changes

Lead Time is the duration between a code change being committed and its successful deployment to end-users. It correlates with both the speed and quality of the platform engineering team. A high lead time is a clear sign of roadblocks in the process and that the platform needs attention. 

A low lead time indicates that teams quickly adapt to feedback and deliver in a timely manner. It also gives teams the ability to make rapid changes, allowing them to adapt to evolving user needs and market conditions. Tracking it regularly helps in streamlining workflows and reducing bottlenecks. 

Change Failure Rate

Change Failure Rate refers to the proportion or percentage of deployments that result in failure or errors. It indicates the rate at which changes negatively impact the stability or functionality of the system. CFR also provides a clear view of the platform’s quality and stability, e.g., how much effort goes into addressing problems and releasing code.

A lower CFR indicates that deployments are reliable and that changes are thoroughly tested and less likely to cause issues in production. It also reflects well-functioning development and deployment processes, boosting team confidence and morale. 

Mean Time to Restore

Mean Time to Restore (MTTR) represents the average time taken to resolve a production failure or incident and restore normal system functionality. A low MTTR indicates that the platform is resilient, recovers quickly from issues, and has an efficient incident response. 

Faster recovery minimizes the impact on users, increasing their satisfaction and trust in the service. Moreover, it contributes to higher system uptime and availability and enhances your platform’s reputation, giving you a competitive edge. 

Resource Utilization 

This KPI tracks the usage of system resources. It is a critical metric for optimizing resource allocation and cost efficiency. Resource Utilization is about balancing several objectives with a fixed amount of resources. 

It allows platform engineers to distribute limited resources evenly and efficiently and understand where exactly to spend. Resource Utilization also aids in capacity planning and helps in avoiding potential bottlenecks. 
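As an illustration, here is a small Python sketch that averages utilization samples per resource and flags anything approaching capacity. The sample data and the 80% threshold are assumptions; real numbers would come from your monitoring stack.

```python
from statistics import mean

# Hypothetical utilization samples (percent) collected from monitoring (assumed data).
samples = {
    "cpu":    [42, 55, 61, 48, 70],
    "memory": [63, 66, 64, 69, 71],
    "disk":   [35, 36, 35, 37, 38],
}

def utilization_report(usage_samples, warn_threshold=80):
    """Print average utilization per resource, flagging anything near capacity."""
    for resource, values in usage_samples.items():
        avg = mean(values)
        flag = " (approaching capacity)" if avg >= warn_threshold else ""
        print(f"{resource}: {avg:.1f}% average utilization{flag}")

utilization_report(samples)
```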

Error Rates

Error Rates measure the number of errors encountered in the platform. They reflect the platform's stability, reliability, and user experience. High Error Rates indicate underlying problems that need immediate attention and can otherwise degrade the user experience, leading to frustration and potential loss of users.

Monitoring Error Rates helps in the early detection of issues, enabling proactive response, and preventing minor issues from escalating into major outages. It also provides valuable insights into system performance and creates a feedback loop that informs continuous improvement efforts. 
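For illustration, a minimal Python sketch that derives an error rate from request and error counts over a window and checks it against an alert threshold. The counts and the 1% threshold are assumptions; in practice these figures would come from your observability tooling.

```python
def error_rate(total_requests, error_count):
    """Errors as a percentage of all requests in the window."""
    if total_requests == 0:
        return 0.0
    return error_count / total_requests * 100

# Example window: 125,000 requests, 350 of which returned errors.
rate = error_rate(125_000, 350)
print(f"Error rate: {rate:.2f}%")  # 0.28%

# A simple alerting check against an assumed threshold.
THRESHOLD_PERCENT = 1.0
if rate > THRESHOLD_PERCENT:
    print("Error rate above threshold -- investigate recent changes.")
```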

Team Velocity 

Team Velocity is a critical metric that measures the amount of work completed in a given iteration (e.g., a sprint). It highlights developer productivity and efficiency and aids in planning and prioritizing future tasks. 

It helps to forecast the completion dates of larger projects or features, aiding in long-term planning and setting stakeholder expectations. Team Velocity also helps in understanding the platform team's capacity, making it easier to distribute tasks evenly and prevent overloading team members. 
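As a quick sketch, here is how average velocity and a rough backlog forecast might be computed in Python, assuming velocity is measured in story points per sprint (the numbers are invented):

```python
from statistics import mean

# Hypothetical story points completed in the last few sprints (assumed data).
completed_points = [34, 41, 38, 36, 40]

average_velocity = mean(completed_points)
print(f"Average velocity: {average_velocity:.1f} points/sprint")

# Rough forecast: how many sprints a 210-point backlog would take at this pace.
backlog_points = 210
sprints_needed = backlog_points / average_velocity
print(f"Estimated sprints to clear backlog: {sprints_needed:.1f}")
```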

How to Develop a Platform Engineering KPI Plan 

Define Objectives 

Firstly, ensure that the KPIs support the organization’s broader objectives. These might include improving system reliability, enhancing user experience, or increasing development efficiency. Always focus on metrics that reflect the unique aspects of platform engineering. 

Identify Key Performance Indicators 

Select KPIs that provide a comprehensive view of platform engineering performance. We’ve shared some critical KPIs above. Choose those KPIs that fit your objectives and other considered factors. 

Establish Baseline and Targets

Assess the current performance levels of your software engineers to establish baselines. Set realistic and achievable targets for each KPI, based on historical data, industry benchmarks, and business objectives.

Analyze and Interpret Data

Regularly analyze trends in the data to identify patterns, anomalies, and areas for improvement. Set up alerts for critical KPIs that require immediate attention. Don’t forget to conduct root cause analysis for any deviations from expected performance to understand underlying issues.

Review and Refine KPIs

Lastly, review the relevance and effectiveness of the KPIs periodically to ensure they align with business objectives and provide value. Adjust targets based on changes in business goals, market conditions, or team capacity.

Typo - An Effective Platform Engineering Tool 

Typo is an effective platform engineering tool that offers SDLC visibility, developer insights, and workflow automation to build better programs faster. It seamlessly integrates with the tech tool stack, including Git versioning, issue trackers, and CI/CD tools.

It also offers comprehensive insights into the deployment process through key metrics such as change failure rate, time to build, and deployment frequency. Moreover, its automated code tool helps identify issues in the code and auto-fixes them before you merge to master.

Typo has an effective sprint analysis feature that tracks and analyzes the team’s progress throughout a sprint. Besides this, it also provides a 360-degree view of the developer experience, i.e., it captures qualitative insights and provides an in-depth view of the real issues.

Conclusion

Monitoring the right KPIs is essential for successful platform teams. By treating your platform as an internal product and your teams as customers, you can focus on delivering value and driving continuous improvement. The KPIs discussed above provide a comprehensive view of your platform's performance and areas for enhancement. 

There are other KPIs available as well that we have not mentioned. Do your research and consider those that best suit your team and objectives.

All the best! 

‘Evolution of Software Testing: From Brick Phones to AI’ with Leigh Rathbone, Head of Quality Engineering at CAVU

In the latest episode of ‘groCTO: Originals’ (formerly ‘Beyond the Code: Originals’), host Kovid Batra engages with Leigh Rathbone, Head of Quality Engineering at CAVU, who has a rich technical background with reputable organizations like Sony Ericsson and The Very Group. He has been at the forefront of tech innovation, working on the first touchscreen smartphone and smartwatch, and later with AR, VR, & AI tech. The conversation revolves around ‘Evolution of Software Testing: From Brick Phones to AI’.

The podcast begins with Leigh introducing himself and sharing a life-defining moment in his career. He further highlights the shift from manual testing to automation, discussing in depth the automation framework for touchscreen smartphones from 19 years ago. Leigh also addresses the challenges of implementing AI and how to motivate teams to explore automation opportunities. He also discusses the evolution of AR, VR, and 3D gaming & their role in shaping modern-day testing practices, emphasizing the importance of health and safety considerations for testers.

Lastly, Leigh offers parting advice, urging software makers & testers to always prioritize user experience & code quality when creating software.

Timestamps

  • 00:06 - Leigh’s Introduction
  • 01:07 - Life-defining Moment in Leigh’s Career
  • 04:10 - Evolution of Software Testing
  • 09:20 - Role of AI in Testing
  • 11:14 - Conflicts with Implementing AI
  • 15:29 - Adapting to AI with Upskilling
  • 21:02 - Evolution of AR, VR & 3D Gaming
  • 25:45 - Unique Value of Humans in Testing
  • 32:58 - Conclusion & Parting Advice

Links and Mentions

Episode Transcript

Kovid Batra: Hi, everyone. This is Kovid, back with another episode of Beyond the Code by Typo. Today, we are lucky to have a tech industry veteran with us on our show today. He is the Head of Quality Engineering at CAVU. He has had fascinating 25-plus years of engineering and leadership experience, working on cutting-edge technologies including the world's first smartphone and smartwatch. He was also involved in the development of progressive download and DRM technologies that laid the groundwork for modern streaming services. We are grateful to have you on the show, Leigh. 

Leigh Rathbone: Thank you, Kovid. It's great to be here. I'm really happy to be invited. I'm looking forward to sharing a few experiences and a few stories in order to hopefully inspire and help other people in the tech industry. 

Kovid Batra: Great, Leigh. And today, I think they would have a lot of things to deep dive into and learn from you, from your experience. But before we go there, where we talk about the changing landscape of software testing, coming from brick phones to AI, let's get to know a little bit more about each other. Can you just tell us something about yourself, some of your life-defining experiences so that I and the audience can know you a little more? 

Leigh Rathbone: Yeah. Well, I'm Leigh Rathbone. I live in the UK, uh, in England. I live just North of a city called Liverpool. People might've heard of Liverpool because there's a few famous football teams that come from there, but there's also a famous musical band called the Beatles that came from Liverpool. So, I live just North of Liverpool. I have two children. I'm happily married, been married for over 20 years. I am actually an Aston Villa football fan. I don't support any of the Liverpool football clubs. I'm not a cricket fan or a rugby fan. It's 100 percent football for me. I do like a bit of fitness, so I like to get out on my bike. I like to go to the gym. I like to drink alcohol. Am I allowed to say that, Kovid? Am I allowed to say that? I do like a little bit of alcohol. Um, and like everybody else, I think I'm addicted to Netflix and all the streaming services, which is quite emotional for me, Kovid, because having played a part in the building blocks and a tiny, tiny part in the building blocks of what later became streaming, when I'm listening to Spotify or when I'm watching something on Amazon Video or Netflix, I do get a little bit emotional at times thinking, "Oh my God! I played a minute part of that technology that we now take for granted." So, that's my sort of out-of-work stuff that, um, I hope people will either find very boring or very interesting, I don't know. 

Kovid Batra: No, I definitely relate to it and I would love to know, like, which was the last, uh, series you watched or a movie you watched on Netflix and what did you love about it? 

Leigh Rathbone: So, I watched a film last night called 'No Escape'. Um, it's a family that goes to, uh, a country in Asia and they don't say the name of the country for legal reasons. Um, but they get captured in a hotel and it's how they escape from some terrorists in a hotel with the help of Brosnan, who's also in the film. So, yeah, it was, uh, it was high intensity, high energy and I think that's probably why I liked it because from the very first 5-10 minutes, it's like, whoa, what's going on here? So, it was called 'No Escape'. It's on Netflix in the UK. I don't know whether it'll be on Netflix across the world. But yeah, it's an old film. It's not new. I think it's about three years old. But yeah, it was quite enjoyable. 

Kovid Batra: Cool, cool. I think that that's really interesting and thank you for such a quick, sweet intro about yourself. And of course, your contributions are not minute. Uh, I'm sure you would have done that in that initial stage of tech when the technology was building up. So, thanks on behalf of the tech community there. 

Uh, all right, Leigh, thank you so much and let's get started on today's main topic. So, you come from a background where you have seen the evolution of this landscape of software testing and as I said earlier, also like from brick phones to AI, I'm sure, uh, you have a lot of experiences to share from the days when it all started. So, let's start from the part where there was no automation, there was manual testing, and how that evolved from manual testing to automation today, and how things are being balanced today because we are still not 100 percent automated. So, let's talk about something like, uh, your first smartphone, uh, maybe where you might not have all the automation, testing or sophisticated automation possible. How was your experience in that phase? 

Leigh Rathbone: Well, I am actually holding up for those people that, uh, can either watch the video. 

Kovid Batra: Oh my God! Oh my God! 

Leigh Rathbone: I'm holding up the world's first touchscreen smartphone and you can see my reflection and your reflection on the screen there. This is called the Sony Ericsson P800. I worked on this in 2002 and it hit the market in 2003 as the world's first touchscreen smartphone, way before Apple came to the market. But actually, if I could, Kovid, can I go back maybe four years before this? Because there's a story to be told around manual testing and automation before I got to this, because there is automation, there is an automation story for this phone as well. But if I can start in 1999, I've been in testing for 12 months and I moved around a lot in my first four years, Kovid because I wanted to build up my skillsets and the only way to do that was to move jobs. So, my first four years, I had four jobs. So, in 1999 I'm in my second job. I'm 12 months into my testing career and I explore a tool called WinRunner. I think it was the first automation tool. So, there I am in 1999, writing automation scripts without really knowing the scale of what was going to unfold in front of not just the testing community, but the tech community. And when I was using this tool called WinRunner. Oh, Kovid, it was horrible. Oh my God! So, I would be writing scripts and it was pretty much record and playback, okay? So, I was clicking around, I was looking at the code, I was going, "Oh, this is exciting." And every time a new release would come from the developers, none of my scripts would work. You know the story here, Kovid. As soon as a new release of code comes out, there's bug fixes, things move around on the screens, you know, different classes change, there might be new classes. This just knocks out all of my scripts. So, I'd spend the next sort of, I don't know, eight days, working, reworking, refactoring my automation scripts. And then, I just get to the point where I was tackling new scripts for the new code that dropped and a new drop of code would come. And I found myself in this cycle in 1999 of using WinRunner and getting a little bit excited but getting really frustrated. And I thought, "Where is this going to go? Has it got a future in tech? Has it got a future in testing?" Cause I'm not seeing the return on investment with the way I was using it. So, at that point in time, 1999, I saw a glimpse, a tiny glimpse of the future, Kovid. And that was 25 years ago. And for the next couple of years, I saw this slow introduction, very, very slow back then, Kovid, of manual testing and automation. And the two were very separate, and that continued for a long, long time, whereby you'd have manual testers and automation testers. And I feel that's almost leading and jumping ahead because I do want to talk about this phone, Kovid, because this phone was touchscreen, and we had automation in 2005. We built an automation framework bespoke to Sony Ericsson that would do stress testing, soak testing, um, you know, um, it would actually do functional automation testing on a touchscreen smartphone. Let that sink into 19 years ago. We built a bespoke automation framework for the touchscreen smartphone. Let that sink in folks. 

Kovid Batra: Yeah. 

Leigh Rathbone: Unreal, absolutely unreal, Kovid. Back in the day, that was pretty much unheard of. Unbelievable. It still blows my mind to this day that in 2005, 19 years ago, on a touchscreen smartphone, we had an automation framework that added loads and loads of value. 

Kovid Batra: Totally, totally. And was this your first project wherein you actually had a chance to work hands-on with this automation piece? Like, was that your first project? 

Leigh Rathbone: So, what happened here, Kovid, and this is a trend that happened throughout the testing and tech industry right up until about, I'd say that seven years ago, we had an automation team and a manual team. I'll give you some context for the size. The automation team was about five people. The manual test team was about 80 people. So, you can see the contrast there. So, they were doing pretty much what I was doing in 1999. They were writing some functional test scripts that we could use for regression testing. Uh, but they were mostly using it for soak testing. So in other words, random touches on the screen, these scripts needed to run for 24 hours in order for us to be able to say, "Yes, that can, that software will exist in live with a customer touching the screen for 24 hours without having memory leaks, as an example." So, their work felt very separate to what we were doing. There was a slight overlap with the functional testing where they'd take some of our scripts and turn them into, um, automated regression packs. But they were way ahead of the curve. They were using this automation pack for soak testing to make sure there was no memory leaks by randomly dibbing on a screen. And I say dibbing, Kovid, because you touched the screen with a dibber, right? It wasn't a finger. Yeah, you needed this little dibber that clicked onto the side of the phone in order to touch the screen. So, they managed to mimic random clicks on the screen in order to test for memory leaks. Fascinating. Absolutely fascinating. So at that point, we started to see a nice little return on investment on automation being used. 

Kovid Batra: Got it. Got it. And from there, how did it get picked up over the years? Like, how have teams collaborated? Was there any resistance from, of course, every time this automation piece comes in, uh, there is resistance also, right? People start pointing things. So, how was that journey at that point? 

Leigh Rathbone: I think there's always resistance to change and we'll see it with AI. When we come on to the AI section of the talk, we'll see it there. There will always be resistance to change because people go through fear when change is announced. So, if you're talking to a tester, a QA or a QE and you're saying, "Look, you're going to have to change your skill sets in order to learn this." They're gonna go through fear before they spot the opportunity and come up the other side. So, everywhere I've gone, there's been resistance to automation and there's something else here, Kovid, from the years 1998 to 2015, test teams were massive. They were huge. And because we were in the waterfall methodology, they were pretty much standalone teams and all the people that were in power of running these big teams, they had empires, and they didn't want to see those empires come down. So actually, resistance wasn't just sometimes from testers themselves, it was from the top, where they might say, "Oh, this might mean that the number of testers I need goes down, so, therefore, my empire shrinks." And there were test leaders out there, Kovid, doing that, very, very naughty people. Like, honestly, trying to protect their own, their own job, instead of thinking about the future. So, I saw some testers try and accelerate the use of automation. I also saw test leaders put the brakes on it because they were worried about the status of their jobs and the size of their empires. 

Kovid Batra: Right. And I think fast-forward to today, we won't take much longer to jump to the AI part here. Like, a lot of automation is already in place. According to your, uh, view of the tech industry right now, uh, let's say, if there are a hundred companies; out of those hundred, how many are at a scale where they have done like 80 percent or 90 percent of automation of testing? 

Leigh Rathbone: God! 80 to 90 percent automation of testing. You'll never ever reach that number because you can do infinite amounts of testing, okay? So, let's put that one out there. The question still stands up. You're asking of 100 companies, how many people have automation embedded in their DNA? So I would probably, yeah, I would probably say it's in the region of 70 to 80 percent. And I'd be, I wouldn't be surprised if it's higher, and I've got no data to back that up on. What I have got to back that up on is the fact that I've worked in 14 different companies, and I spend a lot of time in the tech industry, the tech communities, talking to other companies. And it's very rare now that you come across a company that doesn't have automation. 

But here's the twist, Kovid, there's a massive twist here. I don't employ automation testers, okay? So 2015, I made a conscious effort and decision not to employ automation testers. I actually employed testers who can do the exploratory side and the automation side. And that is a trend now, Kovid, that really is a thing. Not many companies now are after QAs that only do automation. They want QAs that can do the exploratory, the automation, a little bit of performance, a little bit of security, the people skills is obviously rife. You know, you've got to put those in there with everything else. 

Kovid Batra: Of course. 

Leigh Rathbone: Yeah. So for me now, this trend that sort of I spotted in 2014 and I started doing in 2015 and I've done it at every company I've been to, that really is the big trend in testers and QAs right now. 

Kovid Batra: Got it. I think it's more like, uh, it's an ever-growing evolutionary discipline, right? Uh, every time you explore new use cases and it also depends on the kind of business, the products the company is rolling out. If there are new use cases coming in, if there are new products coming in, every time you can just have everything automated. So yeah, I mean, uh, having that 80-90% testing scale automated is something quite far-fetched for most of the teams, until and unless you are stagnated over one product and you're just running it for years and years, which is definitely not, uh, sustainable for any business. 

So here, my question would be, like, how do you ensure that your teams are always up for exploring that side which can be automated and making sure that it's being done? So, how do you make sure? One is, of course, having the right hires in the team, but what are the processes and what are the workflows that you implement there from time to time? People are, of course, doing the manual testing also and with the existing use cases where they can automate it. They're doing that as well. 

Leigh Rathbone: It's a really good question, Kovid, and I'll just roll back in the process a little bit because for me, automation is not just the QA person's task and not even test automation. I believe that is a shared responsibility. So, quality is owned by everybody in the team and everyone plays their different parts. So for me, the automation starts right with the developers, to say, "Well, what are you automating? What are your developer checks that you're going to automate?" Because you don't want developers doing manual checks either. You want them to automate as much as they can because at the unit test level and even the integration test level, the feedback loops are really quick. So, that means the test is really cheap. So, you're getting some really good, rich feedback initially to show that nothing obvious has broken early on when a developer adds new code. So, that's the first part. So, that now, I think is industry standard. There aren't many places where developers are sat there going, "I'm not going to write any tests at all." Those days are long, long gone, Kovid. I think all, you know, modern developers that live by the modern coding principles know that they have to write automated checks.

But I think your question is targeted at the QAs. So, how do we get QAs involved? So, you have to breed the curiosity gene in people, Kovid. So, you're absolutely right. You have to bring people in who have the skills because it's very, very hard to start with a team of 10 QAs where no one can automate. That's really hard. I've only ever done that once. That's really hard. So, what I have done is I've brought people in with the mindset of thinking about automation first. The mindset of collaborating with developers to see what they're automating. The curiosity and the skill sets to grow and develop and learn more about the tools. And then, you have to give people time, Kovid. There is no way you can expect people who don't have the automation skills to just upskill like that. It's just not fair. You have to support, support, and support some more. And that comes from myself giving people time. It's understanding how people learn, Kovid.

So, I'll give you an example. Pair learning. That's one technique where you can get somebody who can't automate and maybe you get them pairing with someone else who can't automate and you give them a course. That's point number one. Pair learning could be pairing with someone who does know automation and pairing them with someone who doesn't know automation. But guess what? Not everyone likes pairing because it's quite a stressful environment for some people. Jumping on a screen and sharing your screen while you type, and them saying, "No, you've typed that wrong." That can be quite stressful. So, some people prefer to learn in isolation but they like to do a brief course first, and then come back and actually start writing something right in the moment, like taking a ticket now that they're manually testing, and doing something and practising, then getting someone to peer review it. So, not everyone likes pair learning. Not everybody likes to learn in isolation. You have to understand your people. How do they like to learn and grow? And then, you have to understand, then you have to relay to them why you are asking them to learn and grow. Why am I asking people to change? 'cause the skill bases that's needed tomorrow and the day after and in two years time are different to the skill bases we need right now or even 10 years ago. And if people don't upskill, how are they going to stay relevant? 

Kovid Batra: Right. 

Leigh Rathbone: Everything is about staying relevant, especially when test automation came along, Kovid, and people were saying, "Hey, we won't need QAs because the automation will get rid of them." And you'd be amazed how many people believed that, Kovid, you'd be absolutely gobsmacked how many tech leaders had in their minds that automation would get rid of QAs. So, we've been fighting an uphill struggle since then to justify our existence in some cases, which is wrong because I think the value addition of QAs and all the crafts when they come together is brilliant. But for me, for people who struggle to understand why they need to upskill in automation, it's the need to stay relevant and keep adding value to the company that they're in.

Kovid Batra: Makes sense. And what about, uh, the changing landscape here? So basically, uh, you have seen that part where you moved to phones and when these phones were being built, you said like, that was the first time you built something for touchscreen testing, right? Now, I think in the last five to seven years, we have seen AR, VR coming into the picture, right? Uh, the processes that you are following, let's say the pair learning and other things that you bring along to make sure that the testing piece, the quality assurance piece is in place as you grow as a company, as you grow as a tech team. For VR, AR kind of technologies, how has it changed? How has it evolved? 

Leigh Rathbone: Well, massively, because if you think about testing back in the day, everybody tested on a screen. And most of us are still doing that. And this is why this phone felt different and even the world's first smartwatch, which is here. When I tested these two things, I wasn't testing on a screen. I was wearing it on my wrist, the watch, and I was using the phone in my hand in the environment that the end user would use it in. So, when I went to PlayStation, Kovid, and I was head of European Test Operations for Europe with PlayStation, we had a number of new technologies that came in and they changed the way we had to think about testing. So, I'll give you some examples. Uh, the PlayStation Move, where you had the two controllers that can control the game, uh, VR, AR, um, 3D gaming. Those four bits of technology, and I've only reeled off four, there was more. Just in three years at PlayStation, I saw how that changed testing. So, for VR and 3D, you've got to think about health and safety of the tester. Why? Because the VR has bugs in it, the 3D has bugs in it, so it makes the tester disorientated. You're wearing... They're not doing stuff through their eyes, their true eyes, they're doing it through a screen that has bugs in it, but the screen is right up and close to their eyes. So there was motion sickness to think about. And then, of course, there was the physical space that the testers were in. You can't test VR sat at a desk, you have to stand up. Look, because that's how the end users do it. When we tested the PlayStation Move with the two controllers, we had to build physical lounges for testers to then go into to test the Move because that's how gamers were going to use the game. Uh, I remember Microsoft saying that they actually went and bought a load of props for the Kinect. Um, so wigs and blow-up bodies to mimic different shapes of people's bodies because the camera needed to pick up everybody's style of hair, whether you're bald like me, or whether you have an afro, the camera needed to be able to pick you up. So all of a sudden, the whole testing dynamics have changed from just being 2 plus 2 equals 4 in a field, to actually can the camera recognize a bald, fat person playing the game. 

Everything changed. And this is what I mean. Now, it's performance. Uh, for VR, for augmented reality, mixed reality glasses, there's gonna be usability, there's gonna be performance, there's gonna be security. I'll give you one example if I can, Kovid. I'm walking down the road, and I'm wearing, uh, mixed reality glasses, and there's a person coming towards me in a t-shirt that I like, and all of a sudden, my pupils dilate, a bit of sweat comes out my wrist, That's data. That's collected by the wearable tech and the glasses. They know that I like that t-shirt. All of a sudden, at the top right-hand corner of those glasses, a picture of me wearing that t-shirt appears, and a voice appears on the arm and goes, "Would you like to purchase?" And I say, "Yes." And a purchase is made with no interaction with the phone. No interaction with anything except me saying 'yes' to a picture that appeared in the top right-hand corner of my phone. Performance was key there. Security was really key because there's a transaction of payments that's taken place. And usability, Kovid. If that picture appeared in the middle of the glasses, and appeared on both glasses, I'm walking out into the road in front of a bus, the bus is going to hit me, bang I'm dead because of usability. So, the world is changing how we need to think about quality and the user's experience with mixed reality, VR, AR is changed overnight.

Kovid Batra: I think I would like to go back to the point where you mentioned automation replacing humans, right? Uh, and that was a problem. And of course, that's not the reality, that cannot happen, but can we just deep dive into the testing and QA space itself and highlight what exactly today humans are doing that automation cannot replace? 

Leigh Rathbone: Ooh! Okay. Well, first of all, I think there's some things that need to be said before we answer that, Kovid. So, what's in your head? So, when I think of AI, when I think of companies, and this does answer your question, actually, every company that I've been into, and I've already mentioned that I've worked in a lot, the culture, the people, the tech stack, the customers, when you combine all of those together for each company, they're unique, absolutely unique to every single company. When you blend all of those and the culture and make a little pot of ingredients as to what that company is, it's unique. So, I think point number one is I think AI will always help and assist and is a tool to help everyone, but we have to remember, every company is unique and AI doesn't know that. So, AI is not a human being. AI is not creative. I think AI should be seen as a member of the team. Now if we took that mindset, would we invite everybody who's a member of the team into a meeting, into an agile ceremony, and then ignore one member of that team? We wouldn't, would we? So, AI is a tool and if we see it as a member of the team, not a human being, but a member of the team, why wouldn't we ask AI its opinion with everything that we do as QAs, but as an Agile team? So if we roll right back, even before a feature or an epic gets written, you can use AI for research. It's a member of the team. What do you think? What happened previously? It can give you trends. It can give you trends on bugs with previous projects that have been similar. So, you can use AI as a member of the team to help you before you even get going. What were the previous risks on this project that look similar? Then when you start getting to writing the stories, why wouldn't you ask AI its opinion? It's a member of the team. But guess what? Whatever it gives you, the team can then choose whether they want to use it, or tweak it, or not use it, just like any other member of the team. If I say this is my opinion, and I think we should write the story with this, the team might disagree. And I go, "Okay, let's go with the team." So, why don't we use AI as exactly the same, Kovid, and say, "When we're writing stories, let's ask it. In fact, let's ask it to start with 'cause it might get us into a place where we can refactor that story much quicker." Then when we write code, why aren't we as devs using AI as a member, doing pair programming with it? And if you're already pair programming with another developer, add AI as the third person to pair program with. It'll help you with writing code, spotting errors with code, peer reviews, pull requests. And then, when we come to tests, use it as a member of the team. " I'm new to learning Cypress, can you help me?" Goddamn right, it can. "I've written my first Cypress test. What have I done wrong?" It's just like asking another colleague. Right, except it's got a wider sort of knowledge base and a wider amount of parameters that it's pulling from. 

So for me, will AI replace people? Yes, absolutely. But not just in testing, not just in tech, AI has made things easily accessible to more people outside of tech as well. So, will it replace people's jobs? I'm afraid it will. Yes. But the people who survive this will be the ones who know how to use AI and treat it as a member of the team. Those people will be the last pots of people. They will be the ones who probably stay. AI will replace jobs. I don't care what people say, it will happen. Will it happen on a large scale? I don't know. And I don't think anyone does. But it will start reducing number of people in jobs, not just in tech. 

Kovid Batra: That would happen across all domains, actually. I think that that's very true. Yeah. 

So basically, I think it's more around the creativity piece, wherein if there are new use cases coming in, the AI is yet not there to make sure that you write the best, uh, test case for it and do the testing for you, or in fact, automate that piece for the coming, uh, use cases, but if the teams are using it wisely and as a team member, as you said, that's a very, very good analogy, by the way, uh, that's a great analogy. Uh, I think that's the best way to build that context for that team member so that they know what the whole journey has been while releasing an epic or a story, and then, probably they would have that creativity or that, uh, expertise to give you the use case and help you in a much better way than it could do today, like without hallucinating, without giving you results that are completely irrelevant. 

Leigh Rathbone: Yeah, I totally agree, Kovid. And I think this is, um, if you think about what companies should be doing, companies should be creating new code, new experiences for their customers, value add code. If we're just recreating old stuff, the company might not be around much longer. So, if we are creating new stuff, and let's make an assumption that, I don't know, 50 percent of code is actually new stuff that's never been out there before. Well, yeah, AI is going to struggle with knowing what to do or what automation test it could be. It can have a good stab because you can set parameters and you can say, you can give it a role, as an example. So, when you're working with chatGPT, you can say, as a professional tester or as a, you know, a long-term developer, what would be my mindset on writing JavaScript code for blah, blah, blah, blah? And it will have a good stab at doing it. But if it's for a space rocket that can go 20 times the speed of light, it might struggle because no one's done that yet and put the data back into the LLM, yet. 

Kovid Batra: Totally. Totally makes sense. Great. I think, Leigh, uh, with this thought, I think we'll bring our discussion to an end for today. I loved talking to you so much and I have to really appreciate the way you explain things. Great storytelling, great explanation. And you're one of the finest ones whom I have brought on the show, probably, so I would love to have another show with you, uh, and talk and deep dive more into such topics. But for today, I think we'll have to say goodbye to you, and before we say that, I would love for you to give our audience parting advice on how they should look at software quality testing in their career. 

Leigh Rathbone: I think that that's a really good question. I think the piece of advice, regardless of what craft you're doing in tech, always try and think quality and always put the customer at the heart of what you're trying to do because too many times we create software without thinking about the customer. I'll give you one example, Kovid, as a parting gift. Anybody can go and sit in a contact centre and watch how people in contact centres work, and you'll understand the thing that I'm saying, because we never, ever create software for people who work in contact centres. We always think we're creating software that's solving their problems, but you go and watch how they work. They work at speed. They'll have about three different systems open at once. They'll have a notepad open where they're copying and pasting stuff into. What a terrible user experience. Why? Because we've never created the software with them at the heart of what we were trying to do. And that's just one example, Kovid. The world is full of software examples where we do not put the customer first. So we all own quality, put the customer front and center. 

Kovid Batra: Great. I think, uh, best advice, not just in software testing or in general or any aspect of business that you're doing, but also I think in life you have to.. So I believe in this philosophy that if you're in this world, you have to give some value to this world and you can create value only if you understand your environment, your surroundings, your people. So, to always have that empathy, that understanding of what people expect from you and what value you want to deliver. So, I really second that thought, and it's very critical to building great pieces of software, uh, in the industry also. 

Leigh Rathbone: Well, Kovid, you've got a great value there and it ties very closely with people that write code, but leaders as well. So, developers should always leave the code base in a better state than they found it. And leaders should always leave their people in a much better place than when they found them or when they came into the company. So, I think your value is really strong there. 

Kovid Batra: Thank you so much. All right, Leigh, thank you. Thank you so much for your time today. It was great, great talking to you. Talk to you soon. 

Leigh Rathbone: Thank you, Kovid. Thank you. Bye. 

Importance of DORA Metrics for Boosting Tech Team Performance

DORA metrics serve as a compass for engineering teams, optimizing development and operations processes to enhance efficiency, reliability, and continuous improvement in software delivery.

In this blog, we explore how DORA metrics boost tech team performance by providing critical insights into software development and delivery processes.

What are DORA Metrics?

DORA metrics, developed by the DevOps Research and Assessment team, are a set of key performance indicators that measure the effectiveness and efficiency of software development and delivery processes. They provide a data-driven approach to evaluate the impact of operational practices on software delivery performance.

Four Key DORA Metrics

  • Deployment Frequency: It measures how often code is deployed into production per week. 
  • Lead Time for Changes: It measures the time it takes for code changes to move from inception to deployment. 
  • Change Failure Rate: It measures the percentage of deployments that cause a failure in production, reflecting the quality of the code being released.
  • Mean Time to Recover: It measures the time to recover a system or service after an incident or failure in production.

In 2021, the DORA Team added Reliability as a fifth metric. It is based upon how well the user’s expectations are met, such as availability and performance, and measures modern operational practices.

How do DORA Metrics Drive Performance Improvement for Tech Teams? 

Here’s how key DORA metrics help in boosting performance for tech teams: 

Deployment Frequency 

Deployment Frequency is used to track the rate of change in software development and to highlight potential areas for improvement. A wrong approach in the first key metric can degrade the other DORA metrics.

One deployment per week is standard. However, it also depends on the type of product.
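
To make the calculation concrete, here is a minimal sketch in Python, assuming deployment timestamps have already been exported from a CI/CD system (the dates below are placeholder values, not real data):

```python
from datetime import datetime

# Hypothetical deployment timestamps exported from a CI/CD system.
deployments = [
    datetime(2024, 6, 3), datetime(2024, 6, 5), datetime(2024, 6, 7),
    datetime(2024, 6, 11), datetime(2024, 6, 14), datetime(2024, 6, 20),
]

# Deployment Frequency = number of deployments / number of weeks observed.
period_days = (max(deployments) - min(deployments)).days + 1
weeks_observed = max(period_days / 7, 1)
deployments_per_week = len(deployments) / weeks_observed

print(f"Deployment Frequency: {deployments_per_week:.1f} per week")
```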

How does it Drive Performance Improvement? 

  • Frequent deployments allow development teams to deliver new features and updates to end-users quickly, enabling them to respond to market demands and feedback promptly.
  • Regular deployments make changes smaller and more manageable, reducing the risk of errors and making issues easier to identify and fix.
  • Frequent releases offer continuous feedback on the software’s performance and quality. This facilitates continuous improvement and innovation.

Lead Time for Changes

Lead Time for Changes is a critical metric used to measure the efficiency and speed of software delivery. It is the duration between a code change being committed and its successful deployment to end-users. 

The standard for Lead Time for Changes is less than one day for elite performers and between one day and one week for high performers.
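
As a rough illustration, assuming you already have commit and deployment timestamps for each change (the sample values are made up), the metric and the elite-performer check could be computed like this:

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical (commit_time, deploy_time) pairs for recent changes.
changes = [
    (datetime(2024, 6, 3, 10, 0), datetime(2024, 6, 3, 16, 30)),
    (datetime(2024, 6, 4, 9, 15), datetime(2024, 6, 5, 11, 0)),
    (datetime(2024, 6, 6, 14, 0), datetime(2024, 6, 6, 18, 45)),
]

# Lead Time for Changes = time from commit to successful deployment.
lead_times = [deploy - commit for commit, deploy in changes]
median_lead_time = median(lead_times)

print(f"Median lead time: {median_lead_time}")
print("Elite level (< 1 day):", median_lead_time < timedelta(days=1))
```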

How does it Drive Performance Improvement? 

  • Shorter lead times indicate that new features and bug fixes reach customers faster, enhancing customer satisfaction and competitive advantage.
  • Reducing lead time highlights inefficiencies in the development process, which further prompts software teams to streamline workflows and eliminate bottlenecks.
  • A shorter lead time allows teams to quickly address critical issues and adapt to changes in requirements or market conditions.

Change Failure Rate

CFR, or Change Failure Rate, measures the frequency at which newly deployed changes lead to failures, glitches, or unexpected outcomes in the IT environment.

A CFR between 0% and 15% is considered a good indicator of code quality.
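
A minimal sketch of the calculation, assuming each production deployment has been labelled as failed or successful (the sample data is illustrative only):

```python
# Hypothetical outcome of each production deployment:
# True if it caused a failure (incident, rollback, hotfix), False otherwise.
deployment_outcomes = [False, False, True, False, False,
                       False, False, False, False, False]

# Change Failure Rate = failed deployments / total deployments.
change_failure_rate = sum(deployment_outcomes) / len(deployment_outcomes)

print(f"Change Failure Rate: {change_failure_rate:.0%}")
print("Within the 0-15% target:", change_failure_rate <= 0.15)
```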

How does it Drive Performance Improvement? 

  • A lower change failure rate highlights higher quality changes and a more stable production environment.
  • Measuring this metric helps teams identify bottlenecks in their development process and improve testing and validation practices.
  • Reducing the change failure rate enhances the confidence of both the development team and stakeholders in the reliability of deployments.

Mean Time to Recover 

MTTR, which stands for Mean Time to Recover, is a valuable metric that provides crucial insights into an engineering team's incident response and resolution capabilities.

An MTTR of less than one hour is considered the standard for teams.  
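
Here is a small sketch, assuming incident detection and restoration times are available from an incident log (the timestamps are placeholders):

```python
from datetime import datetime, timedelta

# Hypothetical production incidents: (detected_at, restored_at).
incidents = [
    (datetime(2024, 6, 3, 14, 0), datetime(2024, 6, 3, 14, 35)),
    (datetime(2024, 6, 10, 9, 20), datetime(2024, 6, 10, 10, 5)),
    (datetime(2024, 6, 18, 22, 10), datetime(2024, 6, 18, 22, 50)),
]

# Mean Time to Recover = total restoration time / number of incidents.
total_downtime = sum((restored - detected for detected, restored in incidents), timedelta())
mttr = total_downtime / len(incidents)

print(f"MTTR: {mttr}")
print("Under the one-hour benchmark:", mttr < timedelta(hours=1))
```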

How does it Drive Performance Improvement? 

  • Reducing MTTR boosts the overall resilience of the system, ensuring that services are restored quickly and downtime is minimized.
  • Users experience less disruption due to quick recovery from failures. This helps in maintaining customer trust and satisfaction. 
  • Tracking MTTR encourages teams to analyze failures, learn from incidents, and implement preventative measures to avoid similar issues in the future.

How to Implement DORA Metrics in Tech Teams? 

Collect the DORA Metrics 

Firstly, you need to collect DORA Metrics effectively. This can be done by integrating tools and systems to gather data on key DORA metrics. There are various DORA metrics trackers in the market that make it easier for development teams to automatically get visual insights in a single dashboard. The aim is to collect the data consistently over time to establish trends and benchmarks. 
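
Whatever tooling you integrate, it helps to normalise the raw events into one consistent record format so all four metrics can be derived from the same data. The schema below is only an assumed example, not a standard:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DeploymentEvent:
    """One production deployment, normalised from CI/CD and incident tooling."""
    service: str
    commit_time: datetime                      # when the change was committed
    deploy_time: datetime                      # when it reached production
    caused_failure: bool = False               # did this change trigger an incident?
    restored_time: Optional[datetime] = None   # when service was restored, if it failed
```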

Analyze the DORA Metrics 

The next step is to analyze them to understand your development team's performance. Start by comparing metrics to the DORA benchmarks to see if the team is an Elite, High, Medium, or Low performer. Make sure to look at the metrics holistically, as improvements in one area may come at the expense of another. So, always strive for balanced improvements. Regularly review the collected metrics to identify areas that need the most improvement and prioritize them first. Don't forget to track the metrics over time to see if the improvement efforts are working.
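
For the benchmarking step, a deliberately simplified classifier for Lead Time for Changes might look like the sketch below; the elite and high thresholds follow the values quoted earlier in this article, while the medium and low cut-offs are assumptions for illustration:

```python
from datetime import timedelta

def lead_time_tier(median_lead_time: timedelta) -> str:
    """Classify Lead Time for Changes into a DORA-style performance tier."""
    if median_lead_time < timedelta(days=1):
        return "Elite"
    if median_lead_time <= timedelta(weeks=1):
        return "High"
    if median_lead_time <= timedelta(days=30):   # assumed medium cut-off
        return "Medium"
    return "Low"

print(lead_time_tier(timedelta(hours=18)))  # Elite
print(lead_time_tier(timedelta(days=4)))    # High
```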

Drive Improvements and Foster a DevOps Culture 

Leverage the DORA metrics to drive continuous improvement in engineering practices. Discuss what’s working and what’s not and set goals to improve metric scores over time. Don’t use DORA metrics on a sole basis. Tie it with other engineering metrics to measure it holistically and experiment with changes to tools, processes, and culture. 

Encourage practices like: 

  • Implementing small changes and measuring their impact.
  • Sharing the DORA metrics transparently with the team to foster a culture of continuous improvement.
  • Promoting cross-collaboration between development and operations teams.
  • Focusing on learning from failures rather than assigning blame.

Typo - A Leading DORA Metrics Tracker 

Typo is a powerful tool designed specifically for tracking and analyzing DORA metrics, providing an alternative and efficient solution for development teams seeking precision in their DevOps performance measurement.

  • With pre-built integrations in the dev tool stack, the DORA dashboard provides all the relevant data flowing in within minutes.
  • It helps in deep diving and correlating different metrics to identify real-time bottlenecks, sprint delays, blocked PRs, deployment efficiency, and much more from a single dashboard.
  • The dashboard sets custom improvement goals for each team and tracks their success in real time.
  • It gives real-time visibility into a team’s KPI and lets them make informed decisions.

Conclusion

DORA metrics are not just metrics; they are strategic navigators guiding tech teams toward optimized software delivery. By focusing on key DORA metrics, tech teams can pinpoint bottlenecks and drive sustainable performance enhancements. 

The Fifth DORA Metric: Reliability

The DORA (DevOps Research and Assessment) metrics have emerged as a north star for assessing software delivery performance. The fifth metric, Reliability, is often overlooked because it was added after the DORA research team's original announcement. 

In this blog, let’s explore Reliability and its importance for software development teams. 

What are DORA Metrics? 

DevOps Research and Assessment (DORA) metrics are a compass for engineering teams striving to optimize their development and operations processes.

In 2015, the DORA (DevOps Research and Assessment) team was founded by Gene Kim, Jez Humble, and Dr. Nicole Forsgren to evaluate and improve software development practices. The aim is to enhance the understanding of how development teams can deliver software faster, more reliably, and with higher quality.

Four key metrics are: 

  • Deployment Frequency: Deployment frequency measures the rate of change in software development and highlights potential bottlenecks. It is a key indicator of agility and efficiency. Regular deployments signify a streamlined pipeline, allowing teams to deliver features and updates faster.
  • Lead Time for Changes: Lead Time for Changes measures the time it takes for code changes to move from inception to deployment. It tracks the speed and efficiency of software delivery and offers valuable insights into the effectiveness of development processes, deployment pipelines, and release strategies.
  • Change Failure Rate: Change failure rate measures the frequency at which newly deployed changes lead to failures, glitches, or unexpected outcomes in the IT environment. It reflects the reliability and efficiency and is related to team capacity, code complexity, and process efficiency, impacting speed and quality.
  • Mean Time to Recover: Mean Time to Recover measures the average duration taken by a system or application to recover from a failure or incident. It concentrates on determining the efficiency and effectiveness of an organization's incident response and resolution procedures.

What is Reliability?

Reliability is a fifth metric that was added by the DORA team in 2021. It is based upon how well your user's expectations are met, such as availability and performance, and measures modern operational practices. It doesn't have standard quantifiable targets for performance levels; rather, it depends upon service level indicators (SLIs) and service level objectives (SLOs). 

While the first four DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Recover) target speed and efficiency, reliability focuses on system health, production readiness, and stability for delivering software products.  

Reliability comprises various metrics used to assess operational performance, including availability, latency, performance, and scalability, which measure user-facing behavior against software SLAs, performance targets, and error budgets. It has a substantial impact on customer retention and success. 

Indicators to Follow when Measuring Reliability

A few indicators include:

  • Availability: How long the software was available without incurring any downtime.
  • Error Rates: Number of times software fails or produces incorrect results in a given period. 
  • Mean Time Between Failures (MTBF): The average time that passes between software breakdowns or failures. 
  • Mean Time to Recover (MTTR): The average time it takes for the software to recover from a failure. 

These metrics provide a holistic view of software reliability by measuring different aspects such as failure frequency, downtime, and the ability to quickly restore service. Tracking these indicators can help identify reliability issues, meet service level agreements, and enhance the software's overall quality and stability. 
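
A back-of-the-envelope sketch of these indicators, assuming you log the observation window, downtime per failure, and request error counts (all figures below are placeholders):

```python
# Hypothetical monthly figures for one service (placeholder values).
window_minutes = 30 * 24 * 60          # 30-day observation window
downtimes_minutes = [12, 45, 8]        # downtime per failure, in minutes
total_requests = 1_200_000
failed_requests = 480

uptime_minutes = window_minutes - sum(downtimes_minutes)
availability = uptime_minutes / window_minutes                     # Availability
error_rate = failed_requests / total_requests                      # Error rate
mtbf_hours = uptime_minutes / len(downtimes_minutes) / 60          # MTBF
mttr_minutes = sum(downtimes_minutes) / len(downtimes_minutes)     # MTTR

print(f"Availability: {availability:.3%}")
print(f"Error rate: {error_rate:.4%}")
print(f"MTBF: {mtbf_hours:.1f} hours, MTTR: {mttr_minutes:.1f} minutes")
```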

Impact of Reliability on Overall DevOps Performance 

The fifth DevOps metric, Reliability, significantly impacts overall performance. Here are a few ways: 

Enhances Customer Experience

Tracking reliability metrics like uptime, error rates, and mean time to recovery allows DevOps teams to proactively identify and address issues, ensuring a positive customer experience and meeting user expectations. 

Increases Operational Efficiency

Automating monitoring, incident response, and recovery processes helps DevOps teams to focus more on innovation and delivering new features rather than firefighting. This boosts overall operational efficiency.

Better Team Collaboration

Reliability metrics promote a culture of continuous learning and improvement. This breaks down silos between development and operations, fostering better collaboration across the entire DevOps organization.

Reduces Costs

Reliable systems experience fewer failures and less downtime, translating to lower costs for incident response, lost productivity, and customer churn. Investing in reliability metrics pays off through overall cost savings. 

Fosters Continuous Improvement

Reliability metrics offer valuable insights into system performance and bottlenecks. Continuously monitoring these metrics can help identify patterns and root causes of failures, leading to more informed decision-making and continuous improvement efforts.

Role of Reliability in Distinguishing Elite Performers from Low Performers

Importance of Reliability for Elite Performers

  • Reliability provides a more holistic view of software delivery performance. Besides capturing velocity and stability, it also takes into consideration the ability to consistently deliver reliable services to users. 
  • Elite-performing teams deploy quickly with high stability and also demonstrate strong operational reliability. They can quickly detect and resolve incidents, minimizing disruptions to the user experience.
  • Low-performing teams may struggle with reliability. This leads to more frequent incidents, longer recovery times, and overall less reliable service for customers.

Distinguishing Elite from Low Performers

  • Elite teams excel across all five DORA Metrics. 
  • Low performers may have acceptable velocity metrics but struggle with stability and reliability. This results in more incidents, longer recovery times, and an overall less reliable service.
  • The reliability metric helps identify teams that have mastered both the development and operational aspects of software delivery. 

Conclusion 

The reliability metric with the other four DORA DevOps metrics offers a more comprehensive evaluation of software delivery performance. By focusing on system health, stability, and the ability to meet user expectations, this metric provides valuable insights into operational practices and their impact on customer satisfaction. 

How to Become a Successful Platform Engineer

Platform engineering is a relatively new and evolving field in the tech industry. While it offers many opportunities, certain aspects are often overlooked.

In this blog, we discuss effective strategies for becoming a successful platform engineer and identify common pitfalls to avoid.

What is a Platform Engineer? 

Platform engineering, an emerging technology approach, equips software engineering teams with all the resources they need to perform end-to-end automation of the software development lifecycle. The goal is to reduce overall cognitive load, enhance operational efficiency, and remove process bottlenecks by providing a reliable and scalable platform for building, deploying, and managing applications. 

Strategies for Being a Great Platform Engineer

Keeping the Entire Engineering Organization Up-to-Date with Platform Team Insights

One important tip for becoming a great platform engineer is informing the entire engineering organization about platform team initiatives. This fosters transparency, alignment, and cross-team collaboration, ensuring everyone is on the same page. When everyone is aware of what's happening in the platform team, it allows them to plan tasks effectively, offer feedback, raise concerns early, and minimize duplication of efforts. As a result, everyone gains a shared understanding of the platform, its goals, and its challenges. 

Ensure Your Team Possesses Diverse Skills

When everyone on the platform engineering team has varied skill sets, it brings a variety of perspectives and expertise to the table. This further helps in solving problems creatively and approaching challenges from multiple angles. 

It also lets the team handle a wide range of tasks, such as architecture design and maintenance, effectively. Furthermore, team members can learn from each other (and so can you!), which improves collaboration and helps the team understand and address user needs comprehensively.

Automate as much as Possible 

Pull requests and code reviews, when handled manually, take a lot of the team's time and effort, which is an important reason to use automation tools. Automation allows you to focus on more strategic, high-value tasks and lets you handle an increased workload. It also helps accelerate development cycles and time to market for new features and updates, which optimizes resource utilization and reduces operational costs over time. 
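
As one small, hedged example of the kind of check worth automating, the script below flags oversized pull requests before human review; the 400-line limit and the use of git diff --shortstat are assumptions for illustration, not a prescribed workflow:

```python
import subprocess
import sys

MAX_CHANGED_LINES = 400  # assumed team convention; tune to taste

def changed_lines(base: str = "origin/main") -> int:
    """Count added + deleted lines on the current branch versus the base branch."""
    out = subprocess.run(
        ["git", "diff", "--shortstat", base],
        capture_output=True, text=True, check=True,
    ).stdout
    # --shortstat output looks like: " 3 files changed, 120 insertions(+), 15 deletions(-)"
    numbers = [int(tok) for tok in out.replace(",", " ").split() if tok.isdigit()]
    return sum(numbers[1:]) if len(numbers) > 1 else 0

if __name__ == "__main__":
    total = changed_lines()
    if total > MAX_CHANGED_LINES:
        print(f"PR is too large for a focused review: {total} changed lines")
        sys.exit(1)
    print(f"PR size OK: {total} changed lines")
```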

Maintain a DevOps Culture 

Platform engineering isn't only about building the underlying tools; it also means maintaining a DevOps culture. You must partner with development, security, and operations teams to improve efficiency and performance. This enables the right conversations for discovering bottlenecks, allows flexibility in tool choices, and reinforces positive collaboration among teams. 

Moreover, it encourages a feedback-driven culture where teams can continuously learn and improve. As a result, the team's efforts stay closely aligned with customer requirements and business objectives. 

Stay Abreast of Industry Trends

To be a successful platform engineer, it's important to stay current with the latest trends and technologies. Attending tech workshops, webinars, and conferences is an excellent way to keep up with industry developments. Besides these, you can read blogs, follow tech influencers, listen to podcasts, and join online discussions to improve your knowledge and stay ahead of industry trends.

Moreover, collaborating with a team that possesses diverse skill sets can help you identify areas that require upskilling and introduce you to new tools, frameworks, and best practices. This combined approach enables you to better anticipate and meet customer needs and expectations.

Take Everything into Consideration and Measure Success Holistically 

Beyond DevOps metrics, consider factors like security improvements, cost optimization, and consistency across the organization. This holistic approach prevents overemphasis on a single area and helps identify potential risks and issues that might be overlooked when focusing solely on individual metrics. Additionally, it highlights areas for improvement and drives ongoing efficiency improvements across all dimensions of the platform.

Common Pitfalls that Platform Engineers Ignore 

Unable to Understand the Right Customers 

First things first, understand who your customers are. When platform teams prioritize features or improvements that don't meet software developers' needs, it negatively impacts their user experience. This can lead to poor user interfaces, inadequate documentation, and missing functionalities, directly affecting customers' productivity.

Therefore, it's essential to identify the target audience, understand their key requirements, and align with their requests. Ignoring this in the long run can result in low usage rates and a gradual erosion of customer trust.

Lack of Adequate Tools for Platform Teams

One of the common mistakes that platform engineers make is not giving engineering teams enough tooling or ownership. This makes it difficult for them to diagnose and fix issues in their code. It increases the likelihood of errors and downtime as teams may struggle to thoroughly test and monitor code. Not only this, they may also end up spending more time on manual processes and troubleshooting, which slows down the development cycle. 

Hence, it is always advisable to provide your team with enough tooling. Discuss with them what tooling they need, whether the existing ones are working fine, and what requirements they have. 

Too Much Planning, Not Enough Feature Releases

When a lot of time is spent on planning, it results in analysis paralysis, i.e., thinking too much about potential features and improvements rather than implementing and testing them. This leads to delayed deliveries, slowing down the development process and feedback loops. 

Early and frequent shipping creates the right feedback loops that can enhance the user experience and improve the platform continuously. These feature releases must be prioritized based on how often certain deployment proceedings are performed. Make sure to involve the software developers as well to discover more effective solutions. 

Neglecting the Documentation Process

The documentation process is often underestimated. Platform engineers tend to believe that the platform is self-explanatory, but this isn't true. Everything related to the platform, its code, and its feature releases must be comprehensively documented. This is critical for onboarding, troubleshooting, and knowledge transfer. 

Well-written documents also help in establishing and maintaining consistent practices and standards across the team. They also give readers an understanding of the system's architecture, dependencies, and known issues. 

Relying Solely on Third Party Tools for Security

Platform engineers must take full ownership of security issues. A lack of accountability can result in increased security risks and vulnerabilities specific to the platform, and a limited understanding of these unique risks can leave the system exposed. 

But that doesn't mean platform engineers must stop using third-party tools. They should still leverage them; however, these tools need to be complemented by internal processes and knowledge, and integrated into the design, development, and deployment phases of platform engineering.  

Typo - An Effective Platform Engineering Tool 

Typo is an effective platform engineering tool that offers SDLC visibility, developer insights, and workflow automation to build better programs faster. It integrates seamlessly with the tech tool stack, including Git versioning, issue trackers, and CI/CD tools.

It also offers comprehensive insights into the deployment process through key metrics such as change failure rate, time to build, and deployment frequency. Moreover, its automated code tool helps identify issues in the code and auto-fixes them before you merge to master.

Typo also has an effective sprint analysis feature that tracks and analyzes the team's progress throughout a sprint. Besides this, it provides a 360° view of the developer experience, i.e., it captures qualitative insights and offers an in-depth view of the real issues.

Conclusion 

Implementing these strategies will improve your success as a platform engineer. By prioritizing transparency, diverse skill sets, automation, and a DevOps culture, you can build a robust platform that meets evolving needs efficiently. Staying updated with industry trends and taking a holistic approach to success metrics ensures continuous improvement.

Be sure to avoid the common pitfalls as well. By addressing these challenges, you create a responsive, secure, and innovative platform environment.

Hope this helps. All the best! :)

‘Team Building 101: Communication & Innovation’ with Paul Lewis, CTO at Pythian

In the latest episode of the ‘groCTO: Originals’ podcast (formerly Beyond the Code), host Kovid Batra welcomes Paul Lewis, CTO at Pythian and board member at the Schulich School of Business, who brings extensive experience from companies like Hitachi Vantara & Davis + Henderson. The topic for today’s discussion is ‘Team Building 101: Communication & Innovation’.

The episode begins with an introduction to Paul, offering insights into his background. During the discussion, Paul stresses the foundational aspects of building strong tech teams, starting with trusted leadership and clearly defining vision and technology goals. He provides strategies for fostering effective processes and communication within large, hybrid, and remote teams, and explores methods for keeping developers motivated and aligned with the broader product vision. He also shares challenges he encountered while scaling at Pythian and discusses how his teams manage the balance between tech and business goals, emphasizing the need for innovation & planning for future tech.

Lastly, Paul advises aspiring tech leaders to prioritize communication skills alongside technical skills, underscoring the pivotal role of 'code communication' in shaping successful careers.

Timestamps

  • 00:05 - Paul’s introduction
  • 02:47 - Establishing a collaborative team culture
  • 07:01 - Adapting to business objectives
  • 10:00 - Aligning developers to the basic principles of the org
  • 12:57 - Hiring & talent acquisition strategy
  • 17:31 - Processes & communication in growing teams
  • 22:15 - Communicating & imbibing team values
  • 24:33 - Challenges faced at Pythian
  • 26:00 - Aligning tech innovation with business requirements
  • 30:24 - Parting advice for aspiring tech leaders

Links and Mentions

Episode Transcript

Kovid Batra: Hi, everyone. This is Kovid, back with another episode of Beyond the Code by Typo. Today with us, we have a special guest. He has more than 25 years of engineering leadership experience. He has been a CTO for organizations like Hitachi Vantara and today, he's working as a CTO with Pythian. Welcome to the show. Great to have you here, Paul. 

Paul Lewis: Hi there. Great to be here. And sadly, it's slightly more than 30 years versus 25 years. Don't want to shame you. 

Kovid Batra: My bad. All right, Paul. So, before we dive into today's topic, by the way, today's topic, audience for you, uh, it's building tech teams from scratch. But before we move there, before we hear out Paul's thoughts on that, uh, Paul, can you give us a quick intro about yourself? Or maybe you would like to share some life-defining moments of your life. Can you just give us a quick intro there? 

Paul Lewis: Sure. Sure. So I've been doing this for a long time, as we just mentioned. Uh, but I've, I've had the privilege of seeing sort of the spectrum of technology. First 17 years in IT, like 5,000 workloads and 29 data centers. You know, I was involved in the purchase of billions of dollars of hardware and software and services, and then moving to Hitachi, a decade of OT, right? So, I get to see what technology looks like in the real world, the impact to, uh, sort of the human side of the world and nuclear power plants and manufacturing and hospitals.

Uh, and then, the last three years at Pythian, uh, which is more cloud and data services. So, I get to see sort of the insight side of the equation and what the new innovation and technology might look like in the real future. I do spend time with academics. I'm on the board of Schulich School of Business, Masters of Technology Leadership, and I spend time with the students on Masters of Management and AI, Masters of Business Analytics. 

And then, I spend at least once a quarter with a hundred CIOs and CTOs, right? So, we talk about trends, we talk about application, we talk about innovation. So, I get to see a lot of different dimensions of the technology side. 

Kovid Batra: Oh, that's great. Thanks for that quick intro. And of course, I feel that today I'm sitting in front of somebody who has immense experience, has seen that change when there was internet coming in to the point where AI is coming in. So, I'm sure there is a lot to learn today from you. 

Paul Lewis: That sounds like a very old statement. Yes, I have used mainframe. I have used AS400. 

Kovid Batra: I have no intentions. Great, man. Great. Thank you so much. So, let's get started. I think when we are talking about building teams from scratch. I think laying the foundation is the first thing that comes to mind, like having that culture, having that vision, right? That's how you define the foundation for any strong tech team that needs to come in. So, how do you establish that? How do you establish that collaborative, positive team culture in the beginning? And how do you ensure the team aligns with the overall vision of the business, the product. So, let's hear it from you. 

Paul Lewis: Sure. Well, realistically, I don't think you start with the team and team culture. I think you start with the team leadership. I know as recent in the last three years, when we built out the Pythian software engineering practice, well, I started by bringing in somebody who's worked for me and with me for 15 years, right, somebody who I trusted, who has an enterprise perspective of maturity, who I knew had a detailed implementation of building software, who has built, you know, hundreds of millions of dollars worth of software over a period of time, and I knew could determine what skill set was necessary. But in combination with that person, I also needed sort of the program management side because this practice didn't exist, there was no sense of communications or project agility or even project management capability. So, I had to bring in sort of a program management leadership and software delivery leadership, and then start the practice of building the team. And of course, it always starts with, well, what are we actually building? You can't just hire in technologists assuming that they'll be able to build everything. It's saying, what's our technology goal? What's our technology principles? What do we think the technology strategy should be to implement? You know, whatever software we think we want to build. And from that, we can say, well, we need at least, you know, these five different skill sets and let's bring in those five skill sets in order to coordinate sort of the creation of at the very least, you know, the estimates, the foundation of the software. 

Kovid Batra: Makes sense So, I think when you say bringing in that right leadership that's the first step. But then, with that leadership, your thought is to bring in a certain type of personality that would create the culture or you need people who align with your thoughts as a CTO, and then you bring those people in? I think I would want to understand that. 

Paul Lewis: I'm less sure you need to decide between the two. I know my choices usually are bringing in somebody to which already knows how to manage me. Right? As you can imagine, CTOs, CIOs have personalities and those personalities sometimes could be straining, sometimes can be motivational, sometimes could be inspirational, but I knew I need to bring somebody in that didn't have to, that already knew how to communicate with me effectively, that I already know knew my sort of expectations of delivery, expectations of quality, expectations of timeliness, expectations of adhering to technology principles and technology maturity. So, they passed that gate, right? So now, I had sort of right out of the gate trust between me and the leadership that was going to deliver on the software which is sort of the first requirement. From then, then I expect them to both evolve the maturity of the practice, in other words, the documentation and technology principles and build out the team itself from scratch. 

So, determine what five skills are necessary and then acquire those skills and bring them into the organization. It doesn't necessarily mean hiring. In fact, the vast majority of the software, which I've built over the time, we started with partnerships with ecosystems, right? So, ecosystems of QA partnerships and development partnerships. Bring those skill sets in and as we determine, we need sort of permanent Pythian resources like software architecture resources or DevOps architecture resources or, you know, skilled senior development that we start to hire them in our organization as being the primary decision-makers and primary implementers of technology. 

Kovid Batra: Makes sense. And, uh, Paul, does this change with the type of business the org is into or you look at it from a single lens, like if the tech team is there, it has to function like this, uh, does it change with the business or not? 

Paul Lewis: I think it changes based on the business objectives. So, some businesses are highly regulated and therefore, quality might be more important than others. The reality is, you know, the three triangles of time, cost, and quality. For the most part, quality is the most fungible, right? There are industries where I'm landing a plane where quality needs to be, you know, near zero bugs and then tech startups where there's an assumption that there'll be severe, if not damaging bugs in production, right, cause I'm trying to deploy a highly agile environment. So, yes, different organizations have different sort of, uh, appetites for time, cost, and quality. Quality being the biggest measure that you can sort of change the scale on. And the smaller the organization, the more agile it becomes, the more likelihood that you can do things quickly with, I'll call it less maturity, out of the gate, and assume that you can grow maturity over time. 

So, Pythian is an example. Out of the gate, we had a relatively zero sense of maturity, right? No documentation, no process, no real sort of project management implementation. It was really smart people doing really good work. Um, and then we said, "Wow, that's interesting. That's kind of a superhero implementation which just has no longevity to it because those superheroes could easily win the lottery and move on." Right? So, we had to think about, well, how do we move away from the superhero implementation to the repeatable scalable implementation, and that requires process, and I know development isn't a big fan of process holistically, but they are a fan of consistency, right? They are a fan of proper decision-making. They are a fan of documented decisions so that the next person who's auditing or reviewing or updating the code knows the purpose and value of that particular code, right? So, some things they enjoy, some things they don't, uh, but we can improve that maturity over time. So, I can say every year we want to go from 0 to 1, 1 to 2, 2 to 3, never to pass 3, right, because we're not, like Pythian, for example, isn't a bank, right, isn't an insurance company, isn't a telco, we're not landing planes, we're not solving, uh, we're not solving complex, uh, healthcare issues, so we don't have to be as mature as any one of those organizations, but we need to have documents at least, right, we need to ensure that we have automation, automated procedures to push to production instead of direct access, DBA access to the database in a production environment. So, that's kind of the evolution that we had. 

Kovid Batra: So, amazing to hear these kinds of thoughts and I'm just trying to capture how you are enabling your developers or how you are ensuring that your developers, your teams are aligned with a similar kind of thought. What's your style of communicating and imbibing that in the team? 

Paul Lewis: We like to do that with technology principles, written technology principles. So, think of it as a, you know, top 10 what the CTO thinks is the most important when we build software. Right? So what the top 10 things are, let's mutually agree that automation is key for everything we do, right? So, automation to move code, automation to implement code, uh, automation to test, automation in terms of reporting, but that's key. Top 10 is also we need to sort of implement security by design. We need to make sure that, um, it has a secure foundation because it's not just internal users, but we're deploying the software to 2,000 endpoints, and therefore, I need to appreciate endpoints to which I don't control, there I need, therefore I need a sort of a zero trust implementation. I need to make sure that I'm using current technology standards and architectural patterns, right? I want to make sure that I have implemented such a framework that I can easily hire other people who would be just as interested in seeing this technology and using technology, and we want to be in many ways, a beacon to new technologies. I want the software we build to be an inspirational part of why somebody would want to come to work at Pythian because they can see us using an innovating current practical architectural standards in the cloud, as an example.

So, you know, you go through those technology principles and you say, "This is what I think an ideal software engineering outcome, set of outcomes look like. Who wants to subscribe to that?" And then, you see the volunteers putting up their hands saying, "Yeah, I believe in these principles. These principles are what I would put in place, or I would expect if I was running a team, therefore I want to join." Does that make sense? 

Kovid Batra: Yeah, definitely. And I think these are the folks who then become the leaders for the next set of people who need to like follow them on it. 

Paul Lewis: Yeah, exactly. 

Kovid Batra: It totally makes sense. 

Paul Lewis: And if you have a set of rules, you know, I use the word 'rules', you know, loosely, I really just mean principles, right? To say, "Here are the set of things we believe and want to be true even if there's different maturities to them. Yes, we want a fully automated system, but year one, we don't. Year three, we might." Right? So, they know sometimes it's a goal, sometimes it's principle, sometimes it's a requirement. Right? We're not going to implement low-quality code, right? We're not going to implement unsecured code, but if you have a team to buy in those principles, then they know it's not just the outcome of the software they're building, but it's the outcome of the practice that they're building. 

Kovid Batra: Totally, totally. And when it comes to bringing that kind of people to the team, I think one is of course enabling the existing team members to abide by that and believe in those principles, but when you're hiring, there has to be a good talent acquisition strategy there, right? You can't just go on hiring, uh, if you are scaling, like you're on a hiring spree and you're just bringing in people. So how do you keep that quality check when people are coming in, joining in from the lowest level, like from the developer point, we should say, to the highest leadership level also, like what's your strategy there? How do you ensure this team-building? 

Paul Lewis: Well, on the recruiting side, we make sure we talk about our outcomes frequently, both internally in the organization and externally to, uh, you know, the world at large. So internally, I do like a CTO 'ask me anything', right? So, that's a full, everybody's, you know, full board, everybody can access it, everybody can, and it's almost like a townhall. That's where we do a couple of things. We disclose things I'm hearing externally that might be motivating, inspiring to you. It's, "Here's how we're maturing and the outcomes we've produced in software over this quarter.", let's say. And then, we'll do a technology debate to say, "You know what, there's a new principle I need to think about, and that new principle might be generative AI. Let's all jump in and have a, you know, a reasonably interesting technology debate on the best implications and applications of that technology. So, it's almost like they get to have a group think or group input into those technology principles before we write it down and put it into the document. And then if I present that, which I do frequently externally, then I gavel, you know, whole networks of people to say, "Wow, that is an interesting set of policies. That's an interesting set of, um, sort of guiding principles. I want to participate in that." And that actually creates recruiting opportunities. I get at least 50 percent of my LinkedIn, um, sort of contributions and engagements are from people saying, "I thought what you said was interesting. That sounds like a team I want to join. Do you have openings to make that happen?" Right? So, we actually don't have in many ways a lack of opportunity, recruiting opportunity. If anything, we might have too much opportunity. But that's how we create that engagement, that excitement, that motivation, that inspiration both internally and externally. 

Kovid Batra: And usually, like when everyone is getting hired in your team like, do you handpick, like at least one round is there which you take with the developers or are you relying mostly on your leadership next in line to take that up? How does that work for your team? 

Paul Lewis: I absolutely rely on my leadership team, mostly because they're pretty seasoned and they've worked with me for a while, right? So, they fully appreciate what kind of things that I would expect. There are some exceptions, right? So if there are some key technologists to which I think will drive inspirational, motivational behavior or where they are implementing sort of the core or complex patterns that I think are important for the software. So, things like, uh, software architecture, I would absolutely be involved in the software architecture conversations and recruiting and sort of interviewing and hiring process because it's not just about sort of their technology acumen, it's also about their communication capabilities, right? They're going to be part of the architectural review board, and therefore, I need to know whether they can motivate, inspire, and persuade other parts of the organization to make these decisions, right? That they can communicate both verbally and in the written form, that when they create an architectural diagram, it's easy to understand, sort of that side, and even sort of DevOps-type architects where, you know, automation is so key in most of the software we develop and that'll lead into, you know, not just infrastructure as code, but potentially even the AI deployment of infrastructure as code, which means not only do they need to have, you know, the technical chops now, I need them to be well read on what the technical jobs are needed for tomorrow. Right? That also becomes important. So, I'll get involved in those key resources that I know will have a pretty big impact on the future of the organization versus, you know, the developers, the QAs, the BAs, the product owners, the project managers, you know, I don't necessarily involved in every part of that interview process.

Kovid Batra: Totally, totally. I think one good point you just touched upon right now is about processes and the communication bit of it. Right? So, I think that's also very critical in a team, at least in large-scale teams, because as you grow, things are going hybrid, remote, and even then, the processes are, and the communication is becoming even more critical there, right? So, in your teams, how do you take up or how do you ensure that the right processes are there? I mean, you can give some examples, like how do you ensure that teams are not overloaded or in fact, the work is rightly distributed amongst the team and they're communicating well wherever there is a cross-functional requirement to be delivered and teams are communicating well, the requirements are coming in? So, something around process and communication which you are doing, I would love to hear that. 

Paul Lewis: Good questions. I think communication is on three fronts, but I'll talk about the internal communication first, the communication within the teams. We have a relatively unique set of sort of development processes that are federated. So, think of it as there is a software engineering team that is dedicated to do software engineering work, but for scale, we get to dip into the billable or the customer practices. So, if I need to deliver an epic or a series of stories that require more than one, uh, Go developer or data engineer or DevOps practitioner, then I have the ability to dip into those resources, into those practices, assign them temporarily to these epics and stories, uh, or just a ticket that they, that I want them to deliver on and then they can deliver on them as long as it's all, as long as everybody's already been trained on how to implement the appropriate architectural frameworks and that they're subscribing to the PR process that is equivalent, both internally and externally to the team. We do that with standard agile processes, right? We do standups on a daily basis. We break down all of our epics in the stories and we deliver stories in the tickets and tickets get assigned people, like this is a standard process with standard PM, with standard architectural frameworks, standard automation, deployments, and we have specific people assigned to do most of the PRs, right? So not only PR reviews, but doing sort of code, code creation and code deployment so that, you know, I rely on the experts to do the expert work, but we can reach out into those teams when we need to reach out to those teams and they can be temporary, right? I don't need to assign somebody for an entire eight-week journey. Um, I could just assign them to a particular story to implement that work, which is great. So, I can expand any one particular stream from five people to 15 people at any one period of time. That's kind of the internal communication.

So, they do standups. We do, you know fine-tuned documentation, uh, we have a pretty prescriptive understanding of what's in the backlog and how and what we have to deliver in the backlog. We actually detail a one-year plan with multiple releases. So, we actually have a pretty detailed, we'll call it 'product roadmap' or 'project roadmap' to deliver in the year, and therefore, it's pretty predictable. Every eight weeks we're delivering something new to production, right? But that's only one of those communication patterns. The other communication patterns all to the other internal technology teams, because we're talking about, you know, six, seven hundred internal technologists, and we want them to be aware of not just things that we've successfully implemented in the past and how it's working in production, but what the future looks like and how they might want to participate in the future functions and features that we deliver on. 

But even those two communication patterns arguably isn't the most important part. The most important part might actually be the communication up. Right? So now, I have to have communication on a quarterly basis with my peers, with the CEO and the CFO to say not only how well we're spending money, how well we're achieving our technological goals and our technological maturity, but even more importantly, are we getting the gain in the organization? So, are we matching the revenue growth of the organization? Are we creating the operational efficiency that we expect to create with the software, right? Cause I have to measure what I produce based on the value created, not just because I happen to like building software, and that's arguably the most difficult part, right, to say, "I built software for a purpose, an organizational purpose. Am I achieving the organizational goals?" Much more difficult calculus as compared to, "I said I was going to do five features. I delivered five features. Let's celebrate." 

Kovid Batra: But I think that's the tricky part, right? And as you said, it's the hardest part. How do you make sure, like, as in, see, the leaders probably are communicating with the business teams and they have that visibility into how it's going to impact the product or how it's going to impact the users, but when it comes to the developers particularly, uh, who are just coding on a day-to-day basis, how do you ensure that the teams are motivated that way and they are communicating on those lines of delivering the outcomes, which the leaders also see? So, that's.. 

Paul Lewis: Well, they get to participate in the backlog prioritization, right? So, in fact, I like to have most of the development team consider themselves in many ways, owners of the software. They might not have the Product Owner title, or they might not be the Product Manager of the product, but I want them to feel like it's theirs. And therefore, I need them to participate in architectural decisions. I want them to buy-in to what the priority of the next set of features are going to be. I want to be able to celebrate with them when they do something successful, right? I want them to be on the forefront of presenting the value back to the rest of the team, which is why that second communication to the rest of the, you know, six or seven hundred technologists that they're the ones presenting what they created versus I'm the one presenting their credit. I want them to be proud of the code that they've built, proud of the feature that they've implemented and talk about it as if it's something that they, you know, had to spend a good portion of their waking hours on, right? That there was a technology innovation or R&D exercises that they had to overcome. That's kind of the best part. So, they're motivated to participate in the, um, in the prioritization, they're motivated to implement good code, and then they're motivated to present that as if it was an offering they were presenting to a board of decision-makers, right? It's almost as if they're going and getting new money to do new work, right? So, it's a dragon's den kind of world, which I think they enjoy. 

Kovid Batra: No, I think that's a great thought and I think this makes them feel accountable. This makes them feel motivated to whatever they are doing, and at the end of the day, if the developers start thinking on those lines, I think you have cracked it, probably. That's the criteria for a successful engineering, for sure. 

Apart from that, any other challenges while you were scaling, any particular example from your journey at Pythian that you felt is worth sharing with our audience here?

Paul Lewis: The challenge is always the 'what next?'. Right? So let's say, it takes 24 months to build a substantial piece of software. Part of my job, my leadership's job is to prepare for the next two years, right? So, we're in deep dive, we're in year one, we're delivering, we're halfway delivering some interesting piece of software, but I need to prep year three and year four, right? I need to have the negotiations with my peers and my leaders to say, "Once we complete this journey, what's the next big thing on the list? How are we going to articulate the value to the organization, either in growth or efficiency? How are we going to determine how we spend? Is this a $1m piece of software, or is this a $10m piece of software?" And then, you know, preparing the team for the shift between development to steady state, right? From building something to running something. And that's a pretty big mindset, as you know, right? It's no longer focused on automation of releases between dev, QA & production. It's saying, "It's in production now. It's now locked down. I need you to refocus on development on something else and let some other team run this system." So, both sides of that equation, how do I move from build to run in that team? And then, how do I prepare for the next thing that they build? 

Kovid Batra: And how do you think your tech teams contribute here? Because what needs to be built next is something that always flows in, in terms of features or stories from the product teams, right? Or other business teams, right? Here, how do you operate? Like, in your org, let's say, there is a new technology which can completely revamp the way you have been delivering value so that tech team members are open to speak and like let the business people know that this is what we could do, or it's more like only the technical goals are being set by the tech team and rest of the product goals are given by the product team. How does it work for the, for your team here? 

Paul Lewis: It's pretty mutual in fairness, right? So, when we determine sort of the feature backlog of a piece of software, there's contribution for product management, think of that as the business, right? And the technology architecture team, right? So, we mutually determine what our next bet in terms of features that will both improve the application, functionally improve the application technically. So, that's good. 

When it comes to the bigger piece of software, so we want to put this software in steady state, do minor feature adjustments instead of major feature adjustments, that's when it requires much more of a, sort of a business headspace, right? Cause it's less about technology innovation at that point. However, sometimes it is, right? Sometimes I'll get, "Hey, what are we doing for generative AI? What new software can we build to be an exemplar of generative AI?" And I can bring that to the table. So, I have my team bringing to the decision-making table of, "Here's some technology innovation that's happening in the world that I think we should apply." And then, from the business saying, "Here's a set of business problems or revenue opportunities that we can match." So now, it's a matching process. How can I match generative AI, interesting technology with, you know, acquiring a new customer segment we currently don't acquire now, right? And so, how do I sort of bring both of those worlds together and say, "Given this match program, I'm going to circle what I think is possible within the budget that we have."? 

Kovid Batra: Right. Right. And my question is exactly this, like, what exactly makes sure that the innovation on technology and the requirements from the customer end are there on the same table, same decision-making table? So, if I'm a developer in your team, even if I'm well aware of the customer needs and requirements and I've seen certain new technologies coming up, trending up, how do I make sure that my thought is put on the right table in front of the right team and members? 

Paul Lewis: Well, fortunately, like most organizations, but definitely Pythian, we've created like an architectural review board, right? So, that includes development, architecture, product management, but it's not the executive team, right? It's the real architectural practitioners and they get to have the debate, discussion on what they think is the most technologically challenging that we want to solve or innovation that we think matters or evolution of technology that we think we want to implement within our technologies, moving from, you know, an IaaS to a PaaS to a Saas, as an example, those are all decisions that in many ways we let the technology practitioners make, and then they bring that set of decisions to the business to say, "Well, let's match this set of architectural principles with a bunch of business problems." Right? So, it's not top-down. It's not me saying, "Thou shalt build software. Thou shalt use Gen AI. Make it so." It rarely is that. It's the technology principle saying, "We think this innovation is occurring. It's a trend. It's important. We think we should apply it knowing its implications. Let's match that to one of a hundred business problems to which we know the business has, right? The reality is the business has an endless amount of technology problems, of business problems. Technology has an endless amount of innovation, right? 

Kovid Batra: Yeah, yeah. 

Paul Lewis: There's no shortlist in either of those equations. 

Kovid Batra: Correct. Absolutely. Perfect, perfect. I think this was great. I think I can go on talking with you. Uh, this is so interesting, but we'll take a hard stop here for today's episode and thank you so much for taking out time and sharing these thoughts with us, Paul. I would love to have you on another show with us, talking about many more problems of engineering teams. 

Paul Lewis: Sure. 

Kovid Batra: But thanks for today and it was great meeting you. Before you leave, um, is there a parting advice for our audience who are mostly like aspiring engineering managers, engineering leaders of the modern tech world? 

Paul Lewis: Um, the gap with most technologists, because they tend to be, you know, put their glasses on, close the lights in the room, focus on the code, that's amazing. But the best part of the thing you develop is the communication part. So, don't be just a 'code creator', be a 'code communicator'. That's the most important part of your career as a developer, is to present that wares that you just built outside of your own headspace. That's what makes the difference between a junior, an intermediate, senior developer and architect. So, think about that. 

Kovid Batra: Great, great piece of advice, Paul. Thank you so much. With that, we'll say, have a great evening. Have a great day and see you soon! 

Paul Lewis: Thank you.

What Lies Ahead: Predictions for DORA Metrics in DevOps

The DevOps Research and Assessment (DORA) metrics have long served as a guiding light for organizations to evaluate and enhance their software development practices.

As we look to the future, what changes lie ahead for DORA metrics amidst evolving DevOps trends? In this blog, we will explore the future landscape and strategize how businesses can stay at the forefront of innovation.

What Are DORA Metrics?

The widely used reference book for engineering leaders, Accelerate, introduced the four metrics defined by the DevOps Research and Assessment (DORA) group, known as the DORA 4 metrics.

These metrics were developed to assist engineering teams in determining two things:

  • The characteristics of a top-performing team.
  • How their performance compares to the rest of the industry.

Four key DevOps measurements:

Deployment Frequency

Deployment Frequency measures how often code is deployed to production or released to end-users in a given time frame. A higher deployment frequency indicates greater agility and the ability to respond quickly to market demands.

Lead Time for Changes

Lead Time for Changes measures the time between a commit being made and that commit making it to production. Short lead times in software development are crucial for success in today’s business environment. When changes are delivered rapidly, organizations can seize new opportunities, stay ahead of competitors, and generate more revenue.

Change Failure Rate

Change Failure Rate measures the proportion of deployments to production that result in degraded service. A lower change failure rate enhances user experience and builds trust by reducing failures and helping to allocate resources effectively.

Mean Time to Recover

Mean Time to Recover measures the time taken to recover from a failure, showing the team’s ability to respond to and fix issues. Optimizing MTTR aims to minimize downtime by resolving incidents quickly through production changes, which in turn improves user satisfaction.
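
To make these definitions concrete, here is a minimal Python sketch that computes the four metrics from simple timestamped records. The record shapes below (deployment timestamps with a failure flag, commit-to-deploy pairs, and incident detect/resolve times) are illustrative assumptions, not a prescribed data model; in practice this data would come from your CI/CD tool and incident tracker.

```python
from datetime import datetime, timedelta

# Assumed inputs: timestamped records exported from your pipeline and incident tracker.
deployments = [            # (deployed_at, caused_failure)
    (datetime(2024, 6, 3, 10), False),
    (datetime(2024, 6, 5, 15), True),
    (datetime(2024, 6, 12, 9), False),
]
changes = [                # (committed_at, deployed_at)
    (datetime(2024, 6, 2, 9), datetime(2024, 6, 3, 10)),
    (datetime(2024, 6, 4, 14), datetime(2024, 6, 5, 15)),
]
incidents = [              # (detected_at, resolved_at)
    (datetime(2024, 6, 5, 16), datetime(2024, 6, 5, 20)),
]
period_weeks = 2           # length of the reporting window

# Deployment Frequency: deployments per week over the window.
deployment_frequency = len(deployments) / period_weeks

# Lead Time for Changes: average commit-to-production time.
lead_time = sum((d - c for c, d in changes), timedelta()) / len(changes)

# Change Failure Rate: share of deployments that degraded service.
change_failure_rate = sum(1 for _, failed in deployments if failed) / len(deployments)

# Mean Time to Recover: average detection-to-resolution time.
mttr = sum((r - d for d, r in incidents), timedelta()) / len(incidents)

print(deployment_frequency, lead_time, change_failure_rate, mttr)
# 1.5 deployments/week, 1 day 1:00 lead time, ~0.33 failure rate, 4:00 MTTR
```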

In 2021, DORA introduced Reliability as the fifth metric for assessing software delivery performance.

Reliability

It measures modern operational practices and doesn’t have standard quantifiable targets for performance levels. Reliability comprises several metrics used to assess operational performance, including availability, latency, performance, and scalability, which measure user-facing behavior against software SLAs, performance targets, and error budgets.

DORA Metrics and Their Role in Measuring DevOps Performance

DORA metrics play a vital role in measuring DevOps performance. They provide quantitative, actionable insights into the effectiveness of an organization’s software delivery and operational capabilities.

  • They offer specific, quantifiable indicators that measure various aspects of the software development and delivery process.
  • They align DevOps practices with broader business objectives; metrics like high Deployment Frequency and low Lead Time indicate quick feature delivery and updates to end-users.
  • They provide data-driven insights that support informed decision-making at all levels of the organization.
  • They track progress over time, enabling teams to measure the effectiveness of implemented changes.
  • They help organizations understand and mitigate the risks associated with deploying new code. Aiming to reduce Change Failure Rate and Mean Time to Restore helps software teams increase systems’ reliability and stability.
  • Continuously monitoring DORA metrics helps identify trends and patterns over time, enabling teams to pinpoint inefficiencies and bottlenecks in their processes.

This further leads to:

  • Streamlined workflows and fewer failed changes, enabling quicker deployments.
  • A lower failure rate and improved recovery time, minimizing downtime and associated risks.
  • Better communication and collaboration between the development and operations teams.
  • Faster releases and fewer disruptions, contributing to a better user experience.

Key Predictions for DORA Metrics in DevOps

Increased Adoption of DORA metrics

One of the major predictions is that the use of DORA metrics in organizations will continue to rise. These metrics will also broaden their scope beyond the five key metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, Mean Time to Restore, and Reliability) to cover areas such as security and compliance.

Organizations will start integrating these metrics with their DevOps tools, and tracking and reporting on them to benchmark performance against industry leaders. This will allow software development teams to collect, analyze, and act on this data.

Emphasizing Observability and Monitoring

Observability and monitoring are becoming non-negotiable for organizations as systems grow more complex, making it challenging to understand a system’s state and diagnose issues without comprehensive observability.

Moreover, as businesses rely more heavily on digital services, the cost of downtime rises. Metrics like average detection and resolution times help pinpoint and rectify glitches at an early stage. Emphasizing observability and monitoring will, in turn, improve MTTR and CFR by enabling faster detection and diagnosis of issues.

Integration with SPACE Framework

Nowadays, organizations are seeking more comprehensive and accurate metrics to measure software delivery performance. As adoption of DORA metrics rises, they are expected to be integrated more closely with the SPACE framework.

Since DORA and SPACE are complementary in nature, integrating them provides a more holistic view. While DORA focuses on technical outcomes and efficiency, the SPACE framework offers a broader perspective that incorporates developer satisfaction, collaboration, and efficiency (the human factors). Together, they emphasize the importance of continuous improvement and faster feedback loops.

Merging with AI and ML Advancements

AI and ML technologies are advancing rapidly. By integrating these tools with DORA metrics, development teams can leverage predictive analytics, proactively identify potential issues, and promote AI-driven decision-making.

DevOps gathers extensive data from diverse sources, which AI and ML tools can process and analyze more efficiently than manual methods. These tools enable software teams to automate decisions based on DORA metrics. For instance, if a deployment is forecasted to have a high failure rate, the tool can automatically initiate additional testing or notify the relevant team member.
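
As a hedged illustration of that kind of automation (not a feature of any specific tool), the sketch below gates a deployment on a predicted failure probability. The threshold, the toy risk model, and the notification hook are all hypothetical placeholders you would replace with your own model and channels.

```python
FAILURE_RISK_THRESHOLD = 0.3  # assumed cut-off for "high failure risk"

def predict_failure_probability(meta: dict) -> float:
    """Toy stand-in for an ML model trained on historical DORA data.
    Here, larger changes and a worse recent failure rate raise the risk."""
    size_factor = min(meta["lines_changed"] / 1000, 1.0)
    return 0.5 * size_factor + 0.5 * meta["recent_change_failure_rate"]

def notify_team(team: str, message: str) -> None:
    print(f"[{team}] {message}")  # stand-in for a Slack/email/pager integration

def gate_deployment(meta: dict) -> str:
    """Decide whether a deployment can proceed or needs extra testing."""
    risk = predict_failure_probability(meta)
    if risk >= FAILURE_RISK_THRESHOLD:
        notify_team(meta["team"], f"Predicted failure risk {risk:.0%}; running extended test suite")
        return "extra-testing"
    return "proceed"

# Example: a large change from a team with a 20% recent failure rate gets flagged.
print(gate_deployment({"team": "payments", "lines_changed": 800, "recent_change_failure_rate": 0.2}))
```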

Furthermore, continuous analysis of DORA metrics allows teams to pinpoint areas for improvement in the development and deployment processes. They can also create dashboards that highlight key metrics and trends.

Emphasis on Cultural Transformation

DORA metrics alone are insufficient. Engineering teams need more than tools and processes. Soon, there will be a cultural transformation emphasizing teamwork, open communication, and collective accountability for results. Factors such as team morale, collaboration across departments, and psychological safety will be as crucial as operational metrics.

Collectively, these elements will facilitate data-driven decision-making, adaptability to change, experimentation with new concepts, and fostering continuous improvement.

Focus on Security Metrics

As cyber-attacks continue to increase, security is becoming a critical concern for organizations. Hence, a significant upcoming trend is the integration of security with DORA metrics. This means not only implementing but also continually measuring and improving these security practices. Such integration aims to provide a comprehensive view of software development performance. This also allows striking a balance between speed and efficiency on one hand, and security and risk management on the other.

How to Stay Ahead of the Curve?

Stay Informed

Continuously monitor industry trends, research, and case studies related to DORA metrics and DevOps practices.

Experiment and Implement

Don’t hesitate to pilot new DORA metrics and DevOps techniques within your organization to see what works best for your specific context.

Embrace Automation

Automate as much as possible in your software development and delivery pipeline to improve speed, reliability, and the ability to collect metrics effectively.

Collaborate across Teams

Foster collaboration between development, operations, and security teams to ensure alignment on DORA metrics goals and strategies.

Continuous Improvement

Regularly review and optimize your DORA metrics implementation based on feedback and new insights gained from data analysis.

Cultural Alignment

Promote a culture that values continuous improvement, learning, and transparency around DORA metrics to drive organizational alignment and success.

How Typo Leverages DORA Metrics?

Typo is an effective software engineering intelligence platform that offers SDLC visibility, developer insights, and workflow automation to build better programs faster. It offers comprehensive insights into the deployment process through key DORA metrics such as change failure rate, time to build, and deployment frequency.

DORA Metrics Dashboard

Typo’s DORA metrics dashboard has a user-friendly interface and robust features tailored for DevOps excellence. The dashboard pulls in data from all the sources and presents it in a visualized and detailed way to engineering leaders and the development team.

Comprehensive Visualization of Key Metrics

Typo’s dashboard provides clear and intuitive visualizations of the four key DORA metrics: Deployment Frequency, Change Failure Rate, Lead Time for Changes, and Mean Time to Restore.

Benchmarking for Context

By providing benchmarks, Typo allows teams to compare their performance against industry standards, helping them understand where they stand. It also allows the team to compare their current performance with their historical data to track improvements or identify regressions.

Conclusion

The rising adoption of DORA metrics in DevOps marks a significant shift towards data-driven software delivery practices. Integrating these metrics with operations, tools, and cultural frameworks enhances agility and resilience. It is crucial to stay ahead of the curve by keeping an eye on trends, embracing automation, and promoting continuous improvement to effectively harness DORA metrics to drive innovation and achieve sustained success.

How Typo Uses DORA Metrics to Boost Efficiency?

DORA metrics are a compass for engineering teams striving to optimise their development and operations processes.

Consistently tracking these metrics can lead to significant and lasting improvements in your software delivery processes and overall business performance.

Below is a detailed guide on how Typo uses DORA to improve DevOps performance and boost efficiency:

What are DORA Metrics?

In 2015, the DORA (DevOps Research and Assessment) team was founded by Gene Kim, Jez Humble, and Nicole Forsgren to evaluate and improve software development practices. The aim was to better understand how organisations can deliver software faster, more reliably, and with higher quality.

They developed DORA metrics that provide insights into the performance of DevOps practices and help organisations improve their software development and delivery processes. These metrics help in finding answers to these two questions:

  • How can organisations identify their elite performers?
  • What should low-performing teams focus on?

The Four DORA Metrics

DORA metrics help in assessing software delivery performance based on four key (or accelerate) metrics:

  • Deployment Frequency
  • Lead Time for Changes
  • Change Failure Rate
  • Mean Time to Recover

Deployment Frequency

Deployment Frequency measures the number of times that code is deployed into production. It helps in understanding the team’s throughput and quantifying how much value is delivered to customers.

When organizations achieve a high Deployment Frequency, they can enjoy rapid releases without compromising the software’s robustness. This can be a powerful driver of agility and efficiency, making it an essential component for software development teams.

One deployment per week is standard. However, it also depends on the type of product.

Why is it Important?

  • It provides insights into the overall efficiency and speed of the DevOps team’s processes.
  • It helps in identifying pitfalls and areas for improvement in the software development life cycle.
  • It helps in making data-driven decisions to optimise the process.
  • It helps in understanding the impact of changes on system performance.

Lead Time for Changes

Lead Time for Changes measures the time it takes for code changes to move from inception to deployment. The measurement of this metric offers valuable insights into the effectiveness of development processes, deployment pipelines, and release strategies.

By analysing the Lead Time for Changes, development teams can identify bottlenecks in the delivery pipeline and streamline their workflows to improve the overall speed and efficiency of software delivery. A shorter lead time indicates that the DevOps team is more efficient in deploying code.

Why is it Important?

  • It helps organisations gather feedback and validate assumptions quickly, leading to informed decision-making and aligning software development with customer needs.
  • It helps organizations gain agility and adaptability, allowing them to swiftly respond to market changes, embrace new technologies, and meet evolving business needs.
  • It enables experimentation, learning, and continuous improvement, empowering organizations to stay competitive in dynamic environments.
  • It demands collaborative teamwork, breaking silos, fostering shared ownership, and improving communication, coordination, and efficiency.

Change Failure Rate

Change Failure Rate gauges the percentage of changes that require hot fixes or other remediation after production. It reflects the stability and reliability of the entire software development and deployment lifecycle.

By tracking CFR, teams can identify bottlenecks, flaws, or vulnerabilities in their processes, tools, or infrastructure that can negatively impact the quality, speed, and cost of software delivery.

A CFR between 0% and 15% is considered a good indicator of your code quality.

Why is it Important?

  • It enhances user experience and builds trust by reducing failures.
  • It protects your business from financial risks, helping to avoid revenue loss, customer churn, and brand damage.
  • It helps in allocating resources effectively and focuses on delivering new features.
  • It ensures changes are implemented smoothly and with minimal disruption.

Mean Time to Recovery

Mean Time to Recovery measures how quickly a team can bounce back from incidents or failures. It concentrates on determining the efficiency and effectiveness of an organisation’s incident response and resolution procedures.

A lower mean time to recovery is synonymous with a resilient system capable of handling challenges effectively.

Recovery time should be as short as possible; 24 hours is considered a good rule of thumb.

Why is it Important?

  • It enhances user satisfaction by reducing downtime and resolution times.
  • It mitigates the negative impacts of downtime on business operations, including financial losses, missed opportunities, and reputational damage.
  • It helps meet service level agreements (SLAs) that are vital for upholding client trust and fulfilling contractual commitments.
  • It provides valuable insights into day-to-day practices such as incident management and engineering team performance, and helps elevate customer satisfaction.

The Fifth Metric: Reliability

Reliability is a fifth metric that was added by the DORA team in 2021. It measures modern operational practices and doesn’t have standard quantifiable targets for performance levels.

Reliability comprises several metrics used to assess operational performance, including availability, latency, performance, and scalability, which measure user-facing behaviour against software SLAs, performance targets, and error budgets.

How Typo Uses DORA to Boost Dev Efficiency?

Typo is an effective software engineering intelligence platform that offers SDLC visibility, developer insights, and workflow automation to build better programs faster. It offers comprehensive insights into the deployment process through key DORA metrics such as change failure rate, time to build, and deployment frequency.

Below is a detailed view of how Typo uses DORA to boost dev efficiency and team performance:

DORA Metrics Dashboard

Typo’s DORA metrics dashboard has a user-friendly interface and robust features tailored for DevOps excellence. It helps in identifying bottlenecks, improving collaboration between teams, optimising delivery speed, and effectively communicating the team’s success.

The DORA metrics dashboard pulls in data from all the sources and presents it in a visualised, detailed way to engineering leaders and the development team.

The DORA metrics dashboard helps in many ways:

  • With pre-built integrations across the dev tool stack, the DORA dashboard gets all the relevant data flowing in within minutes.
  • It helps in deep-diving into and correlating different metrics to identify real-time bottlenecks, sprint delays, blocked PRs, deployment efficiency, and much more from a single dashboard.
  • The dashboard supports custom improvement goals for each team and tracks their success in real time.
  • It gives real-time visibility into a team’s KPIs and lets them make informed decisions.

How to Build your DORA Metrics Dashboard?

Define your objectives

Firstly, define clear and measurable objectives. Consider KPIs that align with your organisational goals. Whether it’s improving deployment speed, reducing failure rates, or enhancing overall efficiency, having a well-defined set of objectives will help guide your implementation of the dashboard.

Understanding DORA metrics

Gain a deeper understanding of DORA metrics by exploring the nuances of Deployment Frequency, Lead Time, Change Failure Rate, and MTTR. Then, connect each of these metrics with your organisation’s DevOps goals to have a comprehensive understanding of how they contribute towards improving overall performance and efficiency.

Dashboard configuration

Follow specific guidelines to properly configure your dashboard. Customise the widgets to accurately represent important metrics and personalise the layout to create a clear and intuitive visualisation of your data. This ensures that your team can easily interpret the insights provided by the dashboard and take appropriate actions.

Implementing data collection mechanisms

To ensure the accuracy and reliability of your DORA Metrics, establish strong data collection mechanisms. Configure your dashboard to collect real-time data from relevant sources, so that the metrics reflect the current state of your DevOps processes.

Integrating automation tools

Integrate automation tools to optimise the performance of your DORA Metrics Dashboard.

By utilising automation for data collection, analysis, and reporting processes, you can streamline routine tasks. This will free up your team’s time and allow them to focus on making strategic decisions and improvements.

Utilising the dashboard effectively

To get the most out of your well-configured DORA Metrics Dashboard, use the insights gained to identify bottlenecks, streamline processes, and improve overall DevOps efficiency. Analyse the dashboard data regularly to drive continuous improvement initiatives and make informed decisions that will positively impact your software development lifecycle.

Comprehensive Visualization of Key Metrics

Typo’s dashboard provides clear and intuitive visualisations of the four key DORA metrics:

Deployment Frequency

It tracks how often new code is deployed to production, highlighting the team’s productivity.

By integrating with your CI/CD tool, Typo calculates Deployment Frequency by counting the number of unique production deployments within the selected time range. You can configure which workflows and repositories correspond to production.

Cycle Time (Lead Time for Changes)

It measures the time it takes from code being committed to it being deployed in production, indicating the efficiency of the development pipeline.

In the context of Typo, it is the average time all pull requests have spent in the “Coding”, “Pickup”, “Review”, and “Merge” stages of the pipeline. Typo considers all the merged Pull Requests to the main/master/production branch for the selected time range and calculates the average time spent by each Pull Request in every stage of the pipeline. Open and draft Pull Requests are not considered in this calculation.
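
As a rough sketch of that stage-level averaging (a simplified approximation, not Typo's actual implementation), the example below assumes each merged PR record carries timestamps for first commit, PR opened, first review, merge, and deployment, and maps them onto the four stages. The field names are illustrative, not a real Typo schema.

```python
from datetime import datetime, timedelta

# Assumed PR records; the field names are illustrative, not a real Typo schema.
merged_prs = [
    {
        "first_commit": datetime(2024, 6, 1, 9),
        "opened":       datetime(2024, 6, 1, 17),
        "first_review": datetime(2024, 6, 2, 11),
        "merged":       datetime(2024, 6, 2, 15),
        "deployed":     datetime(2024, 6, 3, 10),
    },
]

def average_stage_times(prs: list[dict]) -> dict:
    """Average time per pipeline stage across merged PRs."""
    totals = {"Coding": timedelta(), "Pickup": timedelta(), "Review": timedelta(), "Merge": timedelta()}
    for pr in prs:
        totals["Coding"] += pr["opened"] - pr["first_commit"]   # commit -> PR opened
        totals["Pickup"] += pr["first_review"] - pr["opened"]   # PR opened -> first review
        totals["Review"] += pr["merged"] - pr["first_review"]   # first review -> merge
        totals["Merge"]  += pr["deployed"] - pr["merged"]       # merge -> production deploy
    return {stage: total / len(prs) for stage, total in totals.items()}

for stage, avg in average_stage_times(merged_prs).items():
    print(stage, avg)
# Coding 8:00:00, Pickup 18:00:00, Review 4:00:00, Merge 19:00:00
```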

Change Failure Rate

It shows the percentage of deployments causing a failure in production, reflecting the quality and stability of releases.

There are multiple ways this metric can be configured:

  • A deployment that needs a rollback or a hotfix: For such cases, any Pull Request having a title/tag/label that represents a rollback/hotfix that is merged to production can be considered as a failure.
  • A high-priority production incident: For such cases, any ticket in your Issue Tracker having a title/tag/label that represents a high-priority production incident can be considered as a failure.
  • A deployment that failed during the production workflow: For such cases, Typo can integrate with your CI/CD tool and consider any failed deployment as a failure.

To calculate the final percentage, the total number of failures is divided by the total number of deployments (which can be taken either from the Deployment PRs or from the CI/CD tool deployments).
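
A simplified sketch of the first configuration above combined with that final division (illustrative only; the label names and record shape are assumptions, not Typo's schema):

```python
FAILURE_LABELS = {"rollback", "hotfix"}  # assumed label conventions for failure PRs

def change_failure_rate(merged_prs: list[dict], total_deployments: int) -> float:
    """Percentage of deployments that needed a rollback or hotfix PR."""
    failures = sum(
        1 for pr in merged_prs
        if FAILURE_LABELS & {label.lower() for label in pr.get("labels", [])}
    )
    return 100 * failures / total_deployments if total_deployments else 0.0

prs = [{"labels": ["feature"]}, {"labels": ["Hotfix"]}, {"labels": []}]
print(change_failure_rate(prs, total_deployments=10))  # 10.0, i.e. a 10% CFR
```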

Mean Time to Restore (MTTR)

It measures the time taken to recover from a failure, showing the team’s ability to respond to and fix issues.

The way a team tracks production failures (CFR) defines how MTTR is calculated for that team. If a team tracks production failures using:

  • Pull Request tagging to track a deployment that needs a rollback or a hotfix: In such a case, MTTR is calculated as the time from the last deployment until such a Pull Request is merged to main/master/production.
  • Ticket tagging for high-priority production incidents: In such a case, MTTR is calculated as the average time such a ticket takes from the ‘In Progress’ state to the ‘Done’ state (a simplified version of this calculation is sketched after this list).
  • CI/CD integration to track deployments that failed during the production workflow: In such a case, MTTR is calculated as the average time from the deployment failure until the change is successfully deployed.
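
A minimal sketch of the ticket-based calculation referenced above, assuming each incident ticket exposes the timestamps at which it entered ‘In Progress’ and reached ‘Done’ (field names are illustrative, not a fixed schema):

```python
from datetime import datetime, timedelta

# Assumed high-priority incident tickets with their state-transition timestamps.
incident_tickets = [
    {"in_progress_at": datetime(2024, 6, 5, 16, 0), "done_at": datetime(2024, 6, 5, 19, 30)},
    {"in_progress_at": datetime(2024, 6, 9, 8, 0),  "done_at": datetime(2024, 6, 9, 12, 30)},
]

def mean_time_to_restore(tickets: list[dict]) -> timedelta:
    """Average 'In Progress' -> 'Done' duration across incident tickets."""
    if not tickets:
        return timedelta()
    total = sum((t["done_at"] - t["in_progress_at"] for t in tickets), timedelta())
    return total / len(tickets)

print(mean_time_to_restore(incident_tickets))  # 4:00:00, i.e. four hours on average
```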

Benchmarking for Context

  • Industry Standards: By providing benchmarks, Typo allows teams to compare their performance against industry standards, helping them understand where they stand.
  • Historical Performance: Teams can also compare their current performance with their historical data to track improvements or identify regressions.

How Does it Help Engineering Leaders?

  • Typo provides a clear, data-driven view of software development performance. It offers insights into various aspects of development and operational processes.
  • It helps in tracking progress over time. Through continuous tracking, it monitors improvements or regressions in a team’s performance.
  • It supports DevOps practices that focus on both development speed and operational stability.
  • DORA metrics help in mitigating risk. With the help of CFR and MTTR, engineering leaders can manage and lower risk, ensuring more stability and reliability associated with software changes.
  • It identifies bottlenecks and inefficiencies and pinpoints where the team is struggling such as longer lead times or high failure rates.

How Does it Help Development Teams?

  • Typo provides a clear, real-time view of a team’s performance and lets the team make informed decisions based on empirical data rather than guesswork.
  • It encourages balance between speed and quality by providing metrics that highlight both aspects.
  • It helps in predicting future performance based on historical data. This helps in better planning and resource allocation.
  • It helps in identifying potential risks early and taking proactive measures to mitigate them.

Conclusion

DORA metrics deliver crucial insights into team performance. Monitoring Change Failure Rate and Mean Time to Recovery helps leaders ensure their teams are building resilient services with minimal downtime. Similarly, keeping an eye on Deployment Frequency and Lead Time for Changes assures engineering leaders that the team is maintaining a swift pace.

Together, these metrics offer a clear picture of how well the team balances speed and quality in their workflows.

DevOps Metrics Mistakes to Avoid in 2024

As DevOps practices continue to evolve, it’s crucial for organizations to effectively measure DevOps metrics to optimize performance.

Here are a few common mistakes to avoid when measuring these metrics to ensure continuous improvement and successful outcomes:

DevOps Landscape in 2024

In 2024, the landscape of DevOps metrics continues to evolve, reflecting the growing maturity and sophistication of DevOps practices. The emphasis is to provide actionable insights into the development and operational aspects of software delivery.

The integration of AI and machine learning (ML) in DevOps has become increasingly significant in transforming how teams monitor, manage, and improve their software development and operations processes. Apart from this, observability and real-time monitoring have become critical components of modern DevOps practices in 2024. They provide deep insights into system behavior and performance and are enhanced significantly by AI and ML technologies.

Lastly, organizations are prioritizing comprehensive, real-time, and predictive security metrics to enhance their security posture and ensure robust incident response mechanisms.

Importance of Measuring DevOps Metrics

DevOps metrics track both technical capabilities and team processes. They reveal the performance of a DevOps software development pipeline and help to identify and remove any bottlenecks in the process in the early stages.

Below are a few benefits of measuring DevOps metrics:

  • Metrics enable teams to identify bottlenecks, inefficiencies, and areas for improvement. By continuously monitoring these metrics, teams can implement iterative changes and track their effectiveness.
  • DevOps metrics help in breaking down silos between development, operations, and other teams by providing a common language and set of goals. It improves transparency and visibility into the workflow and fosters better collaboration and communication.
  • Metrics ensure the team’s efforts are aligned with customer needs and expectations. Faster and more reliable releases contribute to better customer experiences and satisfaction.
  • DevOps metrics provide objective data that can be used to make informed decisions rather than relying on intuition or subjective opinions. This data-driven approach helps prioritize tasks and allocate resources effectively.
  • DevOps metrics allow teams to set benchmarks and track progress against them. Clear goals and measurable targets motivate teams and provide a sense of achievement when milestones are reached.

Common Mistakes to Avoid when Measuring DevOps Metrics

Not Defining Clear Objectives

When clear objectives are not defined, development teams may measure metrics that do not directly contribute to strategic goals. This scatters effort: teams can hit high numbers on certain metrics without contributing meaningfully to overall business objectives. The resulting data may not yield actionable insights, so decisions end up being based on incomplete or misleading information. A lack of clear objectives also makes it hard to evaluate performance accurately or to tell whether it is meeting expectations or falling short.

Solutions

Below are a few ways to define clear objectives for DevOps metrics:

  • Start by understanding the high-level business goals. Engage with stakeholders to identify what success looks like for the organization.
  • Based on the business goals, identify specific KPIs that can measure progress towards these goals.
  • Ensure that objectives are Specific, Measurable, Achievable, Relevant, and Time-bound (SMART). For example, “Reduce the average lead time for changes from 5 days to 3 days within the next quarter.”
  • Choose metrics that directly measure progress toward the objectives.
  • Regularly review the objectives and the metrics to ensure they remain aligned with evolving business goals and market conditions. Adjust them as needed to reflect new priorities or insights.

Prioritizing Speed over Quality

Organizations often prioritize delivering products quickly over delivering them well, but speed and quality must work hand in hand: DevOps work should reach end users on time while maintaining high standards. Development teams frequently face intense pressure to ship products or updates rapidly to stay competitive, which can lead them to focus excessively on speed metrics, such as deployment frequency or lead time for changes, at the expense of quality metrics.

Solutions

  • Clearly define quality goals alongside speed goals. This involves setting targets for reliability, performance, security, and user experience metrics that are equally important as delivery speed metrics.
  • Implement continuous feedback loops throughout the DevOps process such as feedback from users, automated testing, monitoring, and post-release reviews.
  • Invest in automation and tooling that accelerates delivery as well as enhances quality. Automated testing, continuous integration, and continuous deployment (CI/CD) pipelines can help in achieving both speed and quality goals simultaneously.
  • Educate teams about the importance of balancing speed and quality in DevOps practices.
  • Regularly review and refine metrics based on the evolving needs of the organization and the feedback received from customers and stakeholders.

Tracking Too Much at Once

It is often assumed that the more metrics you track, the better you will understand your DevOps processes. In practice, this produces an overwhelming number of metrics, most of them redundant or not directly actionable. It usually happens when there is no clear strategy or prioritization framework, so teams try to measure everything and end up with data that is difficult to manage and interpret. Teams may also track numerous metrics simply to appear thorough, even when those metrics are not particularly meaningful.

Solutions

  • Identify and focus on a few key metrics that are most relevant to your business goals and DevOps objectives.
  • Align your metrics with clear objectives to ensure you are tracking the most impactful data. For example, if your goal is to improve deployment frequency and reliability, focus on metrics like deployment frequency, lead time for changes, and mean time to recovery.
  • Review the metrics you are tracking to determine their relevance and effectiveness. Remove metrics that do not provide value or are redundant.
  • Foster a culture that values the quality and relevance of metrics over the sheer quantity.
  • Use visualizations and summaries to highlight the most important data, making it easier for stakeholders to grasp the critical information without being overwhelmed by the volume of metrics.

Rewarding Performance

Engineering leaders often believe that rewarding performance will motivate developers to work harder and achieve better results. In practice, tying rewards to specific metrics can lead to an overemphasis on those metrics at the expense of other important aspects of work. For example, focusing solely on deployment frequency might lead to neglecting code quality or thorough testing. It can also produce short-term improvements at the cost of long-term problems such as burnout, reduced intrinsic motivation, and a decline in overall quality. Worse, developers may manipulate metrics or take shortcuts to achieve rewarded outcomes, compromising the integrity of the process and the quality of the product.

Solutions

  • Cultivate an environment where teams are motivated by the satisfaction of doing good work rather than external rewards.
  • Recognize and appreciate good work through non-monetary means such as public acknowledgment, opportunities for professional development, and increased autonomy.
  • Instead of rewarding individual performance, measure and reward team performance.
  • Encourage knowledge sharing, pair programming, and cross-functional teams to build a cooperative work environment.
  • If rewards are necessary, align them with long-term goals rather than short-term performance metrics.

Lack of Continuous Integration and Testing

Without continuous integration and testing, bugs and defects are more likely to go undetected until later stages of development or production, leading to higher costs and more effort to fix issues. It compromises the quality of the software, resulting in unreliable and unstable products that can damage the organization’s reputation. Moreover, it can result in slower progress over time due to the increased effort required to address accumulated technical debt and defects.

Solutions

  • Allocate resources to implement CI/CD pipelines and automated testing frameworks.
  • Invest in training and upskilling team members on CI/CD practices and tools.
  • Begin with small, incremental implementations of CI and testing. Gradually expand the scope as the team becomes more comfortable and proficient with the tools and processes.
  • Foster a culture that values quality and continuous improvement. Encourage collaboration between development and operations teams to ensure that CI and testing are seen as essential components of the development process.
  • Use automation to handle repetitive and time-consuming tasks such as building, testing, and deploying code (a minimal sketch follows this list). This reduces manual effort and increases efficiency.
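
To make the automation point above concrete, here is a minimal, tool-agnostic sketch of a script that runs build and test steps in order and fails fast on the first error, the way a CI job would. The commands shown are hypothetical placeholders, not a specific product's configuration.

    import subprocess
    import sys

    # Hypothetical CI steps: install dependencies, then run the automated tests.
    steps = [
        ["python", "-m", "pip", "install", "-r", "requirements.txt"],
        ["python", "-m", "pytest", "-q"],
    ]

    for step in steps:
        print("Running:", " ".join(step))
        result = subprocess.run(step)
        if result.returncode != 0:
            # Fail the pipeline early so broken changes never reach deployment.
            sys.exit(result.returncode)

    print("All CI steps passed.")

In practice this logic lives inside a CI/CD platform rather than a hand-rolled script, but the fail-fast principle is the same.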

Key DevOps Metrics to Measure

Below are a few important DevOps metrics:

Deployment Frequency

Deployment Frequency measures the frequency of code deployment to production and reflects an organization’s efficiency, reliability, and software delivery quality. It is often used to track the rate of change in software development and highlight potential areas for improvement.
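
As a rough illustration (not any particular tool's implementation), Deployment Frequency can be derived from a list of production deployment dates; the data below is hypothetical.

    from datetime import date

    # Hypothetical production deployment dates for one team.
    deployments = [date(2024, 6, 3), date(2024, 6, 5), date(2024, 6, 12),
                   date(2024, 6, 18), date(2024, 6, 24)]

    # Number of deployments divided by the number of weeks in the period.
    period_weeks = max((max(deployments) - min(deployments)).days / 7, 1)
    deployment_frequency = len(deployments) / period_weeks
    print(f"Deployments per week: {deployment_frequency:.1f}")  # ~1.7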

Lead Time for Changes

Lead Time for Changes is a critical metric used to measure the efficiency and speed of software delivery. It is the duration between a code change being committed and its successful deployment to end-users. This metric is a good indicator of the team’s capacity, code complexity, and efficiency of the software development process.
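
A minimal sketch of the calculation, assuming you can pair each change's commit time with its deployment time (the timestamps below are hypothetical):

    from datetime import datetime, timedelta

    # Hypothetical (commit_time, deploy_time) pairs for recent changes.
    changes = [
        (datetime(2024, 6, 3, 9, 0),   datetime(2024, 6, 4, 15, 0)),
        (datetime(2024, 6, 5, 11, 0),  datetime(2024, 6, 7, 10, 0)),
        (datetime(2024, 6, 10, 14, 0), datetime(2024, 6, 11, 9, 0)),
    ]

    # Average of the commit-to-deploy durations.
    total = sum((deployed - committed for committed, deployed in changes), timedelta())
    lead_time = total / len(changes)
    print(f"Average lead time for changes: {lead_time}")  # 1 day, 8:00:00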

Change Failure Rate

Change Failure Rate measures the frequency at which newly deployed changes lead to failures, glitches, or unexpected outcomes in the IT environment. It reflects the stability and reliability of the entire software development and deployment lifecycle. It is related to team capacity, code complexity, and process efficiency, impacting speed and quality.
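
As a simple worked example with hypothetical counts, Change Failure Rate is the share of deployments that led to a rollback, hotfix, or incident:

    # Hypothetical counts over one month.
    deployments_total = 40
    deployments_failed = 6  # deployments that caused a rollback, hotfix, or incident

    change_failure_rate = deployments_failed / deployments_total * 100
    print(f"Change Failure Rate: {change_failure_rate:.1f}%")  # 15.0%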

Mean Time to Recover

Mean Time to Recover is a valuable metric that calculates the average duration taken by a system or application to recover from a failure or incident. It is an essential component of the DORA metrics and concentrates on determining the efficiency and effectiveness of an organization’s incident response and resolution procedures.
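
A minimal sketch of the calculation, assuming you record when each incident was detected and when service was restored (the incidents below are hypothetical):

    from datetime import datetime, timedelta

    # Hypothetical (detected, restored) timestamps for two incidents.
    incidents = [
        (datetime(2024, 6, 2, 10, 15), datetime(2024, 6, 2, 11, 5)),   # 50 minutes
        (datetime(2024, 6, 9, 22, 40), datetime(2024, 6, 10, 0, 10)),  # 90 minutes
    ]

    downtime = sum((restored - detected for detected, restored in incidents), timedelta())
    mttr = downtime / len(incidents)
    print(f"Mean Time to Recover: {mttr}")  # 1:10:00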

Conclusion

Optimizing DevOps practices requires avoiding common mistakes when measuring metrics. Specialized tools like Typo can simplify the measurement process and help enhance organizational performance: it offers customized DORA metrics and other engineering metrics that can be configured in a single dashboard.

Top Platform Engineering Tools (2024)

Platform engineering tools empower developers by enhancing their overall experience. By eliminating bottlenecks and reducing daily friction, these tools enable developers to accomplish tasks more efficiently. This efficiency translates into improved cycle times and higher productivity.

In this blog, we explore top platform engineering tools, highlighting their strengths and demonstrating how they benefit engineering teams.

What is Platform Engineering?

Platform engineering is an emerging technology approach that equips software engineering teams with the resources they need to automate the software development lifecycle end to end. The goal is to reduce overall cognitive load, enhance operational efficiency, and remove process bottlenecks by providing a reliable and scalable platform for building, deploying, and managing applications.

Importance of Platform Engineering

  • Platform engineering involves creating reusable components and standardized processes. It also automates routine tasks, such as deployment, monitoring, and scaling, to speed up the development cycle.
  • Platform engineers integrate security measures into the platform to ensure that applications are built and deployed securely. They help ensure that the platform meets regulatory and compliance requirements.
  • It ensures efficient use of resources to balance performance and expenditure. It also provides transparency into resource usage and associated costs to help organizations make informed decisions about scaling and investment.
  • By providing tools, frameworks, and services, platform engineers empower developers to build, deploy, and manage applications more effectively.
  • A well-engineered platform allows organizations to adapt quickly to market changes, new technologies, and customer needs.

Best Platform Engineering Tools

Typo

Typo is an effective software engineering intelligence platform that offers SDLC visibility, developer insights, and workflow automation to help teams build better software faster. It integrates seamlessly with existing tech stacks, including Git version control, issue trackers, and CI/CD tools.

It also offers comprehensive insights into the deployment process through key metrics such as change failure rate, time to build, and deployment frequency. Moreover, its automated code tool helps identify issues in the code and auto-fix them before you merge to master.

Typo also has an effective sprint analysis feature that tracks and analyzes the team’s progress throughout a sprint. Besides this, it provides a 360° view of the developer experience, capturing qualitative insights and offering an in-depth view of the real issues.

Kubernetes

An open-source container orchestration platform. Kubernetes is used to automate the deployment, scaling, and management of containerized applications.

Kubernetes is especially useful for applications made up of many containers; developers can isolate and package container clusters to be deployed on several machines simultaneously.

Through Kubernetes, engineering leaders can create containers automatically and schedule them across machines based on demand and scaling needs. Kubernetes also handles tasks like load balancing, scaling, and service discovery for efficient resource utilization, simplifies infrastructure management, and allows CI/CD pipelines to be customized to developers’ needs.
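
As a small illustration of this kind of automation, the sketch below scales a deployment's replica count programmatically. It assumes the official kubernetes Python client, a working kubeconfig, and a hypothetical deployment named "web" in the "default" namespace; it is a minimal example, not a recommended production setup.

    from kubernetes import client, config

    # Load credentials from the local kubeconfig (e.g. ~/.kube/config).
    config.load_kube_config()
    apps = client.AppsV1Api()

    # Patch only the scale subresource so the rest of the spec is untouched.
    apps.patch_namespaced_deployment_scale(
        name="web",                        # hypothetical deployment name
        namespace="default",
        body={"spec": {"replicas": 5}},    # desired replica count
    )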

Jenkins

An open-source automation server and CI/CD tool. Jenkins is a self-contained Java-based program that can run out of the box.

It offers an extensive plug-in system to support building and deploying projects. Jenkins supports distributing build jobs across multiple machines, which helps in handling large-scale projects efficiently. It can be seamlessly integrated with various version control systems like Git, Mercurial, and CVS, and with communication tools such as Slack and Jira.

GitHub Actions

A powerful platform engineering tool that automates software development workflows directly from GitHub. GitHub Actions can handle routine development tasks such as code compilation, testing, and packaging, standardizing these processes and making them more efficient.

Teams can create custom workflows to automate various tasks and manage blue-green deployments for smooth, controlled application releases.

GitHub Actions allows engineering teams to easily deploy to any cloud, create tickets in Jira, or publish packages.

GitLab CI

GitLab CI can use Auto DevOps to automatically build, test, deploy, and monitor applications. It uses Docker images to define the environments in which CI/CD jobs run, and it can build and publish those images within pipelines. It supports parallel job execution, allowing multiple tasks to run concurrently to speed up build and test processes.

GitLab CI provides caching and artifact management capabilities to optimize build times and preserve build outputs for downstream processes. It can be integrated with various third-party applications including CircleCI, Codefresh, and YouTrack.

AWS CodePipeline

A continuous delivery platform provided by Amazon Web Services (AWS). AWS CodePipeline automates the release pipeline and accelerates the workflow with parallel execution.

It offers high-level visibility and control over the build, test, and deploy processes. It can be integrated with other AWS tools such as AWS CodeBuild, AWS CodeDeploy, and AWS Lambda, as well as third-party tools like GitHub, Jenkins, and Bitbucket.

AWS CodePipeline can also be configured to send notifications for pipeline events, helping teams stay informed about the state of deployments.
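
For illustration, the sketch below triggers a pipeline run and inspects the state of each stage. It assumes boto3 with valid AWS credentials and a hypothetical pipeline named "release-pipeline"; it is a minimal example rather than a complete integration.

    import boto3

    codepipeline = boto3.client("codepipeline")

    # Start a new execution of the (hypothetical) release pipeline.
    response = codepipeline.start_pipeline_execution(name="release-pipeline")
    print("Started execution:", response["pipelineExecutionId"])

    # Inspect the current status of each stage (source, build, deploy, ...).
    state = codepipeline.get_pipeline_state(name="release-pipeline")
    for stage in state["stageStates"]:
        status = stage.get("latestExecution", {}).get("status")
        print(stage["stageName"], status)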

Argo CD

A GitOps-based continuous deployment tool for Kubernetes applications. Argo CD allows code changes to be deployed directly to Kubernetes resources.

It simplifies the management of complex application deployments and promotes a self-service approach for developers. Argo CD can define and automate Kubernetes (K8s) clusters to suit team needs and supports multi-cluster setups for managing multiple environments.

It can seamlessly integrate with third-party tools such as Jenkins, GitHub, and Slack. Moreover, it supports multiple formats for creating Kubernetes manifests, such as plain YAML files and Helm charts.

Azure DevOps Pipeline

A CI/CD tool offered by Microsoft Azure. It supports building, testing, and deploying applications using CI/CD pipelines within the Azure DevOps ecosystem.

Azure DevOps Pipeline lets engineering teams define complex workflows that handle tasks like compiling code, running tests, building Docker images, and deploying to various environments. It can automate the software delivery process, reducing manual intervention, and seamlessly integrates with other Azure services, such as Azure Repos, Azure Artifacts, and Azure Kubernetes Service (AKS).

Moreover, it empowers DevSecOps teams with a self-service portal for accessing tools and workflows.

Terraform

An Infrastructure as Code (IaC) tool. Terraform is a well-known cloud-native platform in the software industry that supports multiple cloud providers and infrastructure technologies.

Terraform can quickly and efficiently manage complex infrastructure and centralize infrastructure management. It integrates seamlessly with providers like Oracle Cloud, AWS, OpenStack, Google Cloud, and many more.

It speeds up the core processes development teams need to follow. Moreover, Terraform can automate security by enforcing policy as code.

Heroku

A platform-as-a-service (PaaS) based on a managed container system. Heroku enables developers to build, run, and operate applications entirely in the cloud and automates the setup of development, staging, and production environments by configuring infrastructure, databases, and applications consistently.

It supports multiple deployment methods, including Git, GitHub integration, Docker, and Heroku CLI, and includes built-in monitoring and logging features to track application performance and diagnose issues.

CircleCI

A popular continuous integration/continuous delivery (CI/CD) tool that allows software engineering teams to build, test, and deploy software using intelligent automation. CI can be hosted under its cloud-managed option.

CircleCI is GitHub-friendly and includes an extensive API for custom integrations. It supports parallelism, i.e., splitting tests across different containers so they run as clean, separate builds. It can also be configured to run complex pipelines.

CircleCI has a built-in caching feature that speeds up builds by storing dependencies and other frequently used files, reducing the need to re-download or recompile them for subsequent builds.

How to Choose the Right Platform Engineering Tools?

Know your Requirements

Understand what specific problems or challenges the tools need to solve. This could include scalability, automation, security, compliance, etc. Consider inputs from stakeholders and other relevant teams to understand their requirements and pain points.

Evaluate Core Functionalities

List out the essential features and capabilities needed in platform engineering tools. The tools must also integrate well with your existing infrastructure, development methodologies (like Agile or DevOps), and technology stack.

Security and Compliance

Check if the tools have built-in security features or support integration with security tools for vulnerability scanning, access control, encryption, etc. The tools must comply with relevant industry regulations and standards applicable to your organization.

Documentation and Support

Check the availability and quality of documentation, tutorials, and support resources. Good support can significantly reduce downtime and troubleshooting efforts.

Flexibility

Choose tools that are flexible and adaptable to future technology trends and changes in the organization’s needs. The tools must integrate smoothly with the existing toolchain, including development frameworks, version control systems, databases, and cloud services.

Proof of Concept (PoC)

Conduct a pilot or proof of concept to test how well the tools perform in your environment. This allows you to validate their suitability before committing to full deployment.

Conclusion

Platform engineering tools play a crucial role in the IT industry by enhancing the experience of software developers. They streamline workflows, remove bottlenecks, and reduce friction within developer teams, thereby enabling more efficient task completion and fostering innovation across the software development lifecycle.
