Everything fails; that’s a fact. Not even the best system has 100% reliability, right? But then, how many times have you seen Google being unavailable? I’ve never seen that, and if for some reason Google wouldn’t load, I’d blame my internet connection. There’s no way it’s Google’s fault. Believe it or not, Google doesn’t have 100% reliability, either. And if even they embrace failure, you should too. Just use your error budget wisely.
What is an error budget?
You already know that 100% reliability is not something achievable, but you may still want to strive for perfection. There’s no point in that, to be honest. Extreme reliability slows down the pace of delivering features. On one hand, you may want your system to be as reliable as possible, but on the other hand, you will clearly see more value in features that contribute to your business. And honestly, a lot of technical work often doesn’t have that clear business value, so it may sometimes get pushed aside by the product owner who wants new features delivered ASAP.
But let’s get back to numbers. If 100% reliability is not doable, then maybe 99.999% is?
It may be, but keep in mind that most users are limited by the reliability of their cellular network or the device they’re using. If they’re using a device with 99% reliability, they really won’t notice a difference between 99.99% and 99.999% service reliability. High reliability is a must; extreme reliability isn’t. Instead of focusing on being as close to 100% as possible, you can think about how to recover from failure. Systems often fail when they are changed, but not changing anything isn’t the answer, either. Your users need updates. What you need to know, though, is when to change.
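The claim about the 99% device can be checked with a bit of arithmetic. This is a minimal sketch, assuming the device and the service fail independently, so that the availability the user experiences is the product of the two; the function name `perceived_availability` is made up for illustration.

```python
def perceived_availability(device: float, service: float) -> float:
    """Availability seen by the user, assuming device and service fail independently."""
    return device * service


device = 0.99  # the 99% device from the paragraph above
four_nines = perceived_availability(device, 0.9999)
five_nines = perceived_availability(device, 0.99999)

print(f"with a 99.99% service:  {four_nines:.5%}")
print(f"with a 99.999% service: {five_nines:.5%}")
print(f"gain: {(five_nines - four_nines) * 100:.5f} percentage points")
```

Under that assumption, the extra nine improves what the user sees by less than 0.01 percentage points, because the device dominates the result.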
In the Site Reliability Engineering (SRE) book, Google suggests balancing the risk of unavailability against the goals of rapid innovation and efficient service operations, which in turn translates into service optimization and user satisfaction.
Ok, but what is this error budget thing?
When you measure service risk, you need to somehow assess the performance of the system and track improvements – and the focus is usually on unplanned downtime.
In the Site Reliability Engineering book, unplanned downtime is defined as follows:
Unplanned downtime is captured by the desired level of service availability, usually expressed in terms of the number of “nines” we would like to provide: 99.9%, 99.99%, or 99.999% availability
The usual way of calculating service availability is by observing its uptime and unplanned downtime:
availability = uptime / (uptime + downtime)
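To make the time-based formula concrete, here is a small sketch that computes availability from uptime and downtime and turns a number of “nines” into the downtime it allows. The 90-day quarter and the function names are assumptions chosen for illustration.

```python
from datetime import timedelta

QUARTER = timedelta(days=90)  # assumption: a quarter is treated as 90 days


def availability(uptime: timedelta, downtime: timedelta) -> float:
    """availability = uptime / (uptime + downtime)"""
    return uptime / (uptime + downtime)


def allowed_downtime(target: float, window: timedelta = QUARTER) -> timedelta:
    """Unplanned downtime that still meets the availability target over the window."""
    return window * (1 - target)


# One hour of downtime in a 90-day quarter.
print(f"{availability(QUARTER - timedelta(hours=1), timedelta(hours=1)):.4%}")

for target in (0.999, 0.9999, 0.99999):
    seconds = round(allowed_downtime(target).total_seconds())
    print(f"{target:.3%} allows about {seconds} seconds of downtime per quarter")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the higher targets get expensive quickly.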
However, in the SRE book, Google suggests using a different metric and defining availability in terms of request success rate:
availability = successful requests / total requests
Once you measure availability this way, you should define your Service Level Objectives (SLOs). An SLO is the target share of requests that must succeed; whatever remains is the share of requests you allow to fail, and that is your error budget. This means that if you set your SLO at 99.9%, your error budget is 0.1%.
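The relationship between request-based availability, the SLO, and the error budget can be sketched in a few lines. The function names and the traffic numbers are made up for illustration, and the SLO is passed as a string so that `Fraction` keeps the arithmetic exact instead of relying on floating point.

```python
from fractions import Fraction


def availability(successful: int, total: int) -> Fraction:
    """availability = successful requests / total requests"""
    return Fraction(successful, total)


def error_budget(slo: str) -> Fraction:
    """Share of requests that may fail while still meeting the SLO."""
    return 1 - Fraction(slo)


def allowed_failures(slo: str, total: int) -> int:
    """Whole number of failed requests the budget covers for this much traffic."""
    return int(error_budget(slo) * total)


slo = "0.999"        # a 99.9% SLO
total = 1_000_000    # hypothetical request count for the period
failed = 400         # hypothetical failures observed so far

print(f"error budget: {float(error_budget(slo)):.2%}")
print(f"measured availability: {float(availability(total - failed, total)):.2%}")
print(f"failures still affordable: {allowed_failures(slo, total) - failed}")
```

With a 99.9% SLO and one million requests, the budget covers 1,000 failed requests, so 400 failures leave room for 600 more.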
Why do you need an error budget?
In a perfect world, you wouldn’t have to choose between uptime and innovation. In reality, however, you need to find a balance between the two. During the development of the product, there may be some tension between the development and SRE teams. The two teams are evaluated with different metrics: product development performance relies on product velocity, while SRE is measured by the reliability of a service. When one group wants to push new code as quickly as possible and the other pushes back against the rate of change, it’s hard to reach an understanding. How do you find the right balance when different perspectives mean different expectations?
The error budget is supposed to solve this issue:
The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.
There are also many companies that do NOT have an SRE team. Does that mean that there’s no tension there? There sure is, but it sits between the dev team and the product owner. The development team is held accountable for their work and wants to have enough time to do their job properly and deliver reliable services, while the product owner usually pushes to have more features in an even shorter time.
So what are the steps of working with an error budget? Google describes their practices:
- Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.
- The actual uptime is measured by a neutral third party: our monitoring system.
- The difference between these two numbers is the “budget” of how much “unreliability” is remaining for the quarter.
- As long as the uptime measured is above the SLO—in other words, as long as there is error budget remaining—new releases can be pushed.
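The four steps above can be sketched as a simple release gate. This is only an illustration of the idea, not Google’s tooling: the `QuarterStatus` class and the numbers are hypothetical, and in practice the measured value would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuarterStatus:
    slo: float       # step 1: target set by Product Management, e.g. 0.999
    measured: float  # step 2: uptime reported by the monitoring system

    @property
    def budget_remaining(self) -> float:
        """Step 3: the difference between measured uptime and the SLO."""
        return self.measured - self.slo

    def can_release(self) -> bool:
        """Step 4: releases may go out while measured uptime is above the SLO."""
        return self.measured > self.slo


healthy = QuarterStatus(slo=0.999, measured=0.9995)
exhausted = QuarterStatus(slo=0.999, measured=0.998)

for status in (healthy, exhausted):
    verdict = "release" if status.can_release() else "hold releases"
    print(f"measured {status.measured:.2%} against SLO {status.slo:.2%}: {verdict}")
```

Because the gate depends only on two numbers that both teams agreed on in advance, the decision to release or hold does not need to be negotiated each time.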
What are the benefits of the error budget?
The main advantage is that it helps to find the balance between reliability and innovation. And this is not easy! When there are clear conditions defined, everybody working with the service knows what to do. If the SLOs are met, new releases can be pushed. Are you close to exhausting your budget? In such a case, resources will be invested into making the system more resilient and improving its performance. It seems like you still have to choose… Can you really only have one thing or the other? Reliability or innovation? No. The error budget doesn’t have to work as an on/off switch. You can slow down new releases or roll them back when the error budget is close to being used up, and then, once service performance has improved, get back on track with pushing new releases at a pace you’re comfortable with.
The error budget guides decisions: if the error budget is full, the team can take more risks, but if it’s drained, the team won’t be risking as much. It’s not a strict rule that you have to obey at all costs; it is a guideline that helps all team members see how the service is doing.
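To show how the budget can guide decisions gradually instead of acting as an on/off switch, here is a minimal sketch. The 75% and 100% thresholds, the function names, and the wording of the guidance are assumptions for illustration; each team would pick its own.

```python
def budget_consumed(failed: int, total: int, slo: float) -> float:
    """Fraction of the error budget already used (1.0 means fully spent)."""
    allowed = total * (1 - slo)
    if allowed == 0:
        # A 100% SLO or no traffic leaves no budget to spend.
        return float("inf") if failed else 0.0
    return failed / allowed


def release_guidance(consumed: float) -> str:
    """Map budget consumption to a suggested pace; the thresholds are hypothetical."""
    if consumed >= 1.0:
        return "pause releases and prioritize reliability work"
    if consumed >= 0.75:
        return "slow down releases"
    return "release at the normal pace"


# Hypothetical failure counts out of one million requests with a 99.9% SLO.
for failed in (400, 800, 1200):
    used = budget_consumed(failed, 1_000_000, 0.999)
    print(f"{failed} failures: {used:.0%} of budget used, {release_guidance(used)}")
```

The returned text is a suggestion for the team to discuss, in line with treating the budget as a guideline and not as a hard rule.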