Everything fails; that’s a fact. Not even the best system has 100% reliability, right? But then, how many times have you seen Google being unavailable? I’ve never seen that, and if for some reason Google wouldn’t load, I’d blame my internet connection. There’s no way it’s Google’s fault. Believe it or not, Google doesn’t have 100% reliability either. And if even they embrace failure, you should too. Just use your error budget wisely.
What is an error budget?
You already know that 100% reliability isn’t achievable, but you may still want to strive for perfection. There’s no point in that, to be honest. Extreme reliability slows down the delivery of features. On one hand, you may want your system to be as reliable as possible, but on the other hand, you will clearly see more value in features that contribute to your business. And honestly, a lot of technical work doesn’t have that clear business value, so it sometimes gets pushed aside by a product owner who wants new features delivered ASAP.
But let’s get back to numbers. If 100% reliability is not doable, then maybe 99.999% is?
It may be, but keep in mind that most users are limited by the reliability of their cellular network or the device they’re using. If they’re using a device with 99% reliability, they really won’t notice a difference between 99.99% and 99.999% service reliability. High reliability is a must; extreme reliability isn’t. Instead of focusing on being as close to 100% as possible, you can think about how to recover from failure. Systems often fail when they change, but not changing anything isn’t the answer either. Your users need updates. What you need to know, though, is when to change.
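To put rough numbers on the device-versus-service point, here’s a minimal sketch in Python. It assumes that the device and the service fail independently and that a request needs both of them to work, so their availabilities simply multiply. That’s a simplification of mine, not something measured, and the helper names (`downtime_minutes_per_year`, `end_to_end`) are made up for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of unavailability per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def end_to_end(*availabilities: float) -> float:
    """Availability of a chain of components that all have to work.

    Assumption: the components fail independently of each other.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


if __name__ == "__main__":
    device = 0.99  # the user's device or network, as in the example above
    for service in (0.9999, 0.99999):
        combined = end_to_end(device, service)
        print(
            f"service alone: {downtime_minutes_per_year(service):7.1f} min/year, "
            f"seen through the device: {downtime_minutes_per_year(combined):7.1f} min/year"
        )
```

Under these assumptions, the extra nine takes the service alone from roughly 53 to roughly 5 minutes of downtime a year. Seen through a 99% device, though, the user still experiences around 5,300 minutes of unavailability either way, so the improvement is well under 1% of what they actually notice.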
Google’s Site Reliability Engineering (SRE) approach suggests balancing the risk of unavailability with the goals of rapid innovation and efficient service operations, which in turn translates into service optimization and user satisfaction.
OK, but what is this error budget thing?
When you measure service risk, you need to somehow assess the performance of the system and track improvements – and the focus is usually on unplanned downtime.
In the Site Reliability Engineering book, unplanned downtime is defined as follows: