What is Toil, and Why Are SREs Obsessed with It?
Site Reliability Engineers (SREs) love to hate toil, but what exactly is toil? And why are SREs obsessed with removing toil? In a nutshell, Site Reliability Engineering is what happens when you treat IT operations like a software problem. But… how do you treat operations like a software problem?
SRE can feel opaque, but in practice, it is the essence of engineering: you remove inefficiencies in one component so that other components can perform quantifiably better. Over time, the efficiency of your operations should increase.
When software engineers write code, they want it to be simple, fast, and reliable. We refer to this as being free of “bugs and cruft.” SREs want the same thing for operations. In the realm of operations, “bugs and cruft” can be described by one word: toil. Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. Put simply, toil is engineering effort that produces no lasting value.
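One way to make that definition concrete is to treat the six traits as a checklist. The snippet below is only a sketch: the `Task` class, the example tasks, and the threshold of four traits are all hypothetical choices for illustration, not an established SRE scoring method.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Task:
    """A piece of operational work, tagged with the six toil traits."""
    name: str
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_linearly: bool


def toil_traits(task: Task) -> list[str]:
    """Return the names of the toil traits that apply to this task."""
    return [
        f.name
        for f in fields(task)
        if f.name != "name" and getattr(task, f.name)
    ]


def looks_like_toil(task: Task, threshold: int = 4) -> bool:
    """Flag a task as likely toil; the threshold is an arbitrary assumption."""
    return len(toil_traits(task)) >= threshold


# Hypothetical examples: one obvious piece of toil, one real engineering task.
restart = Task("restart stuck worker by hand", True, True, True, True, True, True)
design = Task("design a new sharding scheme", True, False, False, False, False, False)

for task in (restart, design):
    print(f"{task.name}: toil={looks_like_toil(task)} traits={toil_traits(task)}")
```

A checklist like this will not settle every argument, but it gives a team a shared vocabulary for deciding which work to automate first.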
When you deploy software, it should be reliable, secure, and observable. However, there is no such thing as 100% reliable or 100% secure. When issues occur (and they will, I guarantee it), IT operations and software engineers need observability to identify the issue, remediate, and recover. Slowing down to find all potential issues before the release isn't a viable solution, because slower releases sacrifice velocity.
The answer lies in automation. Well-designed and well-implemented automation helps remove toil from deploying your software. For example, automated tests in your Continuous Integration pipelines help identify software bugs proactively, and automated infrastructure provisioning reduces the engineering time needed to stand up environments. We need to remove as much manual, repetitive, and low-return work as possible so we can focus on hard engineering problems.
Eliminating toil also helps accelerate remediation and recovery. When things go wrong, toil acts as a roadblock. If it is Black Friday and your website's ordering system goes down, every second of downtime translates into lost sales. The fewer roadblocks on the way to recovery, the less downtime, which means fewer lost sales. Any time you can eliminate toil, your team can focus time and effort on high-value engineering tasks. Removing toil from the software development process makes the entire lifecycle quantifiably more efficient, effective, reliable, and secure. It also makes the software development experience more enjoyable, which in turn leads to better velocity.
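The Black Friday reasoning is simple arithmetic, and it can help to see it written out. Every number below is made up for illustration: the revenue rate, the recovery steps, and their durations are assumptions, not measurements from any real incident.

```python
def recovery_seconds(step_durations: list[int]) -> int:
    """Total time to recover, assuming the steps run one after another."""
    return sum(step_durations)


def lost_sales(downtime_seconds: int, revenue_per_second: int) -> int:
    """Sales lost during an outage, assuming a constant revenue rate."""
    return downtime_seconds * revenue_per_second


# Hypothetical figures: dollars per second of sales during peak traffic.
REVENUE_PER_SECOND = 50

# Hypothetical recovery steps (diagnose, roll back, verify), in seconds.
manual_steps = [300, 600, 900]      # done by hand, with toil in the way
automated_steps = [30, 60, 90]      # same steps with the toil automated away

manual_loss = lost_sales(recovery_seconds(manual_steps), REVENUE_PER_SECOND)
automated_loss = lost_sales(recovery_seconds(automated_steps), REVENUE_PER_SECOND)

print(f"manual recovery:    {recovery_seconds(manual_steps)} s, ${manual_loss} lost")
print(f"automated recovery: {recovery_seconds(automated_steps)} s, ${automated_loss} lost")
print(f"saved by removing toil: ${manual_loss - automated_loss}")
```

The exact figures matter less than the shape of the calculation: lost sales grow in direct proportion to recovery time, so every manual step removed from the recovery path pays for itself during an outage.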
Zac's article originally appeared in 97 THINGS EVERY CLOUD ENGINEER SHOULD KNOW