Doing the dishes isn’t the hardest task, but it’s always the worst chore. The burden is cumulative: Once isn’t a big deal, but you can start to feel a little batty doing them over and over again, day in and day out.
Developer toil can feel a lot like doing the dishes – except you might have an entire Costco full of dishwashers that your team is always emptying and reloading. And while the apron might say, “Kiss the cook,” no one even thanks the person doing the dishes.
“Toil” is an evocative word, but developer toil specifically refers to the manual tweaks that developers need to make (often regularly, if not continuously) to keep services running. These tasks don’t result in new code, improvements, or functionality.
Vivek Rau, an engineer at Google, introduced the term in 2016 in Google’s book on site reliability engineering. Since then, it has taken off among SREs, platform teams, and developers of all kinds because it captures an experience that is hard to articulate.
Rau named six characteristics of toil, from manual and repetitive to tactical and O(n) with service growth. Subsequent research, a collaboration between Eindhoven University of Technology and financial services company ING, found that the most common properties are repetitive, manual, and devoid of value.
Toil doesn’t simply mean work people don’t want to do, though the tasks in question often feel that way. After all, what job doesn’t involve little repetitive annoyances?
There are two primary risks here:
Burnout: Too much toil can lead developers to burn out because they have to work too hard and too often.
Disengagement: Too much toil can lead developers to disengage from their work because they’re working on tasks that aren’t fulfilling or valuable.
The consequences of these risks grow as more developers across teams feel the damage. Eventually, entire teams and organizations can suffer from an accumulated hit to productivity and a hard-to-identify gap in innovation.
No developer wants to toil, but toil almost inevitably emerges anyway. Why?
The primary source of toil is the leap from one predictable service to many unpredictable services, i.e., scale.
When a developer builds and tests a service on their machine, they can only be semi-confident about how it will perform in production systems. What happens when 50,000 customers use it at once? What happens when all those users are coming from different regions across the world?
Once live, a developer may repeatedly return to a service to ensure it remains operational under real traffic. Though developers try their best to simulate real-world conditions, the real world always has ways of proving the simulation at least somewhat deficient.
The tricky part of toil is where it comes from. If a newly deployed service breaks immediately, a full fix is immediately and obviously necessary. There is no toil in this case because the issue demands substantial rework with clear value. The opposite case, when the service runs smoothly and predictably in production, also results in no toil. Toil comes out of the gray zone in between.
When services work (or appear to work) when they’re first deployed but slowly fray over time, developers can get stuck fine-tuning them on an ongoing basis. This is toil, but any single tweak won’t feel burdensome to most developers. Toil only becomes noticeable when it accumulates, when many services require endless, ongoing tweaks.
Worse, as the Eindhoven study shows, the biggest reason toil is hard to eliminate is how persistent it is. If you have ever noticed one-off toil, you’ll understand how stubborn it can be.
Over time, developers can lose hours and days in twenty-minute chunks. This “death by a thousand cuts” dynamic is why Rau writes, “Toil tends to expand if left unchecked and can quickly fill 100% of everyone’s time.”
Eventually, the overall time spent and the interruptive nature of this work can have widespread effects on productivity and focus.
Developers tend to keep a tight focus on how they use their time and are often willing to tell managers and team leads when the proportion of creative, high-value work is out of balance with other work. (Try interrupting a developer deep in a flow state to ask if they want to attend a meeting; you’ll see.)
This can lead managers and engineering leaders to put too much trust in developers’ self-reporting about their experience and productivity levels. But this can be dangerous with topics like toil because developers have a blind spot: checklists.
Developer toil demonstrates this dynamic: Toil can feel productive because developers are checking tasks off a list, but the work doesn’t provide lasting value and distracts from more engaging work.
The dishes example comes in handy again. Doing the dishes feels productive, and it provides some value, but you’re going to be doing the same dishes again come tomorrow.
The work is just necessary enough, though, that it’s easy to miss how much toil has accumulated and to never consider how a concerted effort could eliminate it. (This is where the dishes analogy breaks down, but you can bet we’d stop doing the dishes for a few nights if it meant assembling a robot that did the dishes for us!)
Because toil tends to live beneath awareness, engineering managers who rely too much on developer surveys and developer feedback risk never knowing just how much their developers are toiling.
This is the input/output gap that makes toil so insidious. The inputs are subtle, but the outputs are severe. Let’s return to the two outputs: Burnout from too much toil and disengagement from toil taking over other work.
Burnout stood out in the Eindhoven/ING research. More papers mentioned burnout than any other negative consequence of toil.
Despite the prevalence of this risk, developers might not be able to name toil as a primary cause of their burnout. Burnout is almost always the confluence of many causes, so toil can fade into the background while remaining a significant contributor. Some developers might notice how repetitive and boring their workdays are without being able to name toil in particular.
Other developers, however, might notice toil but consider it a necessary evil. These developers will be even less likely to name toil as a solvable problem or even complain about how much toil they have to do.
Over time, these developers will disengage as the ratio of creative work to menial work shifts toward toil. They might not complain, but they will eventually feel the sting of a career that is not growing like they wanted. When another company offers kinds of work that promise to be more fulfilling, they’re likely to leave.
Engineering teams relying on developer surveys and direct developer feedback alone risk being surprised by the consequences of toil. Only a quantified approach can reveal and track how much time developers spend on toil.
Engineering teams need to take both quantitative and qualitative approaches to track developer toil. They might not be able to eliminate all toil outright, but if they can reduce toil tasks to three minutes each instead of twenty, for example, they can take back a lot of time and ensure developers can context switch more smoothly.
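To make the three-minutes-instead-of-twenty arithmetic concrete, here is a minimal sketch. The task count of 30 per week is an assumed figure for illustration, not a number from the research, and `weekly_toil_hours` is a hypothetical helper.

```python
def weekly_toil_hours(tasks_per_week: int, minutes_per_task: float) -> float:
    """Hours per week spent on toil at a given task count and task duration."""
    return tasks_per_week * minutes_per_task / 60


# Assumption: a team handles 30 toil tasks a week (illustrative only).
TASKS_PER_WEEK = 30

before = weekly_toil_hours(TASKS_PER_WEEK, 20)  # twenty-minute tweaks
after = weekly_toil_hours(TASKS_PER_WEEK, 3)    # three-minute tweaks
saved = before - after

print(f"before: {before:.1f} h/week, after: {after:.1f} h/week, saved: {saved:.1f} h/week")
# prints: before: 10.0 h/week, after: 1.5 h/week, saved: 8.5 h/week
```

Under those assumed numbers, the same toil still exists, but it costs the team an hour and a half a week rather than more than a full working day.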
Before you can address toil, you need to understand the scale of it on your team. To do this, perform a developer toil audit where you focus less on individual issues and complaints and more on patterns.
For example, does one developer tend to take on most of the toil? If so, they might be at higher risk for burnout. By spreading the toil, you can reduce the burden on one developer while looking for solutions.
Similarly, do increases in toil correlate with lowered sentiment from developer survey data? Toil might be affecting team morale more than the team realizes. The Eindhoven University of Technology and ING research, for example, shows that improving morale is the most common positive consequence of reducing toil.
Once you can communicate this correlation, the rate of people noticing toil might rise – making it easier to find and address.
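One rough way to check for that correlation is a Pearson coefficient over weekly toil hours and survey sentiment. The sketch below uses made-up data and hypothetical names (`toil_hours`, `sentiment`); a real audit would pull both series from your own tracking and survey tools, and a handful of data points can only hint at a relationship, not prove that toil causes the drop.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up data: weekly toil hours per team and average survey sentiment (1-5).
toil_hours = [4.0, 6.0, 9.0, 12.0, 15.0, 11.0]
sentiment = [4.2, 4.0, 3.6, 3.1, 2.8, 3.3]

r = pearson(toil_hours, sentiment)
print(f"correlation between toil and sentiment: {r:.2f}")
```

A value close to -1 would suggest sentiment falls as toil rises; a value near 0 would suggest no linear relationship in the data you have.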
The primary source of toil, as we said above, is the unpredictable strain of production-scale traffic. Better tests and better simulations of production environments can then be one of the best ways of reducing toil.
This work tends to be most successful after finding the patterns described above. The effort it takes to improve your test suite and configure your development environments differently is more concrete than toil. Once you can see the accumulation of toil over time, though, improving testing can feel like an easily justifiable upfront investment.
This work complements similar SRE efforts. By thinking about scalability issues earlier on, you’re essentially shifting toil left and reducing how much of it you have to do later.
As you track patterns, you’ll start to notice toil tasks that are more persistent than others. At a high level, a developer might be toiling on four services, for example, but once you consolidate all the toil work, the results can be surprising. You might find that one service accounts for 90% of the total toil time or that the same kind of tweak – adjusting permissions, for example – needs regular repetition across many different services.
This is a great cue to work with the developer to see what precisely is wrong. The developer might not realize how much time a service or issue has taken because each tweak has only taken, say, twenty minutes. When you both see the accumulated time, though, it can often become obvious that a service is buggy and requires a more lasting fix than ongoing tweaks.
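The consolidation step can be sketched as a simple grouping over a toil log. Everything here is hypothetical: the `ToilEntry` fields, the `share_by` helper, and the sample entries are assumptions for illustration, and the same grouping answers the earlier question of whether one developer carries most of the toil.

```python
from collections import defaultdict
from typing import NamedTuple


class ToilEntry(NamedTuple):
    developer: str
    service: str
    kind: str      # e.g. "permissions" or "restart"
    minutes: int


def share_by(entries: list[ToilEntry], field: str) -> dict[str, float]:
    """Fraction of total toil minutes per value of `field`, largest first."""
    totals: dict[str, int] = defaultdict(int)
    for entry in entries:
        totals[getattr(entry, field)] += entry.minutes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {name: minutes / grand_total for name, minutes in ranked}


# Made-up log: each tweak took twenty minutes.
log = [
    ToilEntry("ana", "billing", "restart", 20),
    ToilEntry("ana", "billing", "permissions", 20),
    ToilEntry("ben", "billing", "restart", 20),
    ToilEntry("ana", "search", "permissions", 20),
    ToilEntry("ana", "billing", "restart", 20),
]

for field in ("service", "kind", "developer"):
    print(field, share_by(log, field))
```

In this made-up log, no single tweak looks expensive, but grouping shows the billing service and one developer each account for 80% of the total time.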
The work of actually reducing toil and strengthening culture requires going above and beyond the obvious. Great engineering leaders look to the unstated needs beneath the stated wants and try to solve problems at the root.
A great engineering culture isn’t reactive, waiting for developers to complain or burn out. A great engineering culture is demonstrated by leaders who seek out problems – using quantitative and qualitative approaches – to make engineering practices better.