This article is missing the single most important question about being on call: how often you get called.
It's one thing to be on call where you get called 2-3 times a year, because you're working on a quality system where bugs get fixed more often than they get introduced. Then the pay, if any, is mostly compensation for hurting your social life.
It's another to be on call where you get called 2-3 times a week, because the organisation has decided calling you is cheaper than fixing the underlying problems. In that case, the compensation better be worth messing up your sleep cycle and upsetting your partner.
There are a few very important things to consider if you’re being asked to be on call:
1. Expected response time
2. Number of times called on average
3. Average time spent working to fix a problem.
#2 and #3 are pretty obvious. These things actually eat into your life because you’re actually working.
#1 is harder to put a value on. Say I’m required to be logged into the VPN and starting to dig in within 10 minutes. That means I cannot reasonably leave my house to eat dinner with my family, go to Lowe’s to grab some PVC because the pipe for my sump pump started leaking, walk my dogs, take my kids outside to teach them how to ride a bike, etc.
I feel I should be compensated for having to be ready to go and not having the freedom to live my life. That in itself is an interruption.
Being on call is a burden, even if you’re not called in.
When my spouse was on call for a hospital, they had to sit with their phone in their lap when we went to the movies. I had to be prepared to Uber home because they’d need the car. It’s not fun!
The studies so far show that people are somewhat more stressed when on call, even when they are unlikely to be called. I know for a fact that being on call can raise stress levels, but I think there are many variables that should be considered to balance things out. For example, if pages can easily be escalated horizontally to another team member (the “going to be unavailable while I watch a movie / have dinner” scenario), as I often do, then stress probably drops significantly compared to a first responder who has few ways to hand off responsibility temporarily.
Being on call for a hospital is absolutely a different experience IMO than being on call for a software system. But having had several family members work in healthcare in different functions, it’s absolutely clear to me that a big reason the US healthcare system is buckling is an undersupply of doctors and nurses, who at the very least could cover for each other and provide better care per patient.
That's one problem with the fixed on-call rate that some organisations offer. It's a hefty chunk of cash and sounds generous to the engineers. But the cost is known and sunk up front, not proportional to the number of call-outs, so the business sees it as a fixed operational expense rather than a measure of how fucked things are.
The performance metric quickly becomes how many people you still have on cover who haven't quit to work somewhere else because they are burned out.
The best (FSVO "best") on-call compensation I had was a fixed sum per standby week, for simply carrying the pager. Then, on top of that, overtime at the going rate (150%, 200%, or 300% of hourly rate) depending on when the call-out happened, with a minimum 3h compensation for any calendar day in which a call-out happened (it used to be 3h per incident, until someone had 12 call-outs that each took about 5 minutes to fix).
For partial on-call weeks, the standby comp was adjusted. For good or bad, it was a literal pager, so we could easily adjust things within the team and file paperwork afterwards, as the NOC just called the pager. The downside was that handing the pager over required being physically present.
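A rough sketch of how that scheme adds up, in case the arithmetic helps. The names (`CallOut`, `weekly_on_call_pay`) and the example rates are made up for illustration, and how the top-up hours were actually priced is not stated above; paying them at the highest multiplier seen that day is purely an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CallOut:
    day: date          # calendar day on which the call-out happened
    hours: float       # time actually spent working on it
    multiplier: float  # overtime rate: 1.5, 2.0 or 3.0 depending on when it happened


def weekly_on_call_pay(standby_fee, hourly_rate, call_outs, min_hours_per_day=3.0):
    """Standby fee plus overtime, with a minimum of paid hours per calendar day."""
    # Group call-outs by calendar day, since the minimum applies per day.
    by_day = {}
    for c in call_outs:
        by_day.setdefault(c.day, []).append(c)

    overtime = 0.0
    for day_calls in by_day.values():
        worked = sum(c.hours for c in day_calls)
        paid_hours = sum(c.hours * c.multiplier for c in day_calls)
        shortfall = max(0.0, min_hours_per_day - worked)
        # Assumption: top-up hours are paid at the highest multiplier of that day.
        paid_hours += shortfall * max(c.multiplier for c in day_calls)
        overtime += paid_hours * hourly_rate
    return standby_fee + overtime


# The "12 call-outs of about 5 minutes" day, with hypothetical rates.
busy_day = [CallOut(date(2024, 1, 6), 5 / 60, 2.0) for _ in range(12)]
print(round(weekly_on_call_pay(300.0, 50.0, busy_day), 2))
```

With the per-day minimum, that day is paid as 3 hours. Under the older per-incident rule the same day would have counted as 12 × 3h = 36 hours, which is presumably why the rule changed.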
Google had fantastic software quality and still had SRE teams expecting to be paged twice a week. They had that because they had tremendous software quality; they paged well before there was impact that users would care about, and proactively spent time fixing their problems. Being paged, usually during daylight hours, allowed good bugs to be filed.
What a lot of people (even some working in devops) don't get is that pages/SLA metrics are a "budget". If you never get paged and your system never goes "down" (down = you need to fix something, not necessarily down for the user), it means that you're doing something wrong. Obviously you don't want to overdo it, but if you have an oncall rotation for a service that never pages or pages so rarely, you're wasting human and engineering resources.
If that's the case, you need to reconsider whether you need a devops/SRE team in the first place, whether you need an oncall rotation, or maybe whether you should be more proactive in implementing/releasing/deploying new features as long as you stay within SLO budget. We've had weeks and months where we just looked at our graphs and uptime budget and went "our systems are getting worse, we need to slow down releasing and tighten up the automation", and we've also had months where our load was so light that we'd consider doing large migrations or more daring experiments (for our devs) because those also improve our service and our users' experience.
> if you have an oncall rotation for a service that never pages or pages so rarely, you're wasting human and engineering resources.
Not really.
If the company loses $20,000 per minute when the system is down, the system should be well engineered, so it rarely goes down - but it's still worth paying $700/week to have someone available in 10 minutes if it does.
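To put rough numbers on it, here is a back-of-the-envelope sketch. Only the $20,000/minute and $700/week figures come from the example above; the function names and the outage scenario (one outage a year, shortened by 30 minutes) are hypothetical.

```python
def break_even_minutes_per_week(on_call_cost_per_week, downtime_cost_per_minute):
    """Minutes of downtime the rotation must save per week, on average, to pay for itself."""
    return on_call_cost_per_week / downtime_cost_per_minute


def expected_weekly_saving(outages_per_year, minutes_saved_per_outage,
                           downtime_cost_per_minute, weeks_per_year=52):
    """Average weekly value of having someone respond faster (all inputs are guesses)."""
    return outages_per_year * minutes_saved_per_outage * downtime_cost_per_minute / weeks_per_year


minutes = break_even_minutes_per_week(700, 20_000)
print(round(minutes * 60, 1), "seconds of downtime per week to break even")

# Hypothetical scenario: one outage a year, on-call shortens it by 30 minutes.
print(round(expected_weekly_saving(1, 30, 20_000)), "dollars saved per week on average")
```

The break-even point is about two seconds of avoided downtime per week, so even a system that almost never pages can justify the rotation at those prices.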
I'm just going to leave this here: https://sre.google/sre-book/embracing-risk/ because I think it does a better job at explaining what I'm trying to say than I could ever do. If you're never paging and your system never goes down your error budget is too high and you're very likely wasting too many resources on stuff that you don't need, regardless of whatever oncall rotation you have behind it.
Specifically, see the "Motivation for Error Budgets" section of that article.
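For anyone who hasn't read it, the core arithmetic is small: the error budget is whatever the SLO leaves over. A minimal sketch, assuming a time-based availability SLO over a 30-day window; the function names and the 99.9% / 10-minute figures are just illustrative choices, not anything from the book.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO (e.g. 0.999) over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent; negative means the SLO is blown."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


print(round(error_budget_minutes(0.999), 1), "minutes of budget per 30 days")
print(round(budget_remaining(0.999, downtime_minutes=10), 2), "of the budget left")
```

A 99.9% target over 30 days leaves about 43.2 minutes. If a team consistently ends the window with nearly all of that unspent, that is the situation I'm describing: the slack could have gone into faster releases or riskier migrations.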
Events that present a risk of blowing through your error budget can still be scheduled so they don't page outside of normal hours.
If nothing is ever abnormal in your system then yes, your error budget is probably too high. But there is also big space between "nothing is ever abnormal" and "I had to get up at 2AM twice a month".
> Events that present a risk of blowing through your error budget can still be scheduled so they don't page outside of normal hours.
Right, that's why you usually schedule releases over a Tuesday to Thursday (giving you ample rollout/canary/rollback time). You don't schedule on Monday (timezone) or Friday (weekend).
> But there is also big space between "nothing is ever abnormal" and "I had to get up at 2AM twice a month".
Speaking from Google principles, you will never wake up at 2AM because we don't do overnight oncall. We have a split rotation across the globe so there's always someone within waking hours to take care of pages. The real question is what happens between 6am-9am (one timezone) and 6pm-12am (other timezone, at least for my Ireland/New York split team oncall). Obviously you don't do pushes/releases during sketchy periods like I mentioned, and you usually have a "prod freeze" during the holiday period (the couple of weeks between Christmas and New Year), but stuff fails for whatever reason anyway.
I used to work in large datacenter deployment. We'd have sketchy disks that would fail maybe once or twice a week, and we'd get paged for certain machines getting stuck in repairs because our automation would fail under certain assumptions. We'd have machines that would go down and never come back up and our automation wouldn't detect that, etc etc. These are all tricky hardware issues that can be made more robust with software, or you get better hardware (some of our old hardware was REALLY bad and would randomly die with seemingly random errors and it took us months to migrate and decommission it properly), etc. These are all problems that one way or another will surface through your SLO budget and can affect how "daring" you can be during planned migrations and new releases, but it's still stuff you need to take care of even outside of work hours.
So, yes, you don't schedule big stuff outside of work hours, but that's not the whole picture either.
Even more than with tech stacks, deferring on-call process to "Google does it this way so we should do it this way" feels like a terrible idea. There's maybe ten companies in the world that have Google's scale and needs in this regard, and even though it would probably be good for their developers for Facebook to adopt Google's processes based on what I see in this thread, they also probably won't.
The rest of us have to muddle with questions like, how do we do it if we only have 20 people and they're only in two time zones and only half of those really know how to diagnose and recover a corrupted filesystem? A Google-like approach to error-budget-centric risk management just doesn't fit into that world.
If you only have 20 guys and a corrupted filesystem is one of your potential problems, you're doing it wrong. That's why people have switched to cloud services - You pay for the ability to flatten your systems, and if you've not architected with the ability to flatten your systems, you're gonna be SOL in those situations.
Ultimately, you're resilient for what you prepare for. There's a lot of tradeoffs in spend. I get that that's an example and probably not a real pressing concern, but the point is: You shouldn't have everyone trained in everything. You should have escalation paths for everything non-obvious. You should also train your people better.
There are definitely stable systems where the operational budget is so small that the cost of a human trained on it is higher than the ops budget. This can result in one of two terrible consequences: Either nobody touches it and it gets stale, or it's viewed as being "Underdeveloped" and poorly-considered features are added.
Trying to "optimize" these systems to use more of their error budget and save operational cost results in fiascos - There was a multi-day outage (not full outage, but several full days below SLA) on a minor system while I was at Google where it boiled down to "Bad Engineer tries to justify their job and management lets them implement a poor design over the objections of everyone".
Not always that simple. 2-3 times a week is nothing! Try being on call in AWS, or any product/service at that scale. How often you get paged has less to do with your organization and more to do with the scale of your systems and business.
I’ve been part of Borg on-call at Google, the software that manages 90+% of the hardware there (and there is a lot of hardware). There were week-long stretches without any pages. Don’t ship garbage software and it’ll be alright at any scale.
The whole meaning of "scaling" is that you can do the same thing, but bigger. If your QoS is qualitatively different you've failed to actually scale your system. At best you've scaled a couple parts of it.
Errors in a system are correlated with usage, but a large part of our job as engineers is to push that correlation down as hard as we can. Even at low scale I’ve been in organizations with horrible paging levels (2-3 per night was typical), but that means the system is unsustainable: either workers burn out in the end, or customers simply accept the error rates. At sufficiently large team sizes and error rates, you eventually run out of people to hire to offset attrition, which is what I’ve seen some people report here and there about teams at Amazon and AWS.
Another big parameter is how many people are in the on-call rotation.
If the rotation is spread over only 2 or 3 ops people in the team, well, being on-call every other week, even on reliable systems, can really suck (given you must always make yourself available).
Things can get even worse in periods with a lot of PTO, like summer or Christmas. During these periods, if the team is small, being on-call 2 or 3 weeks in a row is not uncommon.
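The effect of rotation size is easy to quantify. A tiny sketch, assuming one-week shifts shared evenly among whoever is available; `on_call_share` is a made-up helper name.

```python
def on_call_share(rotation_size, people_on_pto=0):
    """Fraction of weeks each available person spends on call, assuming even weekly shifts."""
    available = rotation_size - people_on_pto
    if available < 1:
        raise ValueError("nobody is left to carry the pager")
    return 1 / available


for size, pto in [(8, 0), (3, 0), (3, 1), (2, 1)]:
    share = on_call_share(size, pto)
    print(f"{size} people, {pto} on PTO: on call {share:.0%} of weeks")
```

A team of 3 with one person away is already at every other week, and a team of 2 with one person away means the remaining person is on call continuously until their colleague is back.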