Most engineering teams discover the realities of on-call the same way: a critical system goes down at three in the morning, somebody is paged, that somebody is exhausted, the response is slower than it should be, and a quiet conversation begins about whether the current setup is sustainable.
For distributed teams — and increasingly, most teams are distributed — this conversation happens earlier and weighs more heavily. When your engineers are split across Dubai, Issaquah, and Durgapur, on-call is not just a scheduling problem. It is a cultural problem, an organisational problem, and a technical problem all at once.
This is how we think about designing on-call for distributed teams, drawn from the engagements where we have done it well and the ones where we learned the hard way. Even if your team is not yet running 24/7, the principles below are the ones we would use the moment you do.
The follow-the-sun model is more myth than reality
The first thing most people propose for distributed on-call is a follow-the-sun rotation: engineers in each time zone cover their own working hours, and incidents are seamlessly handed off as the world turns. This sounds elegant. In practice, it almost never works as cleanly as the slides suggest.
Naive follow-the-sun runs into several problems. Handoffs require shared context, which is expensive to maintain across people who are not in the same room. Engineers in one region often do not have the permissions, familiarity, or confidence to work on systems primarily owned by another region. And the cultural overhead — "is it OK to page someone in Dubai at noon when it is the middle of their workday?" — turns every escalation into a hesitation.
The teams we have seen run distributed on-call well are not running pure follow-the-sun. They are running a hybrid: a primary on-call rotation that follows the workday for one region, and a secondary escalation path that brings in other regions only when the primary is genuinely overwhelmed or asleep.
The primary discipline is making handoffs cheap
Whatever rotation model you choose, the single most important thing you can invest in is the handoff. Handoffs are where incidents go to die — or where they get resolved smoothly because the next person picking up has everything they need.
A well-handed-off incident has:
- A clear, written summary of what is happening, in a fixed place (Slack channel, incident tracker — pick one and stick to it)
- The current state of any active mitigations, with their expected effect
- The next planned action, with timing
- The names of anyone else who has been pulled in
- Links to relevant dashboards, log queries, and runbooks
None of this is new. What is new for distributed teams is that the cost of a missing handoff item is higher, because the next person picking up cannot just walk over and ask. The discipline is to write everything down because the alternative is waking someone up to ask.
The quality of your handoffs is the ceiling on the quality of your distributed on-call. Everything else is downstream of that.
Runbooks that engineers will actually use
Every on-call setup ends up with runbooks. Most runbooks end up being either too thin to be useful or too thick to be read at three in the morning. Designing runbooks for distributed on-call is partly a writing exercise and partly an organisational one.
The structural rules we apply
A runbook that gets used has three properties. It is searchable — somebody under pressure can find it from a single keyword. It is actionable — every step is something a tired engineer can execute without judgement calls. And it is maintained — the runbook reflects how the system actually works today, not how it worked six months ago.
The third property is the hardest. Runbooks rot. The discipline we apply on our larger engineering engagements is to require a runbook update as part of the post-incident review for any incident where the existing runbook was incorrect, missing, or incomplete. This sounds bureaucratic; in practice it is the difference between runbooks that are trusted and runbooks that are ignored.
Avoiding the 3am hero culture
The most insidious failure mode in any on-call setup is the emergence of a small number of engineers who quietly carry more than their share. They respond fastest, they know the systems deepest, they get paged first because they are the ones who will fix it. Within months, they are burned out, resentful, or both. Within a year, they leave.
This pattern is even more dangerous on distributed teams, because the imbalance is harder to see. The engineer in Durgapur who is taking incidents at all hours because nobody else is awake. The engineer in Dubai who is the only one who knows the payments system. The engineer in Issaquah who has been the de facto incident commander for six months because nobody else has stepped up.
The countermeasures are not glamorous. They are:
- Rotation that is genuinely enforced. No engineer is the primary on-call more than one week in four. If the rotation cannot accommodate this, the rotation is broken and needs more people, not heroics.
- Cross-training that is taken seriously. No critical system should have only one person who can debug it under pressure. Shadow rotations, paired investigations, and runbook walkthroughs are the mechanisms that make this real.
- Honest workload tracking. How many pages did each person get last week, outside their primary on-call slot? If the answer skews heavily towards one or two people, the system needs to be rebalanced.
- A culture that does not celebrate exhaustion. "I was up all night fixing the system" should be a problem to solve, not a story to tell.
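The workload-tracking countermeasure can start very small. A sketch, assuming you can export a page log from your paging tool (the names, log shape, and "more than half" threshold here are all illustrative):

```python
from collections import Counter

# Toy page log: (engineer, whether they were the primary on-call
# at the time). In practice this would come from your paging
# tool's export or API.
pages = [
    ("asha", False), ("asha", False), ("asha", True),
    ("omar", False), ("lena", True), ("asha", False),
]

# Count only pages taken outside the person's primary slot.
off_rotation = Counter(name for name, primary in pages if not primary)
total = sum(off_rotation.values())

# Flag anyone absorbing more than half of all off-rotation pages.
heroes = [name for name, n in off_rotation.items() if n > total / 2]
print(heroes)  # -> ['asha']
```

Even a crude report like this, run weekly, surfaces the silent heroes that a distributed team otherwise never sees.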
The technical foundations that make on-call survivable
On top of the cultural and process disciplines, there is a technical foundation that makes distributed on-call survivable rather than gruelling. The short version is: invest heavily in observability, alerting, and automated remediation.
Specifically:
- Alerts that mean something. Every page should correspond to a customer-visible problem or a clear precursor to one. Noisy alerts erode the trust that makes the whole system work.
- Dashboards that are usable under stress. When an engineer is paged at three in the morning, they should not have to remember which dashboard to open. There should be one starting point that links to everything else.
- Automated mitigations for known failure modes. If you have been paged five times for the same thing, the sixth time should be handled automatically.
- Post-incident reviews that drive real change. Every meaningful incident should produce at least one durable improvement — a fix, a runbook update, an alert tuning. Otherwise the incidents recur.
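The automated-mitigation point is, at its simplest, a lookup from known alert to known fix that runs before anyone is paged. A minimal sketch — the alert names and remediation actions are hypothetical, and a real system would log and rate-limit every automated action:

```python
# Known failure modes mapped to their automated remediations.
REMEDIATIONS = {
    "worker-queue-backlog": lambda: "scaled worker pool from 4 to 8",
    "stale-cache-entries": lambda: "flushed cache namespace 'pricing'",
}

def handle(alert: str) -> str:
    """Try a known remediation first; page a human only for
    failure modes we have not automated yet."""
    action = REMEDIATIONS.get(alert)
    if action:
        return f"auto-remediated: {action()}"  # no human paged
    return "paging primary on-call"            # unknown failure mode

print(handle("worker-queue-backlog"))
print(handle("disk-full"))
```

The useful discipline is the ratchet: every repeat page is a candidate for an entry in that table.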
The setup we would design today
If we were designing a distributed on-call setup from scratch today, it would look like this:
- Primary on-call follows the regional workday. Each region has a primary rotation covering its own business hours, with at least four engineers in the rotation so no one is on more than one week in four.
- Out-of-hours coverage is by exception. Pages outside business hours go first to the primary in the next active region, then escalate to the original primary only if no one responds within a defined window.
- Handoffs are written, brief, and consistent. Same format every time. Same place every time. Nobody should have to page a colleague just to ask "what is going on?"
- Runbooks are kept honest by tying their maintenance to incident reviews.
- Workload is tracked and rebalanced quarterly. No silent heroes.
- The on-call experience is treated as a product. Engineers should be exposed to it for long enough to know it, but not so long that it wears them down.
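The "out-of-hours coverage by exception" rule above can be sketched as routing logic. This is a toy, assuming fixed business-hours windows per region in UTC (real setups would handle DST, holidays, and per-person overrides), showing how an out-of-hours page finds the next active region before falling back to the home primary:

```python
from datetime import datetime, timezone

# Hypothetical business-hours windows per region, as UTC hours.
REGIONS = {
    "Dubai": range(4, 13),
    "Issaquah": range(15, 24),
    "Durgapur": range(3, 12),
}

def active_regions(now_utc: datetime) -> list[str]:
    """Regions whose business hours contain the current UTC hour."""
    return [r for r, hours in REGIONS.items() if now_utc.hour in hours]

def first_page_target(home_region: str, now_utc: datetime) -> str:
    """In-hours pages stay in the home region; out-of-hours pages
    go to an active region first, falling back to the home primary
    only if nobody else is awake."""
    active = active_regions(now_utc)
    if home_region in active:
        return home_region
    return active[0] if active else home_region

# 06:00 UTC is out of hours for Issaquah, so the page lands in Dubai.
print(first_page_target("Issaquah",
                        datetime(2024, 1, 1, 6, tzinfo=timezone.utc)))
```

A second escalation step, after the defined response window, would then route back to the original primary; that timer is the part your paging tool already does well.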
None of this is revolutionary. All of it is uncomfortable to implement, because it requires sustained organisational discipline. But the alternative is a steady erosion of your best engineers, and that is the most expensive price any team can pay.
Work with us
Have a project that needs senior engineering attention?
We work with founders and enterprise teams across Dubai, the US, and India. If something here resonates with what you're building, we'd be glad to talk.
Start a conversation →