Most engineering teams discover the realities of on-call the same way: a critical system goes down at three in the morning, somebody is paged, that somebody is exhausted, the response is slower than it should be, and a quiet conversation begins about whether the current setup is sustainable.

For distributed teams — and increasingly, most teams are distributed — this conversation happens earlier and carries more weight. When your engineers are split across Dubai, Issaquah, and Durgapur, on-call is not just a scheduling problem. It is a cultural problem, an organisational problem, and a technical problem all at once.

This is how we think about designing on-call for distributed teams, drawn from the engagements where we have done it well and the ones where we learned the hard way. Even if your team is not yet running 24/7, the principles below are the ones we would use the moment you do.

The follow-the-sun model is more myth than reality

The first thing most people propose for distributed on-call is a follow-the-sun rotation: engineers in each time zone cover their own working hours, and incidents are seamlessly handed off as the world turns. This sounds elegant. In practice, it almost never works as cleanly as the slides suggest.

Naive follow-the-sun runs into several problems. Handoffs require shared context, which is expensive to maintain across people who are not in the same room. Engineers in one region often do not have the permissions, familiarity, or confidence to operate systems primarily owned by another region. And the cultural overhead — "is it OK to page someone in Dubai at noon when it is the middle of their workday?" — turns every escalation into a hesitation.

The teams we have seen run distributed on-call well are not running pure follow-the-sun. They are running a hybrid: a primary on-call rotation that follows the workday for one region, and a secondary escalation path that brings in other regions only when the primary is genuinely overwhelmed or asleep.

The primary discipline is making handoffs cheap

Whatever rotation model you choose, the single most important thing you can invest in is the handoff. Handoffs are where incidents go to die — or where they get resolved smoothly because the next person picking up has everything they need.

A well-handed-off incident has:

  1. A one-sentence summary of what is broken and who is affected.
  2. A record of what has already been tried, and with what result.
  3. The current working theory, clearly labelled as a theory.
  4. Links to the relevant dashboards, logs, and alerts.
  5. The next step the incoming responder should take.

None of this is new. What is new for distributed teams is that the cost of a missing handoff item is higher, because the next person picking up cannot just walk over and ask. The discipline is to write everything down because the alternative is waking someone up to ask.

The quality of your handoffs is the ceiling on the quality of your distributed on-call. Everything else is downstream of that.
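A written handoff lends itself to a structured record that can be checked before a shift closes. Below is a minimal sketch in Python; the field names are illustrative assumptions, not a prescribed schema — adapt them to whatever template your team settles on:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch of a written handoff record. Field names are
# assumptions, not a prescribed schema.
@dataclass
class Handoff:
    incident_id: str
    summary: str = ""                        # what is broken, in one sentence
    actions_taken: List[str] = field(default_factory=list)
    next_steps: List[str] = field(default_factory=list)
    dashboards: List[str] = field(default_factory=list)  # links, not tribal knowledge

    def missing_items(self) -> List[str]:
        """Return the parts a complete handoff still lacks, so gaps are
        caught at handoff time rather than at three in the morning."""
        checks = {
            "summary": bool(self.summary),
            "actions_taken": bool(self.actions_taken),
            "next_steps": bool(self.next_steps),
            "dashboards": bool(self.dashboards),
        }
        return [name for name, present in checks.items() if not present]
```

A rotation tool or a pre-handoff checklist bot could refuse to close a shift while `missing_items()` is non-empty, which makes the discipline automatic rather than aspirational.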

Runbooks that engineers will actually use

Every on-call setup ends up with runbooks. Most runbooks end up being either too thin to be useful or too thick to be read at three in the morning. Designing runbooks for distributed on-call is partly a writing exercise and partly an organisational one.

The structural rules we apply

A runbook that gets used has three properties. It is searchable — somebody under pressure can find it from a single keyword. It is actionable — every step is something a tired engineer can execute without judgement calls. And it is maintained — the runbook reflects how the system actually works today, not how it worked six months ago.

The third property is the hardest. Runbooks rot. The discipline we apply on our larger engineering engagements is to require a runbook update as part of the post-incident review for any incident where the existing runbook was incorrect, missing, or incomplete. This sounds bureaucratic; in practice it is the difference between runbooks that are trusted and runbooks that are ignored.

Avoiding the 3am hero culture

The most insidious failure mode in any on-call setup is the emergence of a small number of engineers who quietly carry more than their share. They respond fastest, they know the systems deepest, they get paged first because they are the ones who will fix it. Within months, they are burned out, resentful, or both. Within a year, they leave.

This pattern is even more dangerous on distributed teams, because the imbalance is harder to see. The engineer in Durgapur who is taking incidents at all hours because nobody else is awake. The engineer in Dubai who is the only one who knows the payments system. The engineer in Issaquah who has been the de facto incident commander for six months because nobody else has stepped up.

The countermeasures are not glamorous. They are:

  1. Track paging load per engineer, and review the numbers in the open.
  2. Rotate the incident commander role deliberately, instead of letting it default to whoever steps up.
  3. Spread system knowledge by pairing the resident expert with someone who is not, until a second responder is credible.
  4. Rebalance rotations when the data shows one person carrying more than their share.
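Tracking paging load is the countermeasure that is easiest to automate. A minimal sketch, in which the two-times-average threshold is an arbitrary assumption rather than a recommendation:

```python
from collections import Counter
from typing import Dict, List

def overloaded_responders(pages: List[str], factor: float = 2.0) -> List[str]:
    """Flag engineers handling more than `factor` times the average page
    count -- a rough signal that a silent hero is emerging. `pages` holds
    one engineer name per page handled in the review period."""
    counts: Dict[str, int] = Counter(pages)
    if not counts:
        return []
    average = len(pages) / len(counts)
    return sorted(name for name, n in counts.items() if n > factor * average)
```

Running something like this over a quarter of paging data turns "I feel like I'm carrying the team" into a number the whole team can see and act on.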

The technical foundations that make on-call survivable

On top of the cultural and process disciplines, there is a technical foundation that makes distributed on-call survivable rather than gruelling. The short version is: invest heavily in observability, alerting, and automated remediation.

Specifically:

  1. Observability good enough that a responder in any region can answer "what changed?" without waking the system's owner.
  2. Alerting hygiene: every page is actionable, deduplicated, and linked to a runbook.
  3. Automated remediation for recurring failures, so the pager fires only for problems that genuinely need a human.
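One concrete piece of alerting hygiene is suppressing duplicate pages for an alert that has already fired recently. A minimal sketch, assuming a 30-minute suppression window and in-memory bookkeeping (both illustrative choices):

```python
from datetime import datetime, timedelta
from typing import Dict

def should_page(alert_key: str, now: datetime,
                recent: Dict[str, datetime],
                window: timedelta = timedelta(minutes=30)) -> bool:
    """Page only if this alert has not already paged within the window.
    `recent` maps alert keys to the time they last paged and is mutated
    in place."""
    last = recent.get(alert_key)
    if last is not None and now - last < window:
        return False  # duplicate: suppress, do not wake anyone again
    recent[alert_key] = now
    return True
```

Most alerting platforms offer deduplication natively; the sketch is only meant to make the behaviour you should be configuring explicit.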

The setup we would design today

If we were designing a distributed on-call setup from scratch today, it would look like this:

  1. Primary on-call follows the regional workday. Each region has a primary rotation covering its own business hours, with at least four engineers in the rotation so no one is on more than one week in four.
  2. Out-of-hours coverage is by exception. Pages outside business hours go first to the primary in the next active region, then escalate to the original primary only if no one responds within a defined window.
  3. Handoffs are written, brief, and consistent. Same format every time. Same place every time. Nobody should be paged to ask "what is going on?"
  4. Runbooks are kept honest by tying their maintenance to incident reviews.
  5. Workload is tracked and rebalanced quarterly. No silent heroes.
  6. The on-call experience is treated as a product. Engineers should be exposed to it for long enough to know it, but not so long that it wears them down.
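Point 2 above hinges on knowing which region is currently inside business hours. A minimal sketch, assuming fixed UTC offsets and a 09:00–18:00 workday; a real implementation should use a proper time-zone database and per-region calendars:

```python
from datetime import datetime, timezone
from typing import List

# Assumed fixed UTC offsets in hours. A real implementation would use
# zoneinfo and handle DST (Issaquah observes it, for example).
REGION_UTC_OFFSETS = {"Dubai": 4.0, "Durgapur": 5.5, "Issaquah": -8.0}

def regions_in_business_hours(now_utc: datetime,
                              start: float = 9.0,
                              end: float = 18.0) -> List[str]:
    """Return the regions whose local time currently falls inside the
    workday, sorted by name."""
    active = []
    for region, offset in REGION_UTC_OFFSETS.items():
        local_hour = (now_utc.hour + now_utc.minute / 60 + offset) % 24
        if start <= local_hour < end:
            active.append(region)
    return sorted(active)
```

An out-of-hours page would then route to the first responder in whichever region this returns, and escalate back to the sleeping primary only if no one responds within the defined window.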

None of this is revolutionary. All of it is uncomfortable to implement, because it requires sustained organisational discipline. But the alternative is a steady erosion of your best engineers, and that is the most expensive price any team can pay.

Work with us

Have a project that needs senior engineering attention?

We work with founders and enterprise teams across Dubai, the US, and India. If something here resonates with what you're building, we'd be glad to talk.

Start a conversation →