The On-Call Rotation That Actually Works
The Problem Isn't Coverage — It's Clarity
Most small teams think their on-call problem is a headcount problem. Three engineers, seven days a week, and the math feels impossible. So they either hire, or they quietly accept that one person carries the pager forever.
Both responses miss the actual failure mode. The thing that burns people out and lets incidents slip isn't too few people. It's unclear escalation boundaries. Who owns what, when do you wake someone up, and what counts as "handled": when those questions have no written answers, every alert becomes a judgment call at 3 a.m. Judgment calls at 3 a.m. are slow, error-prone, and exhausting for the person making them.
A lightweight written protocol fixes this faster and cheaper than a new hire.
Why Small Teams Get On-Call Wrong
The common pattern: a team of three to six people splits the week. Maybe one person takes weekdays, another takes weekends. Maybe it rotates weekly. The schedule exists, but nothing else does.
Then an alert fires. The on-call person sees it but isn't sure if it's their domain. They check a dashboard. They ping someone in chat. That someone is asleep or at dinner. Thirty minutes pass. The on-call person guesses, applies a fix, and the incident either resolves or quietly gets worse. No one writes anything down.
Two months later, someone burns out and quits. The remaining team blames the rotation for being too heavy. But the rotation wasn't the problem. The missing piece was a shared, written understanding of what the on-call person is actually supposed to do.
What a Written Protocol Looks Like
This doesn't need to be a 40-page document. It needs to answer five questions, and it needs to be short enough that someone can re-read it in two minutes while an alert is firing.
1. What qualifies as a page vs. a notification?
Draw the line in writing. If customers can't authenticate, that's a page. If a background job queue is elevated but processing, that's a notification — check it in the morning. The specific thresholds matter less than the fact that they exist and everyone agrees on them.
2. Who owns the first 15 minutes?
The on-call person. Full stop. Their job during this window is triage: confirm the problem is real, assess severity, and decide whether to escalate. They're not expected to fix everything. They're expected to classify it.
3. When does escalation happen, and to whom?
Name the person. Literally write their name, or their role if you rotate. "If the on-call person cannot resolve or contain the issue within 15 minutes, they call [secondary]." No ambiguity. No "use your judgment." Judgment is what you're removing from the equation so tired people can act quickly.
4. What does "resolved" mean?
Does resolved mean the alert stopped firing? Does it mean a customer confirmed the fix? Does it mean a follow-up issue was filed? Define it. Otherwise incidents get half-closed and drift back.
5. When does the post-incident note get written?
Within 24 hours is a good default. Not a blameless postmortem with a formal template — just a short note: what happened, what we did, what we'd do differently. If this feels like overhead, consider that the alternative is repeating the same incident six weeks later because no one remembers the first one.
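To make the five answers concrete, here is a minimal sketch of how they might be written down as data plus two small decision rules. It is an illustration, not a recommended tool: the names `Protocol`, `classify`, `should_escalate`, and `note_due_by`, and the alert names used with them, are hypothetical. The 15-minute and 24-hour defaults come from the text above. Defining "resolved" is left to the team, so it is not encoded here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    PAGE = "page"      # wake someone up now
    NOTIFY = "notify"  # check it in the morning


@dataclass(frozen=True)
class Protocol:
    # Hypothetical alert names; each team writes its own list.
    page_alerts: frozenset
    triage_minutes: int = 15             # the on-call person owns this window
    secondary: str = "secondary on-call" # a name, or a role if you rotate
    note_due_hours: int = 24             # deadline for the post-incident note


def classify(protocol, alert_name):
    """Return PAGE for alerts on the written page list, NOTIFY otherwise."""
    return Action.PAGE if alert_name in protocol.page_alerts else Action.NOTIFY


def should_escalate(protocol, minutes_elapsed, contained):
    """Escalate once the triage window has passed without containment."""
    return not contained and minutes_elapsed >= protocol.triage_minutes


def note_due_by(protocol, resolved_at):
    """Return the deadline for the short post-incident note."""
    return resolved_at + timedelta(hours=protocol.note_due_hours)
```

The point of the sketch is that none of these functions asks for judgment: each answer is either on the page or it is not. A tired person at 3 a.m. should be able to apply the written version just as mechanically.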
Rotation Mechanics That Survive Contact With Reality
Once the protocol exists, the schedule itself becomes simpler to manage.
Weekly rotations beat daily ones. Switching context every day is disorienting. A full week lets the on-call person settle into the rhythm and build familiarity with what's currently noisy.
Overlap handoffs matter. A 30-minute window where the outgoing and incoming on-call person are both available — even just in a shared chat thread — prevents the "I didn't know about the thing from last night" gap.
Compensate the work, not the schedule. Being on-call and sleeping through the night is different from being on-call and handling three incidents. Track incident load per rotation and rebalance if one person consistently draws the hard weeks. Time off after a bad week is more valuable than a flat stipend.
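Tracking incident load does not need special software. The sketch below assumes each rotation is logged as a hypothetical `(engineer, incident_count)` pair and flags anyone whose total is well above the team average. The 1.5x threshold is an arbitrary assumption chosen for illustration, not a figure from any standard.

```python
from collections import Counter
from statistics import mean


def incident_load(rotations):
    """Sum incidents handled per engineer across (engineer, count) rotations."""
    load = Counter()
    for engineer, incidents in rotations:
        load[engineer] += incidents
    return load


def overloaded(load, factor=1.5):
    """Return engineers whose total exceeds `factor` times the team mean."""
    if not load:
        return []
    threshold = factor * mean(load.values())
    return sorted(name for name, total in load.items() if total > threshold)


# Hypothetical log of six weekly rotations.
log = [("ana", 1), ("ben", 7), ("chi", 0), ("ana", 1), ("ben", 6), ("chi", 1)]
print(overloaded(incident_load(log)))
```

A flagged name is a prompt for a conversation about time off or a swapped week, not an automatic verdict. Raw incident counts say nothing about how severe each incident was.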
The Compounding Effect
The real payoff of a written protocol isn't any single incident going better. It's that trust accumulates. When everyone knows the rules, people actually disconnect during their off weeks. They sleep better during their on weeks because they know what's expected and what isn't. They don't second-guess whether paging the secondary will be seen as weakness.
Paradoxically, the team handles more with fewer people — not because they work harder, but because they waste less time on ambiguity.
Start With the Document, Not the Tool
Teams often reach for alerting platforms, scheduling software, or escalation automation before they have a written protocol. The tools are fine. But a tool that automates an unclear process just delivers confusion faster.
Write the five answers first. Put them somewhere everyone can find in 30 seconds. Review them once a quarter. That's the on-call rotation that actually works.