The On-Call Rotation That Breaks People

Every engineering organisation that runs on-call has a version of the same conversation. The rotation is too frequent. The alerts are too noisy. The incidents are too long. Engineers are waking up at 3am for things that should not have been pages. People are burning out. The team is losing good engineers who cite on-call as a reason for leaving.

The proposed solutions are almost always scheduling solutions. Increase the rotation size so each engineer is on call less frequently. Hire more people to spread the load. Pay a higher on-call stipend so the pain feels compensated. Move to a follow-the-sun model so nobody is woken up.

These solutions address the symptom while leaving the cause entirely intact. The problem is not that on-call shifts are too frequent. The problem is that the systems being operated are not operable, and the on-call rotation is where that fact becomes personally painful for engineers.

The rotation is not the problem. The rotation is the measurement instrument. It is telling you something about your systems that your dashboards are not.

What on-call actually measures

When an engineer is paged, something in the system required human attention. The page is a signal that the system could not handle a situation automatically and that the situation was significant enough to warrant waking someone up.

The frequency of pages is therefore a direct measurement of how often your systems require human intervention outside of working hours. A team that is paged eight times per on-call shift is operating systems that require human attention eight times per shift. A team that is paged once per week is operating systems that almost entirely handle themselves.

This framing matters because it locates the problem correctly. The question is not “how do we make on-call less painful for engineers.” The question is “why do our systems require this much human attention, and what would it take to reduce that.”

Teams that have answered the second question correctly have on-call rotations that are genuinely light. The rotation exists as a safety net, not as a substitute for operational maturity. Engineers are on call and almost never paged. When they are paged, it is for something genuinely significant. The experience is not pleasant but it is tolerable, because it is rare.

Teams that have not answered the second question keep rescheduling their on-call rotation while their systems continue to require the same amount of human intervention at the same inconvenient hours.

The alert that should not exist

The single highest-leverage improvement available to most on-call rotations is not adding more engineers to the rotation. It is eliminating the alerts that should never have been created.

Alerts exist on a spectrum. At one end are alerts that represent genuine emergencies requiring immediate human action: the payment service is down, the database is unreachable, customers cannot complete purchases. These alerts should exist. They should page someone. The response should be immediate.

At the other end are alerts that monitor system internals without a clear connection to user impact: a memory usage metric has crossed a threshold that someone set arbitrarily, a queue depth is higher than usual during a known traffic pattern, a cron job took longer than its expected duration but completed successfully. These alerts should not page anyone. They should not exist in their current form. They are noise that trains engineers to ignore their phones, which is the worst possible outcome.

Between these extremes is a large middle category: alerts that represent real conditions but not conditions that require immediate human action in the middle of the night. The batch job failed and needs to be rerun, but the data will not be stale until morning. The cache hit rate is lower than normal, but the system is functioning correctly and the impact is latency, not unavailability. A third-party API is responding slowly, but retries are handling it and no requests are failing.

Most on-call alert fatigue comes from this middle category. The alerts are not wrong exactly. They are paging at the wrong time for the wrong response.

# A framework for categorising alerts before they go into the rotation

from enum import Enum
from dataclasses import dataclass
from typing import Optional


class AlertUrgency(Enum):
    # Wake someone up immediately. Every minute costs users.
    PAGE_IMMEDIATELY = "page_immediately"

    # Important but can wait until morning. Log and notify via Slack.
    NOTIFY_ASYNC = "notify_async"

    # Track the trend. Not actionable yet. Goes to a dashboard.
    DASHBOARD_ONLY = "dashboard_only"

    # Should not exist. Indicates a monitoring configuration problem.
    DELETE_THIS_ALERT = "delete_this_alert"


@dataclass
class AlertClassification:
    alert_name: str
    urgency: AlertUrgency
    reasoning: str
    suggested_action: str
    user_impact: Optional[str] = None


def classify_alert(alert_name: str, conditions: dict) -> AlertClassification:
    """
    Ask these questions before creating or keeping any alert:

    1. Is a user experiencing a degraded or broken experience right now?
    2. Is there a human action that would improve the situation immediately?
    3. Does that action have to happen right now, or can it wait until morning?
    4. Has this alert fired in the last 30 days and led to a useful action?

    If the answer to (1) is no, the urgency is at most NOTIFY_ASYNC.
    If the answer to (2) is no, the alert should not page anyone.
    If the answer to (3) is "it can wait", the urgency is NOTIFY_ASYNC.
    If the answer to (4) is no, the alert should be reviewed for deletion.
    """

    user_impact = conditions.get("user_impact")
    immediately_actionable = conditions.get("immediately_actionable", False)
    can_wait_until_morning = conditions.get("can_wait_until_morning", False)
    fired_and_useful_last_30_days = conditions.get("fired_and_useful_last_30_days", True)

    if not fired_and_useful_last_30_days:
        return AlertClassification(
            alert_name=alert_name,
            urgency=AlertUrgency.DELETE_THIS_ALERT,
            reasoning="Alert has not led to useful action in 30 days",
            suggested_action="Review and delete or fundamentally redesign this alert",
        )

    if not user_impact:
        return AlertClassification(
            alert_name=alert_name,
            urgency=AlertUrgency.DASHBOARD_ONLY,
            reasoning="No direct user impact identified",
            suggested_action="Move to dashboard. Alert only when user impact is confirmed.",
        )

    if not immediately_actionable:
        return AlertClassification(
            alert_name=alert_name,
            urgency=AlertUrgency.NOTIFY_ASYNC,
            reasoning="User impact exists but no immediate human action available",
            suggested_action="Slack notification during business hours. Investigate root cause.",
            user_impact=user_impact,
        )

    if can_wait_until_morning:
        return AlertClassification(
            alert_name=alert_name,
            urgency=AlertUrgency.NOTIFY_ASYNC,
            reasoning="Actionable but impact is tolerable until morning",
            suggested_action="Slack notification. Respond at start of business day.",
            user_impact=user_impact,
        )

    return AlertClassification(
        alert_name=alert_name,
        urgency=AlertUrgency.PAGE_IMMEDIATELY,
        reasoning="User impact is active, human action required immediately",
        suggested_action="Page on-call. Respond immediately.",
        user_impact=user_impact,
    )

The exercise of applying this classification to every active alert in a production system is uncomfortable because it reveals how many alerts were created without clear answers to these questions. A rule of thumb from doing this exercise: roughly half of the alerts in most systems should be reclassified or deleted.
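
Question 4 in the classification above can only be answered from a record of what happened after each page. The following is a minimal sketch of that audit, assuming a hypothetical page log in which each entry records the alert name, when it fired, and whether the responder took a useful action. The names `PageRecord` and `alerts_to_review` are illustrative, and a real paging tool will expose different fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PageRecord:
    # Hypothetical log entry; adapt the fields to your paging tool's export.
    alert_name: str
    fired_at: datetime
    led_to_useful_action: bool


def alerts_to_review(
    all_alerts: list[str],
    log: list[PageRecord],
    now: datetime,
    window_days: int = 30,
) -> list[str]:
    """Return alerts with no useful firing inside the window, sorted by name."""
    cutoff = now - timedelta(days=window_days)
    useful = {
        record.alert_name
        for record in log
        if record.fired_at >= cutoff and record.led_to_useful_action
    }
    return sorted(name for name in all_alerts if name not in useful)
```

The output is a review list, not a deletion list. An alert for a rare but genuine emergency may not fire in any given 30-day window and should still exist.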

The incident that takes too long

Alert volume is one source of on-call pain. Incident duration is the other, and it is the one that causes more lasting damage because a long incident is not just exhausting in the moment. It impairs the engineer for the following day, accumulates over successive incidents, and produces the specific kind of burnout that comes from feeling unable to solve problems you are responsible for.

The causes of a long incident are mostly independent of the incident itself. They lie in decisions made beforehand, during the design and build of the system, that determine how observable the system is and how recoverable its failures are.

An engineer who is paged into an incident will resolve it quickly if they can see clearly what is wrong, have the tools to fix it, and are confident the fix will work. Technical skill and system knowledge are both necessary, but both are constrained by the quality of the observability and of the recovery tooling.

# The properties of a system that resolves incidents quickly.
# These are engineering decisions made before incidents happen.

class IncidentResolutionCapability:
    """
    Score each of these before your system goes to production.
    Low scores predict long incidents. Not occasionally. Reliably.
    """

    @staticmethod
    def assess(system_name: str) -> dict:
        return {
            "detection": {
                "question": "How long between problem start and first alert?",
                "green": "Under 2 minutes via automated monitoring",
                "yellow": "2-10 minutes, some manual discovery",
                "red": "Over 10 minutes, usually discovered by users",
            },
            "diagnosis": {
                "question": "How long to identify root cause once alerted?",
                "green": "Under 5 minutes with traces and structured logs",
                "yellow": "5-30 minutes with some manual log correlation",
                "red": "Over 30 minutes, requires reading unfamiliar code",
            },
            "blast_radius": {
                "question": "Can the failure be contained to a subset of users?",
                "green": "Feature flags allow immediate traffic reduction",
                "yellow": "Can route around failure with manual config changes",
                "red": "All users affected until full fix is deployed",
            },
            "recovery": {
                "question": "How long to restore service once cause is known?",
                "green": "Under 5 minutes via documented rollback or flag toggle",
                "yellow": "5-20 minutes with some manual steps",
                "red": "Over 20 minutes, requires deployment or data fix",
            },
            "confidence": {
                "question": "How confident is the responder that their fix will work?",
                "green": "High confidence, fix is reversible if wrong",
                "yellow": "Moderate confidence, some uncertainty about side effects",
                "red": "Low confidence, fix is irreversible or poorly understood",
            },
        }

A system with green ratings across all five dimensions typically resolves incidents in under fifteen minutes: the green targets for detection, diagnosis, and recovery add up to under twelve. A system with red ratings across multiple dimensions resolves incidents in hours and sometimes does not resolve them at all during the on-call shift, leaving the outage for the morning team.
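
Once a team has rated a system on the five dimensions, the ratings can be turned into a worklist. This is a minimal sketch, assuming the ratings arrive as a plain dictionary keyed by dimension name. `summarise_ratings` and the keys of its return value are illustrative names, not part of any existing tool.

```python
DIMENSIONS = ("detection", "diagnosis", "blast_radius", "recovery", "confidence")
VALID_RATINGS = {"green", "yellow", "red"}


def summarise_ratings(ratings: dict[str, str]) -> dict:
    """Group the five dimensions by rating so the red ones are fixed first."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"Missing ratings for: {', '.join(missing)}")
    invalid = {d: r for d, r in ratings.items() if r not in VALID_RATINGS}
    if invalid:
        raise ValueError(f"Invalid ratings: {invalid}")

    red = [d for d in DIMENSIONS if ratings[d] == "red"]
    yellow = [d for d in DIMENSIONS if ratings[d] == "yellow"]
    return {
        "all_green": not red and not yellow,
        "fix_first": red,
        "fix_next": yellow,
    }
```

Raising on a missing dimension is deliberate: an unrated dimension is more likely to be red than green.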

The engineer’s skill is real and it matters. It is not the bottleneck. The system’s observability and recoverability are the bottleneck, and they are set before the incident starts.

The knowledge problem that compounds

On-call rotations have a second function beyond incident response that most teams do not plan for: they are the primary mechanism by which operational knowledge is distributed across a team.

When the same two or three engineers handle every incident because they are the only ones who know the systems well enough to resolve them, the rotation has failed at this second function. The knowledge has not distributed. It has concentrated, and the rotation is being used to minimise the damage of that concentration rather than to correct it.

The engineers who hold the concentrated knowledge burn out because they handle every significant incident. The engineers who lack the knowledge experience the rotation as a source of anxiety rather than as legitimate shared responsibility. Neither group is well-served.
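
Concentration of this kind can be measured before it is argued about. This is a minimal sketch, assuming you can export one primary responder name per incident from your incident tracker. The function name and the default of two responders are illustrative choices.

```python
from collections import Counter


def responder_concentration(responders: list[str], top_n: int = 2) -> float:
    """Fraction of incidents handled by the top_n most frequent responders."""
    if not responders:
        return 0.0
    counts = Counter(responders)
    handled_by_top = sum(count for _, count in counts.most_common(top_n))
    return handled_by_top / len(responders)
```

In a rotation of eight engineers, a value near 1.0 for `top_n=2` means the rotation exists on paper while two people carry it in practice.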

The fix requires accepting something uncomfortable: the engineers with the knowledge need to take themselves partially out of the incident response loop, even when that means incidents take longer to resolve. The short-term cost is higher incident duration. The long-term benefit is a team that can collectively operate its own systems.

# Structured incident shadowing to distribute knowledge.
# Not an optional learning exercise. Part of the on-call rotation.

from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ShadowingSession:
    incident_id: str
    primary_responder: str     # The engineer who knows the system
    shadow_responder: str      # The engineer who is learning
    started_at: datetime
    system_affected: str
    learning_objectives: list[str]
    notes: str = ""
    completed_at: Optional[datetime] = None


# The protocol during a shadowed incident:
#
# 1. Primary responder handles the incident normally.
#    Shadow observes all actions and decisions.
#
# 2. Primary narrates what they are doing and why.
#    Not a lecture. Real-time commentary.
#    "I'm checking the queue depth first because when this service
#     degrades it's usually backed up, not erroring."
#
# 3. For any non-critical diagnostic step, primary asks shadow first.
#    "What would you check next?" Shadow answers. Primary responds.
#    This is not a test. It is a conversation.
#
# 4. For recovery actions, primary explains the options before choosing.
#    "We could restart the pod or roll back the deployment.
#     I'm restarting because the logs suggest it's a memory issue,
#     not a code issue. If restart doesn't work in 5 minutes, we roll back."
#
# 5. After resolution, 15-minute debrief.
#    What happened? Why? What would shadow do differently now?
#    What questions remain?
#
# 6. Shadow documents what they learned in the runbook.
#    Documentation written by the person who just learned something
#    is more useful than documentation written by the expert.
#    The expert does not remember what was confusing.

SHADOWING_SCHEDULE = {
    "week_1": "observe without touching anything",
    "week_2": "handle diagnostics with primary available",
    "week_3": "handle full incident with primary on standby",
    "week_4": "handle independently, document everything",
}

The runbook written by a shadow responder after their first solo incident is more useful than a runbook written by the expert engineer who has handled the incident twenty times. The expert has forgotten what was confusing. The new responder documents the things that were not obvious, which are the things the next new responder will also find not obvious.

The runbook that is not a runbook

A runbook that was written when the service was launched and has not been updated since is not a runbook. It is a historical document that describes a system that no longer exists in that form and a process that has been superseded by several years of learning.

The runbook that helps at 3am has specific properties.

It starts from the alert, not from a description of the system. An engineer who has been paged knows which alert fired. They do not need an explanation of what the service does. They need a path from the alert to a resolution, written by someone who has walked that path before.

# Payment Service: High Error Rate Alert

## Start here
This alert fires when the 5-minute error rate on /api/payments exceeds 2%.

Before doing anything, check these two things:
1. Is Stripe having an incident? https://status.stripe.com
   If yes: this is external. See "Stripe Incident Response" below.
   If no: continue to diagnosis.

2. Was there a deployment in the last 2 hours?
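   `kubectl rollout history deployment/payment-service -n production`
   `kubectl get replicasets -n production --sort-by=.metadata.creationTimestamp`
   (rollout history lists revisions without timestamps; the AGE column of the second command shows when each was created)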
   If yes and error rate started after deployment: see "Rollback" below.
   If no recent deployment: continue to diagnosis.

## Diagnosis (5 minutes, in this order)

**Step 1: Identify which endpoint is failing**
Open the Grafana dashboard: https://grafana.internal/d/payment-errors
Look at "Error Rate by Endpoint". Is it all endpoints or specific ones?

- All endpoints failing: likely infrastructure issue. Go to Step 2a.
- Specific endpoint failing: likely code or dependency issue. Go to Step 2b.

**Step 2a: Infrastructure path**
Check database connectivity:
`kubectl exec -it deployment/payment-service -n production -- python -c "from app.db import engine; engine.connect(); print('DB OK')"`

If this fails: the database is the problem. Go to "Database Issues".
If this succeeds: check downstream services (Step 2b).

**Step 2b: Dependency path**
Check Stripe API connectivity:
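`kubectl exec deployment/payment-service -n production -- sh -c 'curl -s -o /dev/null -w "%{http_code}\n" https://api.stripe.com/v1/charges -u "$STRIPE_SECRET_KEY:"'`

The single quotes make `$STRIPE_SECRET_KEY` expand inside the pod rather than in your local shell.

Expected: 200. Anything else indicates a Stripe connectivity or credentials issue.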

## Recovery options (fastest first)

**Option 1: Restart the service (2 minutes, low risk)**
Use this when: logs show memory errors or connection pool exhaustion.
`kubectl rollout restart deployment/payment-service -n production`
Watch: `kubectl rollout status deployment/payment-service -n production`
Verify: error rate should drop within 2 minutes of pod restart.

**Option 2: Roll back deployment (5 minutes, low risk)**
Use this when: problem started after a deployment.
`./scripts/rollback.sh payment-service 1`
Verify: check deployment history confirms rollback, then watch error rate.

**Option 3: Enable maintenance mode (1 minute, user impact)**
Use this when: nothing else is working and impact is growing.
`kubectl set env deployment/payment-service MAINTENANCE_MODE=true -n production`
This returns 503 with a retry-after header. Users see an error but
payments are not silently failing. Use this to buy time for diagnosis.

## Escalate if:
- Error rate is above 20% and not improving after restart
- Database is unreachable
- You have been working on this for more than 20 minutes without progress

Escalate to: [name], [phone number]

This is a runbook. It starts from the alert. It has a decision tree that is specific to the real failure modes this service has. The recovery options are ordered by speed and risk. The escalation criteria are explicit. An engineer who has never seen this service can follow it.
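
One way to check that a decision tree is really explicit is to see whether it can be written down as code without inventing anything. This is a minimal sketch of the runbook's two opening checks and its escalation criteria. The function and parameter names are illustrative, and the returned strings are the runbook section names.

```python
def first_step(
    stripe_incident: bool,
    deployed_recently: bool,
    errors_began_after_deploy: bool,
) -> str:
    """Mirror the two 'Start here' checks, in the order the runbook lists them."""
    if stripe_incident:
        return "Stripe Incident Response"
    if deployed_recently and errors_began_after_deploy:
        return "Rollback"
    return "Diagnosis"


def should_escalate(
    error_rate: float,
    improved_after_restart: bool,
    database_reachable: bool,
    minutes_without_progress: int,
) -> bool:
    """Apply the three escalation criteria; any one of them is sufficient."""
    return (
        (error_rate > 0.20 and not improved_after_restart)
        or not database_reachable
        or minutes_without_progress > 20
    )
```

If a runbook's branches cannot be expressed this plainly, the ambiguity will surface at 3am instead.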

The runbook is also a living document. Every incident that reveals a gap in the runbook should result in an update. Not eventually. Before the incident post-mortem is closed.
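
The rule can be enforced rather than remembered. This is a minimal sketch of a gate on closing a post-mortem, assuming a hypothetical `PostMortem` record that lists the runbook gaps the incident exposed and the time of the last runbook update. Neither the record nor `can_close` corresponds to a real tool's API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PostMortem:
    # Hypothetical record; field names are assumptions for illustration.
    incident_id: str
    resolved_at: datetime
    runbook_gaps: list[str]
    runbook_updated_at: Optional[datetime] = None


def can_close(post_mortem: PostMortem) -> bool:
    """Allow closing only if every recorded gap was followed by a runbook update."""
    if not post_mortem.runbook_gaps:
        return True
    updated_at = post_mortem.runbook_updated_at
    return updated_at is not None and updated_at >= post_mortem.resolved_at
```

The check only confirms that an update happened after the incident, not that the update closed the gap. That part still needs a reviewer.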

The on-call tax

There is a way of thinking about on-call that the best teams have adopted and that most teams have not: on-call is a tax on technical debt.

Every decision to defer an improvement to observability, to skip writing a runbook, to accept a flaky alert rather than fix it, to build a system that is hard to roll back, accumulates as a tax that is paid by engineers during on-call shifts. The tax is paid in lost sleep, in long incidents, in the anxiety of being responsible for systems you do not fully understand.

This framing is useful because it makes the investment case for operational work concrete. Improving alert quality is not boring housekeeping. It is reducing the tax engineers pay on every on-call shift for the next three years. Writing a runbook is not optional documentation. It is buying down the incident duration that will otherwise be paid in hours of confusion at inconvenient times.

The investment in operational quality does not show up in velocity metrics. It does not appear in sprint burndown charts. It does not make the roadmap look more impressive. It appears in the on-call rotation becoming lighter over time, in engineers staying at the company instead of leaving, in incidents that resolve in fifteen minutes instead of three hours.

These outcomes are real and they compound. A team that treats operational quality as a first-class investment gets progressively better at operating its systems. A team that treats it as something that will be addressed after the next release gets progressively worse, because the systems grow and the debt accumulates and the rotation gets heavier even as headcount grows.

# Tracking on-call health over time.
# If you cannot measure it, you cannot improve it.

from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class OnCallShift:
    engineer: str
    start_time: datetime
    end_time: datetime
    pages_received: int
    incidents_handled: int
    sleep_interruptions: int     # Pages between 10pm and 7am local time
    total_incident_minutes: int
    false_positives: int         # Pages that required no action


@dataclass
class OnCallHealthReport:
    period_start: datetime
    period_end: datetime
    shifts: list[OnCallShift]

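    @property
    def mean_pages_per_shift(self) -> float:
        # statistics.mean raises StatisticsError on empty input.
        if not self.shifts:
            return 0.0
        return mean(s.pages_received for s in self.shifts)

    @property
    def mean_sleep_interruptions_per_shift(self) -> float:
        if not self.shifts:
            return 0.0
        return mean(s.sleep_interruptions for s in self.shifts)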

    @property
    def false_positive_rate(self) -> float:
        total = sum(s.pages_received for s in self.shifts)
        false = sum(s.false_positives for s in self.shifts)
        return false / total if total > 0 else 0.0

    @property
    def mean_incident_duration_minutes(self) -> float:
        incidents = sum(s.incidents_handled for s in self.shifts)
        minutes = sum(s.total_incident_minutes for s in self.shifts)
        return minutes / incidents if incidents > 0 else 0.0

    def is_healthy(self) -> bool:
        return (
            self.mean_pages_per_shift <= 2
            and self.mean_sleep_interruptions_per_shift <= 1
            and self.false_positive_rate <= 0.10
            and self.mean_incident_duration_minutes <= 30
        )

    def summary(self) -> str:
        status = "HEALTHY" if self.is_healthy() else "NEEDS ATTENTION"
        return (
            f"On-call health: {status}\n"
            f"Mean pages per shift: {self.mean_pages_per_shift:.1f} (target: <= 2)\n"
            f"Mean sleep interruptions: {self.mean_sleep_interruptions_per_shift:.1f} (target: <= 1)\n"
            f"False positive rate: {self.false_positive_rate:.0%} (target: <= 10%)\n"
            f"Mean incident duration: {self.mean_incident_duration_minutes:.0f} min (target: <= 30)\n"
        )

These thresholds are not arbitrary. A rotation with more than two pages per shift requires significant engineering attention to the alert configuration. A rotation with more than one sleep interruption per shift is burning engineers at a rate that is not sustainable. A false positive rate above ten percent is training engineers to distrust their alerts. An incident duration above thirty minutes suggests observability or runbook gaps that are solvable with engineering work.
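
The `sleep_interruptions` field is the one most likely to be filled in by guesswork, so it helps to derive it from page timestamps. This is a minimal sketch using the 10pm to 7am window from the shift record. It assumes the timestamps have already been converted to the responder's local time, and the function names are illustrative.

```python
from datetime import datetime, time

SLEEP_START = time(22, 0)  # 10pm local time
SLEEP_END = time(7, 0)     # 7am local time


def is_sleep_interruption(page_local_time: datetime) -> bool:
    """True if the page arrived at or after 10pm, or before 7am."""
    t = page_local_time.time()
    # The window wraps past midnight, so the two halves are joined with "or".
    return t >= SLEEP_START or t < SLEEP_END


def count_sleep_interruptions(pages_local_time: list[datetime]) -> int:
    """Count the pages in a shift that fell inside the sleep window."""
    return sum(is_sleep_interruption(p) for p in pages_local_time)
```

This counts pages, not wake-ups. Three pages within ten minutes are arguably one interruption, and a team may prefer to group them before counting.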

The rotation that earns trust

There is an on-call experience that is genuinely fine. Not good in the sense of fun, but fine in the sense of tolerable and fair. The shift is occasionally interrupted. The interruptions are for real things. The tools available make resolution straightforward. The runbooks are accurate. The experience is unpleasant in the way that any interruption is unpleasant and not more unpleasant than that.

Engineers who have experienced this kind of rotation understand on-call as part of the job rather than as a punishment. They accept the responsibility because the responsibility is reasonable and because the systems support them in carrying it.

Engineers who have experienced the other kind, the rotation with constant noise, with incidents that take hours, with systems nobody fully understands, with runbooks that describe configurations that no longer exist, understand on-call as a form of institutional neglect. The message the rotation sends is that the organisation does not value their time or their wellbeing enough to invest in making the systems operable.

That message is accurate. The rotation reflects the investment. The investment reflects the values.

Improving the rotation means improving the systems, which means making the investment. There is no shortcut through scheduling.

The pager is not the problem. Read what it is telling you.