Best Practice · 18 March 2026 · 12 min read

On-Call Best Practices for Engineering Teams

What Separates a Good On-Call Programme from a Miserable One

On-call is one of the most contentious aspects of engineering culture. In poorly run organisations, it is a source of burnout, resentment, and attrition. Engineers receive dozens of low-quality pages per week, many of which resolve themselves or require no action. Sleep is disrupted regularly. The same incidents recur because there is no time to fix root causes between the next page and the next. Eventually, good engineers leave.

In well-run organisations, on-call is manageable and occasionally interesting. Engineers are paged rarely, and when they are, it is for genuine problems that require their expertise. Runbooks exist for common failure modes. Escalation paths are clear. Post-incident reviews actually result in improvements. Being on-call is a rotation that carries some burden but is not a source of dread.

The difference between these two states is not company size, not budget, and not the sophistication of the monitoring tooling. It comes down to process discipline: the commitment to treating every unnecessary page as a bug to be fixed, and every incident as a learning opportunity. An effective website monitoring setup is the foundation, but the human systems built around it are what determine whether engineers are protected or overwhelmed.

This guide covers the core components of a sustainable on-call programme: rotation design, runbook creation, escalation policy, metrics, and post-incident review. Implement these practices and you will dramatically reduce the burden on your team while improving your organisation's ability to detect and respond to genuine incidents.

Designing On-Call Rotations That Engineers Actually Tolerate

Rotation design is the most visible aspect of on-call health, and also the easiest to get wrong. The most common mistake is treating on-call as something that happens automatically — whoever is available gets paged — rather than as a formal responsibility that requires explicit design.

Rotation Frequency and Team Size

A sustainable rotation ensures that no engineer carries on-call burden continuously. As a rule of thumb, no individual should serve as primary on-call more than one week in five. This implies a minimum team size of five engineers for a healthy primary rotation. Below that, either the frequency becomes unsustainable or you need to explicitly accept the trade-off and compensate accordingly.

For smaller teams, consider tiered rotations:

  • Primary on-call — the first responder to all alerts. Handles triage and initial response.
  • Secondary on-call — backup if the primary does not acknowledge within five minutes. Also available for consultation on complex incidents.
  • Manager escalation — triggered automatically after 15 minutes of no acknowledgement. Not expected to debug — responsible for coordination and communication.

Primary and secondary should rotate independently so the same two engineers are not paired every cycle. This prevents knowledge concentration and ensures everyone understands the full system over time.
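One way to rotate primary and secondary independently is to advance them through the roster at different strides, bumping the secondary on collisions so the two roles never coincide. A minimal sketch, assuming a hypothetical five-person roster and an arbitrary epoch date:

```python
from datetime import date, timedelta

ENGINEERS = ["alice", "bob", "carol", "dan", "erin"]  # hypothetical roster

def on_call_pair(week_start: date, roster: list[str],
                 epoch: date = date(2026, 1, 5)) -> tuple[str, str]:
    """Return (primary, secondary) for the week beginning at week_start.

    Primary advances one step per week; secondary advances two steps per
    week, so the pairings shift from cycle to cycle instead of locking
    the same two engineers together.
    """
    week = (week_start - epoch).days // 7
    n = len(roster)
    primary_i = week % n
    secondary_i = (week * 2 + 1) % n
    if secondary_i == primary_i:          # never assign one person both roles
        secondary_i = (secondary_i + 1) % n
    return roster[primary_i], roster[secondary_i]
```

Any deterministic scheme with the same two properties (distinct roles, varying pairings) would serve; the point is that the schedule is explicit and computable, not improvised week by week.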

Defining On-Call Hours

A common and damaging assumption is that on-call means "reachable 24/7 indefinitely." Being technically on-call around the clock for weeks at a time degrades performance, health, and morale. Wherever your business constraints allow, define explicit on-call windows:

  • Follow-the-sun — if you have engineers across time zones, route alerts to the team in their working hours. This is the most sustainable approach and eliminates most middle-of-night pages.
  • Business hours + emergency only — for applications where downtime outside business hours has low business impact, restrict out-of-hours alerting to Tier 1 critical incidents only.
  • 24/7 with handoffs — for applications requiring constant coverage, define explicit shift handoffs with briefings, so the incoming on-call has context on any active issues.
PulseStack uptime monitoring dashboard showing site status, active alerts, and on-call engineer overview
A centralised monitoring dashboard gives the on-call engineer a single view of the current state of all monitored services
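At its core, follow-the-sun routing is a lookup from the alert's timestamp to whichever team is currently in working hours. A minimal sketch, with hypothetical regional team names and UTC shift boundaries:

```python
from datetime import datetime, timezone

# Hypothetical regional teams and their working windows in UTC hours.
SHIFTS = [
    ("apac",  0,  8),   # 00:00-08:00 UTC
    ("emea",  8, 16),   # 08:00-16:00 UTC
    ("amer", 16, 24),   # 16:00-24:00 UTC
]

def team_for(alert_time: datetime) -> str:
    """Route an alert to the regional team currently in working hours."""
    hour = alert_time.astimezone(timezone.utc).hour
    for team, start, end in SHIFTS:
        if start <= hour < end:
            return team
    raise ValueError("shift table does not cover this hour")
```

Real windows would account for local time zones, weekends, and holidays, but even this simple table eliminates the middle-of-night page for every team it covers.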

Compensation and Recognition

On-call is work. Engineers are accepting a constraint on their personal time and carrying cognitive load even when nothing is happening. This should be recognised explicitly. Common approaches include extra pay per on-call week, time off in lieu at a defined ratio, or on-call allowances. The specific form matters less than the consistency and transparency: engineers should know in advance what they receive for on-call duty and be able to rely on it.

Uncompensated on-call is a retention risk. When a competitor offers equivalent work without an on-call requirement, uncompensated on-call becomes a reason to leave. Budget for it.

Runbooks: Eliminating Guesswork During Incidents

A runbook is a documented procedure for responding to a known failure mode. When a monitor fires at 3 AM, the on-call engineer should not need to figure out from first principles what the alert means, what systems are affected, and how to investigate. A good runbook contains that context. It reduces the cognitive load on a sleep-deprived engineer and dramatically shortens mean time to resolution.

What Every Runbook Should Include

Effective runbooks are short, specific, and opinionated. They should tell the engineer what to do, not just what to look at. A useful structure:

  1. Alert name and meaning — what condition triggered this page, and why it matters
  2. Business impact — what is currently broken or at risk for users
  3. Immediate triage steps — the first three to five commands or checks to run, in order
  4. Common causes — the failure modes that cause this alert 80% of the time
  5. Remediation steps — specific commands, configuration changes, or escalation paths for each common cause
  6. Escalation criteria — when should the on-call escalate rather than continue debugging alone
  7. Links — to dashboards, logs, infrastructure diagrams, and related runbooks
PulseStack monitor configuration panel showing alert setup, runbook link field, and escalation policy configuration
Linking runbooks directly from alert configuration ensures engineers always have the right context at hand during an incident
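The seven-part structure above is easy to enforce with a shared skeleton, so every runbook starts complete rather than ad hoc. A minimal sketch (the template text and rendering helper are illustrative, not a prescribed format):

```python
# Skeleton mirroring the seven runbook sections listed above.
RUNBOOK_TEMPLATE = """\
# Runbook: {alert_name}

**Meaning:** {meaning}
**Business impact:** {impact}
**Last verified:** {last_verified}

## Immediate triage
{triage_steps}

## Common causes and remediation
{causes}

## Escalation criteria
{escalation}

## Links
{links}
"""

def render_runbook(**fields: str) -> str:
    """Fill the skeleton; raises KeyError if any section is missing,
    which keeps incomplete runbooks from being committed silently."""
    return RUNBOOK_TEMPLATE.format(**fields)
```

Because `str.format` fails on a missing key, a pre-commit check calling this function doubles as a completeness gate.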

Who Writes Runbooks

The best runbooks are written by the engineer who most recently debugged the problem in production. They have the freshest knowledge of what actually broke and what actually fixed it. Institutionalise this: after every incident, the engineer who resolved it is responsible for creating or updating the runbook for that alert before closing the ticket.

Do not let perfect be the enemy of good. A runbook that covers the most common case and contains three accurate triage commands is infinitely more valuable than a comprehensive document that nobody has time to write. Start with high-frequency or high-severity alerts first.

Keeping Runbooks Current

Stale runbooks are worse than no runbooks. An engineer who follows outdated instructions during an incident loses time and may make the situation worse. Treat runbooks like production code: they require maintenance as the system evolves.

Effective practices for runbook hygiene:

  • Store runbooks in version control alongside the code they describe
  • Add a "last verified" date to every runbook and flag any not verified in the past 90 days
  • Include runbook review in your quarterly operational audit
  • When a runbook is followed and does not resolve the issue, updating it is part of the incident follow-up

Link runbooks directly from your monitoring configuration. When the alert fires, the link should be in the notification. Removing the step of finding the runbook reduces response time and ensures the right procedure is used.
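The "last verified" rule above is simple to automate: scan each runbook for its verification date and flag anything missing or older than 90 days. A sketch, assuming runbooks carry a `Last verified: YYYY-MM-DD` line as in the hypothetical template:

```python
import re
from datetime import date

VERIFIED_RE = re.compile(r"Last verified:\*{0,2}\s*(\d{4})-(\d{2})-(\d{2})")

def stale_runbooks(runbooks: dict[str, str], today: date,
                   max_age_days: int = 90) -> list[str]:
    """Return names of runbooks whose 'Last verified' date is missing or
    older than max_age_days. `runbooks` maps runbook name -> file text."""
    flagged = []
    for name, text in runbooks.items():
        m = VERIFIED_RE.search(text)
        if not m:
            flagged.append(name)          # no date at all: treat as stale
            continue
        verified = date(*map(int, m.groups()))
        if (today - verified).days > max_age_days:
            flagged.append(name)
    return flagged
```

Run in CI or a scheduled job, the flagged list becomes the input to the quarterly operational audit rather than something anyone has to remember.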

Escalation Policies: From Alert to Resolution Without Bottlenecks

An escalation policy is the contractual backbone of your incident response. It defines who is responsible at each stage, what triggers the next stage, and how long each stage can run before automatic escalation kicks in. Without a documented escalation policy, incidents stall because it is unclear who should act next.

The Core Escalation Ladder

A standard escalation ladder for a web application looks like this:

  1. T+0: Primary on-call paged — alert fires, primary on-call receives phone notification and push alert
  2. T+5 min: Secondary on-call paged — if primary has not acknowledged, secondary is paged automatically
  3. T+15 min: Engineering manager alerted — if neither primary nor secondary has acknowledged, the manager is notified
  4. T+30 min: VP/CTO alerted — for extended critical incidents, executive escalation enables resource allocation and external communication decisions
PulseStack alert escalation flow showing automatic escalation from primary on-call through secondary on-call to engineering manager with time triggers
Automatic time-triggered escalation ensures that no critical alert is ever left unacknowledged

Escalation must be automatic, not manual. Expecting the primary to remember to escalate when they cannot respond — because they are asleep, in a dead zone, or handling another incident — creates a single point of failure. Configure your monitoring and incident management tools to handle escalation mechanically.
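The time-triggered ladder reduces to a small table plus a lookup that the incident tool evaluates on a timer. A sketch, with thresholds mirroring the ladder above (tier names are illustrative):

```python
# Minutes after the alert fires at which each tier is paged if no one
# has acknowledged. Mirrors the T+0 / T+5 / T+15 / T+30 ladder above.
LADDER = [(0, "primary"), (5, "secondary"), (15, "manager"), (30, "executive")]

def tiers_paged(minutes_unacked: float) -> list[str]:
    """Return every tier that should have been paged by now, in order.

    The caller re-evaluates this whenever the timer ticks; any tier in
    the result that has not yet been notified gets paged mechanically,
    with no human required to remember to escalate.
    """
    return [tier for threshold, tier in LADDER if minutes_unacked >= threshold]
```

Because the ladder is data, changing the policy (say, paging the secondary at T+3 for SEV1) is a one-line edit rather than a process change nobody follows.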

Severity Classification and Escalation Criteria

Not all incidents warrant the same escalation ladder. Define severity levels and escalation paths for each:

  • SEV1 (Critical) — full outage or critical path broken, immediate user impact. Full escalation ladder active. Update internal stakeholders every 30 minutes. External status page updated within 15 minutes of detection.
  • SEV2 (Major) — significant degradation or partial outage. Primary on-call handles. Escalate to secondary if not resolved in 30 minutes. Update internal stakeholders.
  • SEV3 (Minor) — limited impact, workarounds available. Primary on-call handles during business hours. Log and monitor. No after-hours escalation unless escalating to SEV2.

Publish these severity definitions internally. When an incident occurs, the first step is classification. The classification determines who is informed, how frequently they are updated, and what level of disruption is acceptable to fix the issue. Clear severity definitions reduce the number of "is this a big deal?" conversations during incidents when clarity matters most.

Communication During Incidents

Escalation policy includes communication obligations, not just notification chains. For SEV1 incidents:

  • An incident commander (often the engineering manager) takes ownership of coordination and communication, freeing the on-call engineer to focus on diagnosis
  • A dedicated incident channel is created in Slack or Teams with a consistent naming convention (e.g., #inc-2026-03-18)
  • The public status page is updated within a defined SLA — users should not find out about outages from social media
  • A brief internal update is posted every 30 minutes until resolution, even if it is just "still investigating, no resolution yet"

Silence during an incident is more alarming than honest "still working on it" updates. Both internal teams and external customers handle uncertainty better when they receive regular, honest communication.

On-Call Metrics: Measuring Programme Health

Intuition about on-call health is unreliable. Engineers who have normalised a bad experience often underestimate how much burden they are carrying. Metrics provide an objective baseline and reveal trends before they become crises.

The Essential On-Call Metrics

  • Pages per on-call shift — the single most important metric. A healthy target is fewer than five actionable pages per week per on-call engineer. If the number is consistently higher, the excess pages are a backlog of engineering work to be addressed, not accepted.
  • Out-of-hours pages — how many pages are arriving outside working hours? Track separately from total pages. Sustained out-of-hours interruptions have a disproportionate impact on health and morale.
  • Mean time to acknowledge (MTTA) — how long from alert firing to acknowledgement? Rising MTTA is an early warning sign of alert fatigue — engineers are becoming slower to respond, often because they have learned that most alerts resolve themselves.
  • Mean time to resolve (MTTR) — from acknowledgement to incident closure. Track by severity and by service. Consistently high MTTR on specific services points to runbook gaps or knowledge concentration.
  • Actionable alert rate — what percentage of pages required engineer intervention? Anything below 80% actionable is noise that should be eliminated. See the alert fatigue guide for strategies to improve this ratio.
  • Escalation rate — how often does the primary on-call escalate? High escalation rates may indicate that the primary is under-equipped — missing runbooks, insufficient system knowledge, or overly complex infrastructure.
  • Repeat incidents — the percentage of incidents caused by the same root cause as a previous incident. High repeat rate means post-incident actions are not being completed or are ineffective.
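Most of these metrics fall out of a simple per-alert log of timestamps. A sketch computing MTTA, MTTR, and the actionable-alert rate for a shift (the `Alert` record is illustrative; real data would come from your alerting tool's export):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Alert:
    fired: datetime
    acked: datetime
    resolved: datetime
    actionable: bool   # did an engineer actually have to intervene?

def on_call_metrics(alerts: list[Alert]) -> dict[str, float]:
    """Compute MTTA (fired -> acked), MTTR (acked -> resolved), and the
    actionable-alert rate over one shift's alerts."""
    n = len(alerts)
    mtta = sum((a.acked - a.fired).total_seconds() for a in alerts) / n
    mttr = sum((a.resolved - a.acked).total_seconds() for a in alerts) / n
    actionable = sum(a.actionable for a in alerts) / n
    return {"mtta_s": mtta, "mttr_s": mttr, "actionable_rate": actionable}
```

Computed weekly and plotted over months, these three numbers surface alert fatigue (rising MTTA), runbook gaps (high MTTR), and noise (actionable rate below 0.8) before anyone has to complain.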

Monthly On-Call Health Review

Once per month, review these metrics with the team. Present the trends, not just the current values. A single bad week is noise. A three-month trend of rising page volume or falling actionable rates is a system problem that needs structural attention.

The review should result in a short list of concrete actions: specific monitors to tune, specific runbooks to write, specific root causes to address. These actions should be tracked as engineering work with owners and deadlines, not filed in a document that nobody reads. On-call health is a product feature — it deserves space on the roadmap alongside user-facing improvements.

Post-Incident Reviews: Learning From Every Page

The post-incident review (also called a post-mortem or incident retrospective) is the mechanism by which on-call experience translates into system improvement. Without it, incidents are just interruptions. With it, they are inputs to a continuous improvement process that makes future incidents less frequent and less severe.

What a Good Post-Incident Review Looks Like

A post-incident review is not a blame session. The blameless post-mortem principle — borrowed from aviation and healthcare — holds that incidents are caused by system conditions, not individual failures. An engineer who makes a mistake under pressure, with incomplete information, using ambiguous tooling, is behaving normally. The question is why the system allowed the conditions that made the mistake possible, not how to punish the individual.

A productive post-incident review covers:

  1. Timeline — a factual, minute-by-minute account of what happened. When did the incident start? When was it detected? What actions were taken and when? When was it resolved?
  2. Root cause analysis — the underlying system condition that made the incident possible. Use the "five whys" method: keep asking why until you reach a root cause that is actionable.
  3. Contributing factors — conditions that made the incident worse, harder to detect, or slower to resolve. These often reveal the most valuable improvement opportunities.
  4. What went well — detection mechanisms that worked, runbooks that were accurate, team communication that was effective. These practices should be reinforced.
  5. Action items — specific, owned, time-bound work items to address root causes and contributing factors. Every action item needs an owner and a due date or it will not happen.

When to Run a Review

Run a post-incident review for every SEV1 incident and any SEV2 incident with notable learning value. Aim to hold the review within 48-72 hours of resolution while memories are fresh. For SEV1 incidents, share a written summary with stakeholders within 48 hours.

Do not let the perfect structure prevent you from reviewing. A 30-minute conversation that produces three actionable improvement items is far more valuable than a polished document that takes two weeks to produce and gets filed away unread.

Closing the Loop on Action Items

The most common failure mode in post-incident reviews is excellent identification of root causes and action items followed by no follow-through. Engineers return to normal feature work, the action items sit in a backlog, and two months later the same incident recurs.

Prevent this by treating post-incident action items as technical debt that must be scheduled. Designate one person (typically the engineering manager) as the owner of the post-incident action item backlog. Review it in every sprint planning. Make "post-incident items completed on time" a team health metric. The goal is to make the repeat incident rate — the same root cause causing a second incident — as close to zero as possible.
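Tracking the action-item backlog does not require heavy process; two small queries cover the sprint-planning review and the health metric described above. A sketch (the `ActionItem` fields are illustrative):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    title: str
    owner: str     # every item needs an owner or it will not happen
    due: date      # ...and a due date
    done: bool = False

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items past their due date: the list to review in every
    sprint planning."""
    return [i for i in items if not i.done and i.due < today]

def completion_rate(items: list[ActionItem]) -> float:
    """Share of post-incident action items completed: a team health
    metric that predicts the repeat-incident rate."""
    return sum(i.done for i in items) / len(items) if items else 1.0
```

Wiring these two views into the issue tracker, as described below, keeps the loop closed without anyone maintaining a separate document.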

Connecting your incident management workflow to your issue tracker automates this loop: action items created in an incident become tickets that appear in the sprint, rather than living only in a post-mortem document.

Tooling and Automation for Modern On-Call Teams

Good process reduces on-call burden; good tooling amplifies that reduction. The right combination of monitoring, alerting, and incident management tools removes manual friction from every step of the on-call workflow — from alert delivery to escalation to post-incident documentation.

The Minimal Effective On-Call Stack

You do not need a complex tool ecosystem to run effective on-call. The essential components are:

  • Monitoring with multi-location verification — a website monitoring platform that checks from multiple geographic locations, so a single-location failure does not page the on-call for a network blip. Multi-location confirmation dramatically reduces false positive rates.
  • Intelligent alerting with consecutive failure thresholds — alerts should require two or more consecutive failures before firing. Single-failure alerts are the primary driver of unnecessary pages.
  • Automated escalation — your alerting or incident management tool should handle escalation automatically. Manual escalation is a single point of failure.
  • Incident channel creation — automatically create a Slack or Teams channel for each incident. Having a persistent conversation record is invaluable during and after the incident.
  • Runbook linking from alerts — alerts should contain a direct link to the relevant runbook. Removing the search step reduces response time and ensures the right procedure is used.
  • Post-incident report templates — standardise the format so reviews happen faster and are easier to compare over time.
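The consecutive-failure threshold deserves spelling out, since it is the single highest-leverage noise reduction in the stack above. A sketch of the debounce behaviour (monitoring platforms implement an equivalent internally; this illustrates the logic, not any particular product's API):

```python
class FailureDebouncer:
    """Fire an alert only after `threshold` consecutive failed checks.

    A single failed check — often a transient network blip — never pages
    anyone; any successful check resets the streak, so only sustained
    failures reach the on-call engineer.
    """

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.failures = 0

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True iff the alert fires now."""
        if check_ok:
            self.failures = 0
            return False
        self.failures += 1
        return self.failures == self.threshold  # fire once per streak
```

Note the `==` rather than `>=`: the alert fires exactly once when the streak crosses the threshold, rather than re-paging on every subsequent failed check.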

Automation Opportunities

As your on-call programme matures, look for repetitive tasks in your incident response that can be automated:

  • Auto-remediation for known failure modes — if a specific alert consistently resolves by restarting a service, automate the restart and page only if the restart fails. This turns a human-intervention incident into a logged auto-remediation event.
  • Maintenance window scheduling — integrate deployment pipelines with your monitoring tool to automatically suppress alerts during planned deployments. See the alert fatigue guide for implementation details.
  • Status page updates — automatically update your public status page when monitors enter a failing state, without requiring manual engineer intervention during the incident.
  • On-call handoff automation — generate a daily on-call briefing from your monitoring dashboard showing the current state of all monitors, any active incidents, and recent changes. This replaces the verbal handoff that is often skipped when engineers are busy.
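The restart-then-page pattern from the first bullet above can be sketched in a few lines. The orchestrator and paging integrations here are stand-ins passed in as callables, which also keeps the policy testable:

```python
import logging

log = logging.getLogger("auto_remediation")

def handle_known_failure(restart_service, page_on_call, service: str) -> str:
    """Auto-remediate a known failure mode: attempt a restart first, and
    page a human only if the restart does not bring the service back.

    `restart_service(service) -> bool` and `page_on_call(message)` are
    injected callables; in production they would wrap your orchestrator
    and paging tool (hypothetical integrations, not a specific API).
    """
    if restart_service(service):
        log.info("auto-remediated %s by restart; no page sent", service)
        return "auto-remediated"
    page_on_call(f"{service}: restart failed, manual intervention needed")
    return "paged"
```

Every successful auto-remediation still leaves a log line, so the underlying failure mode stays visible in the monthly health review instead of silently recurring.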

Avoiding Tool Sprawl

On-call tooling tends to accumulate. Teams start with a monitoring tool, add an incident management tool, add a status page tool, add a runbook tool, and end up with four separate systems that do not integrate well. Context is lost between tools. Alert notifications do not link to runbooks. Incident timelines do not include monitoring data.

Consolidate where possible. Choose tools that integrate with each other and provide data in one place. An on-call engineer managing an active incident should not need to switch between four tabs to get the information they need. Reducing friction during incidents is a direct investment in response speed and engineer sanity.

Start with the right monitoring foundation on a plan that fits your team, build the human processes described in this guide, and automate incrementally as you identify the highest-friction steps in your workflow. The goal is an on-call programme that is sustainable, trustworthy, and continuously improving — one where engineers feel equipped to handle incidents rather than overwhelmed by them.

Start monitoring your infrastructure today

50 free monitors, no credit card needed. Set up in under 30 seconds.

Get started free