The droid that runs on-call triage

On-call, without the 3am page.

A droid that wakes on every alert: it gathers the logs and the recent deploys, reproduces the failure on its own sandboxed machine, and files a clean issue with the root cause — paging a human only for the incidents that truly need one, and remembering every flake so the next alert is never a cold start.

< 2 min

from alert to a triaged issue with root-cause notes — day or night

~70%

of alerts handled or auto-resolved without waking anyone

0 alerts

left unread overnight — every page acknowledged and worked

Does: Triages alerts, reproduces & files them
Where: SaaS platform team, ~12 engineers
Reaches people on: Slack, paging
Works inside: Sentry, GitHub, Linear
Runs: On every alert · 24/7

The situation

Most alerts aren't incidents — but you dig every time to find out.

Every alert pulled an engineer out of whatever they were doing. Most turned out to be a known flake or a bad deploy — but you couldn’t know that until you’d opened Sentry, read the stack trace, checked what shipped recently, and tried to reproduce it. The digging was the same every time; only the answer changed.

Nights and weekends were the worst of it. A page at 3am usually meant twenty minutes of context-gathering before you even knew whether it mattered, and the noisy alerts trained everyone to half-ignore the channel — which is exactly how the one that mattered slipped through.

[Illustration: isometric incident scene with a sandbox machine flanked by an error-spike monitor, a commit graph, an alert bell and an issue card]

How it works

How the droid took it on.

Rather than ask an engineer to triage every red notification, the team handed first response over to the droid. The moment an alert fires, the droid gathers the context, reproduces the failure on its own machine, files it cleanly, and pages a person only when it’s real and urgent.

TASK #418 · On-call triage · standing

trigger: Any alert from monitoring (Sentry)
also: A shift-end digest to #on-call
scope: Every service alert · 24/7
runs as: A contained droid action per alert — on its own sandboxed machine
memory: Per-service notes: known flakes, recent incidents, owners

Set up once, in plain language — “when an alert fires, gather everything, try to reproduce it, file it cleanly, and only page me if it's real and urgent.” The droid turned that into a standing job that owns first response — so the team stopped getting dragged out of flow by every red notification.

Every alert trips the same loop:

01 Alert fires: Sentry / monitoring
02 Droid wakes: On its own machine
03 Reads logs + deploys + memory: Error, release, past flakes
04 Reproduces it: Clean sandbox clone
05 Files the issue: Root cause + repro + owner
06 Real & urgent? Noise → auto-resolve · Bug → file & assign · Sev-1 → page on-call
07 Posts to #on-call: Triage note or page
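
Step 06 is the only branch in the loop, so it is worth sketching. The Python below is a minimal illustration, not the droid's actual logic: `Triage`, `Route` and `route` are hypothetical names, and the rule that anything neither urgent nor known noise gets filed is an assumption drawn from the three outcomes listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_RESOLVE = "auto-resolve"
    FILE_AND_ASSIGN = "file & assign"
    PAGE_ON_CALL = "page on-call"


@dataclass
class Triage:
    # Hypothetical fields standing in for what steps 03-05 gather.
    service: str
    known_flake: bool  # matched a per-service memory note
    duplicate: bool    # an open issue already covers this alert
    sev1: bool         # real and urgent, e.g. platform-wide impact


def route(t: Triage) -> Route:
    """Step 06: decide what, if anything, reaches a human."""
    if t.sev1:
        return Route.PAGE_ON_CALL
    if t.known_flake or t.duplicate:
        return Route.AUTO_RESOLVE
    # Assumption: everything else is treated as a bug to file and assign.
    return Route.FILE_AND_ASSIGN
```

Checking severity first means that, in this sketch, a service with a history of flakes can still page when the incident is real.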

Ongoing handling

How it ran, alert after alert.

Here’s an overnight shift as it actually unfolded — alert by alert, down to how it reached for each tool. Only one incident ever woke a human.

1:12am · alert · checkout
Checkout 500s on the new release — gathered the logs and deploys, reproduced it on a clean box, and filed ENG-2231.
2:30am · flake · reports worker
A DB-timeout alert — recognized a known transient from memory and auto-resolved it. Nobody paged.
3:05am · sev-1 · payments · escalated
Payments failing platform-wide — ruled out a deploy, traced it to an upstream provider outage, and paged on-call with context.
6:20am · noise · duplicates
Three duplicate alerts on already-open issues — deduped them and kept the channel quiet.
8:40am · question · checkout
An engineer asked if it was the late-night deploy — confirmed the suspect commit, evidence already in the issue.
9:00am · shift digest
The night reconciled — 6 alerts: 1 bug, 1 flake auto-resolved, 1 sev-1 paged, 3 noise. One issue open and assigned.
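
The digest's headline is a tally of the night's outcomes. As a rough sketch of that reconciliation (the `shift` list and `digest` function are hypothetical, and the outcome labels are simplified from the timeline above), it could look like this in Python:

```python
from collections import Counter

# Hypothetical record of the night's alerts as (time, outcome) pairs,
# mirroring the timeline above; the three duplicates arrive together.
shift = [
    ("1:12am", "bug"),
    ("2:30am", "flake"),
    ("3:05am", "sev-1"),
    ("6:20am", "noise"),
    ("6:20am", "noise"),
    ("6:20am", "noise"),
]


def digest(alerts):
    """Summarise a shift as a total plus a count per outcome."""
    # Counter keeps first-seen order, so outcomes appear chronologically.
    counts = Counter(outcome for _, outcome in alerts)
    parts = ", ".join(f"{n} {label}" for label, n in counts.items())
    return f"{len(alerts)} alerts: {parts}"


print(digest(shift))  # prints "6 alerts: 1 bug, 1 flake, 1 sev-1, 3 noise"
```

The per-outcome counts sum to the total, which is a quick check that no alert went unclassified.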

See it in action

One night shift, alert by alert.

Alerts and the occasional engineer question land on the left. Watch the droid pick up each one and work it end to end — across Sentry, GitHub and Linear, and on its own machine — paging a human only for the incident that warrants it.

Sentry
I read the error and the shape of it: a 500 on the checkout path, ~40 a minute, all on the release that went out tonight.
trigger: alert.fired → read(error, freq, release)

GitHub
I check what shipped recently and find a deploy 22 minutes ago that touched checkout validation. Prime suspect.
action: list(recent_deploys) → suspect(abc123)

Its own machine
I spin up a clean copy of the service on my own sandboxed machine and replay the checkout path. The 500 reproduces every time.
sandbox: clone + reproduce(checkout) = 500 ✓

Slack
Then I post a first triage note to #on-call: what's failing, how often, the likely cause, and that I'm filing it.
action: post(#on-call, triage)

Heads-up: checkout 500s started ~1:10am, ~40/min, all on tonight's release. Reproduced it on a clean box — looks like the deploy 22 min ago. Filing it now; no action needed yet.

We didn't replace on-call — we gave it a first responder. An alert fires and by the time anyone looks it's already reproduced, written up, and assigned, or it's been quietly resolved as a known flake. The only thing that reaches a person now is the thing that actually needs a person.

Daniel R., Eng Lead, SaaS platform team

An illustrative workflow based on real product mechanics. Tool names and behaviour reflect how a droid actually triggers on events, runs work on its own machine, and calls connected apps; figures are directional.

Try this workflow

Run this workflow yourself.

Copy the brief below and paste it into Unify. It’ll walk you through the prerequisites, connect what it needs, and stand the workflow up with you.

Workflow brief

I lead a small platform team, and on-call is wearing us down. Every alert drags someone out of flow to do the same dig — open Sentry, read the trace, check what shipped, try to reproduce it — and most of the time it's a known flake or a bad deploy. Nights are the worst: a 3am page is usually twenty minutes of context-gathering before you even know if it matters, and the noisy alerts have trained everyone to half-ignore the channel.

Own first response to alerts. Apps/channels: Sentry (alerts, errors, releases), GitHub (recent deploys, commits, reverts), Linear (filing and assigning issues), Slack #on-call (triage notes and paging). Use your own sandboxed machine to actually reproduce failures.

Run whenever an alert fires, plus a shift-end digest to #on-call. For each alert:
1. Read the error, how often it's firing, and which release it's on.
2. Check what shipped recently in case a deploy is the cause.
3. Spin up a clean copy of the service on your own machine and try to reproduce it.
4. If it's a real bug, file an issue with the stack trace, repro steps, your root-cause hypothesis and the suspect commit, and assign it to the service owner.
5. Post a short triage note to #on-call so people know what's happening.

Use judgment on what reaches a human: auto-resolve known flakes and transient noise, file and assign real bugs, and page the on-call engineer directly — with full context — only for genuine, urgent incidents. Remember known flakes and recent incidents per service so each alert starts from what we already know, not from scratch.

What would you take off your desk?

Tell us the job that never gets done before close. We'll wire it up on a call and you can watch it work.

Try now (free) · Book a walkthrough