The droid that runs on-call triage

On-call, without the 3am page.

A droid that wakes on every alert: it gathers the logs and the recent deploys, reproduces the failure on its own sandboxed machine, and files a clean issue with the root cause — paging a human only for the incidents that truly need one, and remembering every flake so the next alert is never a cold start.

< 2 min from alert to a triaged issue with root-cause notes, day or night
~70% of alerts handled or auto-resolved without waking anyone
0 alerts left unread overnight; every page acknowledged and worked

Does: Triages alerts, reproduces and files them
Where: SaaS platform team, ~12 engineers
Reaches people on: Slack, paging
Works inside: Sentry, GitHub, Linear
Runs: On every alert · 24/7

The situation

Most alerts aren't incidents — but you dig every time to find out.

Every alert pulled an engineer out of whatever they were doing. Most turned out to be a known flake or a bad deploy — but you couldn’t know that until you’d opened Sentry, read the stack trace, checked what shipped recently, and tried to reproduce it. The digging was the same every time; only the answer changed.

Nights and weekends were the worst of it. A page at 3am usually meant twenty minutes of context-gathering before you even knew whether it mattered, and the noisy alerts trained everyone to half-ignore the channel — which is exactly how the one that mattered slipped through.

How it works

How the droid took it on.

Rather than ask an engineer to triage every red notification, the team handed first response over. The moment an alert fires the droid gathers the context, reproduces the failure on its own machine, files it cleanly, and pages a person only when it’s real and urgent.

TASK #418 · On-call triage · standing
trigger: Any alert from monitoring (Sentry)
also: A shift-end digest to #on-call
scope: Every service alert · 24/7
runs as: A contained droid action per alert, on its own sandboxed machine
memory: Per-service notes (known flakes, recent incidents, owners)

Set up once, in plain language — “when an alert fires, gather everything, try to reproduce it, file it cleanly, and only page me if it's real and urgent.” The droid turned that into a standing job that owns first response — so the team stopped getting dragged out of flow by every red notification.
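For readers who think in code, the task card above can be pictured as a small record with a per-service memory attached. This is only an illustrative sketch: `StandingTask`, `ServiceNotes`, the `"db-timeout"` fingerprint and the `"reports-worker"` service key are hypothetical names chosen for the example, not the droid's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceNotes:
    """Per-service memory: owner, known flakes, recent incidents."""
    owner: str
    known_flakes: set[str] = field(default_factory=set)
    recent_incidents: list[str] = field(default_factory=list)


@dataclass
class StandingTask:
    """Hypothetical mirror of the task card's fields."""
    name: str
    trigger: str
    also: str
    scope: str
    memory: dict[str, ServiceNotes] = field(default_factory=dict)

    def remember_flake(self, service: str, fingerprint: str,
                       owner: str = "unassigned") -> None:
        # Create the service's notes on first sight, then record the flake.
        notes = self.memory.setdefault(service, ServiceNotes(owner=owner))
        notes.known_flakes.add(fingerprint)

    def is_known_flake(self, service: str, fingerprint: str) -> bool:
        notes = self.memory.get(service)
        return notes is not None and fingerprint in notes.known_flakes


task = StandingTask(
    name="On-call triage",
    trigger="Any alert from monitoring (Sentry)",
    also="A shift-end digest to #on-call",
    scope="Every service alert, 24/7",
)
task.remember_flake("reports-worker", "db-timeout")
```

Keeping the notes keyed by service is what lets a later alert start from what is already known about that service instead of from scratch.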

Every alert trips the same loop:
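As a rough picture of the routing step in that loop, here is a minimal sketch of the three outcomes the team described: page for a genuine, urgent incident, auto-resolve a known flake, and file everything else as an issue. The `Alert` fields, the `urgent` flag, the fingerprints and the order of the checks are assumptions made for illustration; the droid uses judgment rather than a fixed rule table.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    AUTO_RESOLVE = "auto-resolve"   # known flake or transient noise
    FILE_ISSUE = "file issue"       # real bug, filed and assigned
    PAGE_ONCALL = "page on-call"    # genuine, urgent incident


@dataclass
class Alert:
    service: str
    fingerprint: str   # hypothetical stable key for the error
    urgent: bool       # assumed flag, e.g. set for a sev-1


def triage(alert: Alert, known_flakes: dict[str, set[str]]) -> Outcome:
    """Route one alert. The check order is an assumption of this sketch."""
    if alert.urgent:
        return Outcome.PAGE_ONCALL
    if alert.fingerprint in known_flakes.get(alert.service, set()):
        return Outcome.AUTO_RESOLVE
    # Anything unfamiliar and non-urgent is written up for the owner.
    return Outcome.FILE_ISSUE
```

Checking urgency first means a familiar-looking error can never mask a platform-wide failure, which is the conservative choice for a first responder.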

Ongoing handling

How it ran, alert after alert.

Here’s an overnight shift as it actually unfolded — alert by alert, down to how it reached for each tool. Only one incident ever woke a human.

  1. 1:12am · alert · checkout
    • Sentry
    • GitHub
    • Linear

    Checkout 500s on the new release — gathered the logs and deploys, reproduced it on a clean box, and filed ENG-2231.

  2. 2:30am · flake · reports worker
    • Sentry

    A DB-timeout alert — recognized a known transient from memory and auto-resolved it. Nobody paged.

  3. 3:05am · sev-1 · payments · escalated
    • Sentry
    • Slack

    Payments failing platform-wide — ruled out a deploy, traced it to an upstream provider outage, and paged on-call with context.

  4. 6:20am · noise · duplicates
    • Sentry

    Three duplicate alerts on already-open issues — deduped them and kept the channel quiet.

  5. 8:40am · question · checkout
    • GitHub
    • Linear

    An engineer asked if it was the late-night deploy — confirmed the suspect commit, evidence already in the issue.

  6. 9:00am · shift digest

    The night reconciled — 6 alerts: 1 bug, 1 flake auto-resolved, 1 sev-1 paged, 3 noise. One issue open and assigned.
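The dedupe and digest steps from the shift above can be sketched in a few lines. This assumes each alert carries a fingerprint and that a repeat of an already-seen fingerprint counts as noise; the fingerprints, and which issues the three duplicates belonged to, are made up for the example.

```python
from collections import Counter


def label_alerts(alerts):
    """Relabel repeats of an already-seen fingerprint as 'noise'."""
    seen = set()
    labels = []
    for fingerprint, kind in alerts:
        if fingerprint in seen:
            labels.append("noise")
        else:
            seen.add(fingerprint)
            labels.append(kind)
    return labels


def digest(labels):
    """Summarize the shift as 'N alerts: count kind, ...'."""
    counts = Counter(labels)  # preserves first-seen order
    parts = ", ".join(f"{n} {kind}" for kind, n in counts.items())
    return f"{len(labels)} alerts: {parts}"


# Hypothetical fingerprints for the six alerts of the night.
night = [
    ("checkout-500", "bug"),
    ("reports-db-timeout", "flake"),
    ("payments-failing", "sev-1"),
    ("checkout-500", "bug"),
    ("checkout-500", "bug"),
    ("payments-failing", "sev-1"),
]
summary = digest(label_alerts(night))
print(summary)  # 6 alerts: 1 bug, 1 flake, 1 sev-1, 3 noise
```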

See it in action

One night shift, alert by alert.

Alerts and the occasional engineer question land on the left. Watch the droid pick up each one and work it end to end — across Sentry, GitHub and Linear, and on its own machine — paging a human only for the incident that warrants it.

We didn't replace on-call — we gave it a first responder. An alert fires and by the time anyone looks it's already reproduced, written up, and assigned, or it's been quietly resolved as a known flake. The only thing that reaches a person now is the thing that actually needs a person.
Daniel R., Eng Lead, SaaS platform team

An illustrative workflow based on real product mechanics. Tool names and behaviour reflect how a droid actually triggers on events, runs work on its own machine, and calls connected apps; figures are directional.

Try it with your droid

Run this workflow yourself.

Copy the brief below and paste it to your droid. It’ll walk you through the prerequisites, connect what it needs, and stand the workflow up with you.

Workflow brief
I lead a small platform team, and on-call is wearing us down. Every alert drags someone out of flow to do the same dig — open Sentry, read the trace, check what shipped, try to reproduce it — and most of the time it's a known flake or a bad deploy. Nights are the worst: a 3am page is usually twenty minutes of context-gathering before you even know if it matters, and the noisy alerts have trained everyone to half-ignore the channel.

Own first response to alerts. Apps/channels: Sentry (alerts, errors, releases), GitHub (recent deploys, commits, reverts), Linear (filing and assigning issues), Slack #on-call (triage notes and paging). Use your own sandboxed machine to actually reproduce failures.

Run whenever an alert fires, plus a shift-end digest to #on-call. For each alert:
1. Read the error, how often it's firing, and which release it's on.
2. Check what shipped recently in case a deploy is the cause.
3. Spin up a clean copy of the service on your own machine and try to reproduce it.
4. If it's a real bug, file an issue with the stack trace, repro steps, your root-cause hypothesis and the suspect commit, and assign it to the service owner.
5. Post a short triage note to #on-call so people know what's happening.

Use judgment on what reaches a human: auto-resolve known flakes and transient noise, file and assign real bugs, and page the on-call engineer directly — with full context — only for genuine, urgent incidents. Remember known flakes and recent incidents per service so each alert starts from what we already know, not from scratch.
