At 3:47 PM on a Tuesday, your /api/payments endpoint starts returning 500 errors. Some requests succeed, some do not. The error rate climbs from 2% to 15% over 20 minutes. At 4:12 PM, a customer tweets about a failed purchase. At 4:18 PM, your support team sends a Slack message: "Are we having payment issues?"
Your team scrambles. Someone checks the logs. Someone restarts the service. Someone reverts the last deploy. Forty-five minutes later, the issue is resolved. It was a database connection pool exhaustion caused by a missing connection.release() in a rarely-hit code path.
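The failure mode in this story is easy to write by accident. Here is a sketch of how it happens, using a tiny stand-in `Pool` class (the real code would use something like `pg.Pool` from node-postgres): an early-return path that skips `release()`, and the `try/finally` shape that guarantees release on every path.

```typescript
// Minimal stand-in pool so the sketch is self-contained.
// A real app would use e.g. pg.Pool from node-postgres.
class Pool {
  private available = 2
  acquire() {
    if (this.available === 0) throw new Error('pool exhausted')
    this.available--
    return { release: () => { this.available++ } }
  }
}

// Buggy: the early-return path never releases the connection,
// so each invalid request permanently shrinks the pool.
function chargeBuggy(pool: Pool, amountCents: number): string {
  const conn = pool.acquire()
  if (amountCents <= 0) {
    return 'invalid amount' // leak: conn.release() never runs
  }
  conn.release()
  return 'charged'
}

// Fixed: try/finally guarantees release on every code path.
function chargeFixed(pool: Pool, amountCents: number): string {
  const conn = pool.acquire()
  try {
    if (amountCents <= 0) return 'invalid amount'
    return 'charged'
  } finally {
    conn.release()
  }
}
```

After a few requests hit the buggy early-return path, the pool is empty and every subsequent request fails, which is exactly the intermittent-then-escalating 500 pattern described above.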
This is not a rare scenario. It is how most small API teams experience incidents — reactively, chaotically, and with a Mean Time to Detect (MTTD) measured in customer complaints.
This guide outlines a 5-step incident response lifecycle adapted specifically for API and development teams. Not the generic NIST or ITIL frameworks designed for enterprise security teams with 24/7 NOCs. A practical framework for teams of 2-15 engineers shipping API-driven products.
The Two Metrics That Matter: MTTD and MTTR
Before the 5 steps, understand the two metrics that define your incident response effectiveness:
MTTD (Mean Time to Detect): The time between when an incident starts and when your team knows about it. In the example above, the payment errors started at 3:47 PM and the team learned about it at 4:18 PM. MTTD = 31 minutes.
MTTR (Mean Time to Resolve): The time from detection to resolution. The team learned at 4:18 PM and resolved at ~5:03 PM. MTTR = 45 minutes.
Total incident duration = MTTD + MTTR = 76 minutes.
Most teams focus on reducing MTTR — faster debugging, faster deploys, faster rollbacks. But MTTD is often the bigger problem. If your team finds out about outages from customer complaints or support tickets, your MTTD is already 30-60 minutes. That is 30-60 minutes of users hitting errors before anyone even looks at the problem.
Reducing MTTD from 30 minutes to 30 seconds has a bigger impact than reducing MTTR from 45 minutes to 30 minutes.
Step 1: Detect
Detection is the foundation. You cannot respond to what you do not know about. There are three levels of detection, from worst to best:
Level 1: Customer-reported (worst)
You learn about the outage from a support ticket, a tweet, or a Slack message from an angry user. MTTD: 15-60+ minutes. By the time you know, the damage is done.
Level 2: External monitoring
A tool like Pingdom or UptimeRobot pings your endpoint every 60 seconds and alerts when it is down. Better than customer reports, but limited: it checks one URL at a time, misses intermittent errors, and cannot detect performance degradation (a 200 response that took 8 seconds instead of 200ms).
Level 3: Internal monitoring (best)
Monitoring that runs inside your server and tracks every request. It sees every API route, every response code, every response time. It alerts on error rate spikes, latency degradation, and anomalies — not just "is the server reachable?" MTTD: seconds.
The difference between Level 2 and Level 3 is the difference between checking if the building is on fire from outside versus having smoke detectors in every room.
What to monitor for detection
- Error rate per endpoint: The percentage of requests returning 5xx. Alert when it exceeds your baseline (e.g., >1% over 5 minutes).
- Response time per endpoint: P95 latency. Alert when it degrades significantly (e.g., P95 doubles from baseline).
- Status code distribution: Sudden increase in 4xx can indicate a broken client, a bad deploy, or a changed API contract.
- Traffic volume: A sudden drop in traffic can indicate that clients are failing to reach your service entirely.
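The first two checks can be sketched in a few lines. This is a minimal illustration, not a production alerting engine; the window size, 1% error threshold, and 300ms baseline are illustrative assumptions from the examples above.

```typescript
interface Sample { status: number; durationMs: number; at: number }

const WINDOW_MS = 5 * 60 * 1000    // 5-minute sliding window (assumption)
const ERROR_RATE_THRESHOLD = 0.01  // alert above 1% 5xx
const P95_BASELINE_MS = 300        // healthy P95 (assumption)

function inWindow(samples: Sample[], now: number): Sample[] {
  return samples.filter(s => now - s.at <= WINDOW_MS)
}

function errorRate(samples: Sample[]): number {
  if (samples.length === 0) return 0
  return samples.filter(s => s.status >= 500).length / samples.length
}

function p95(samples: Sample[]): number {
  if (samples.length === 0) return 0
  const sorted = samples.map(s => s.durationMs).sort((a, b) => a - b)
  return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95))]
}

// Alert when the error rate exceeds threshold or P95 doubles from baseline.
function shouldAlert(samples: Sample[], now: number): boolean {
  const recent = inWindow(samples, now)
  return errorRate(recent) > ERROR_RATE_THRESHOLD ||
         p95(recent) > 2 * P95_BASELINE_MS
}
```

Run this per endpoint, not globally: a 100% failure rate on one low-traffic route disappears into the noise of an aggregate error rate.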
Step 2: Triage
Once you know something is wrong, the next question is: how bad is it?
Triage should take under 5 minutes. You are not debugging yet. You are assessing severity to decide how aggressively to respond.
Severity levels for API incidents
| Severity | Definition | Example | Response |
|---|---|---|---|
| SEV-1 (Critical) | Core functionality completely down | All payment endpoints returning 500 | All hands, immediate response |
| SEV-2 (Major) | Core functionality degraded | Payments succeeding but P95 at 8 seconds | Primary on-call, within 15 minutes |
| SEV-3 (Minor) | Non-core functionality affected | Analytics endpoint returning stale data | Next business day |
| SEV-4 (Low) | Cosmetic or edge case | Error message has a typo | Backlog |
Triage questions to answer quickly:
- Which endpoints are affected? One endpoint or many?
- What is the error rate? 1% or 100%?
- Is it getting worse? Stable error rate or climbing?
- Which users are affected? All users or a specific segment?
- When did it start? Did it correlate with a deploy, a config change, or a third-party event?
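The table and questions above can be collapsed into a first-pass classifier. The thresholds below are illustrative assumptions, and a human should always be able to override the result; the point is to make the severity decision fast and consistent rather than ad hoc.

```typescript
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4'

interface TriageInput {
  affectsCoreEndpoint: boolean  // e.g. payments, auth
  errorRate: number             // 0..1 across affected endpoints
  p95DegradationFactor: number  // current P95 / baseline P95
}

function classify(t: TriageInput): Severity {
  // Core functionality completely (or mostly) down.
  if (t.affectsCoreEndpoint && t.errorRate >= 0.5) return 'SEV-1'
  // Core functionality degraded: elevated errors or doubled latency.
  if (t.affectsCoreEndpoint && (t.errorRate > 0.01 || t.p95DegradationFactor >= 2)) return 'SEV-2'
  // Non-core functionality affected.
  if (t.errorRate > 0.01 || t.p95DegradationFactor >= 2) return 'SEV-3'
  return 'SEV-4'
}
```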
Step 3: Mitigate
Mitigation is about stopping the bleeding. Not fixing the root cause. Not debugging. Stopping the impact on users as fast as possible.
Common mitigation actions for API incidents:
- Roll back the last deploy. If the incident started after a deployment, rolling back is the fastest mitigation. Every team should be able to roll back in under 5 minutes.
- Toggle a feature flag. If the broken code is behind a feature flag, turn it off. This is why feature flags exist.
- Scale up resources. If the issue is resource exhaustion (connection pools, memory, CPU), add capacity while you debug.
- Enable a circuit breaker. If a third-party dependency is failing, trip the circuit breaker to return a cached response or graceful degradation instead of 500s.
- Rate limit or block. If the issue is caused by abusive traffic, apply rate limiting or block the offending IPs/keys.
- Redirect traffic. If one region is affected, redirect traffic to a healthy region.
The goal of mitigation is to reduce the impact while you work on the actual fix. A mitigated incident with a workaround in place buys you time to debug properly instead of panic-coding a fix.
Step 4: Resolve
Resolution is the actual fix. Once mitigation has stopped the bleeding, you can debug methodically:
- Correlate with changes. What changed before the incident started? Deploys, config changes, infrastructure updates, third-party service changes.
- Check the data. Look at error logs, monitoring dashboards, and traces. What do the failing requests have in common?
- Reproduce locally. If possible, reproduce the failure in a local or staging environment to validate your fix before deploying.
- Fix and deploy. Apply the fix, deploy to staging, verify, deploy to production.
- Verify resolution. Confirm that error rates return to baseline and affected endpoints are healthy. Do not just check once — monitor for 15-30 minutes to ensure stability.
- Remove mitigation. If you rolled back, re-deploy the fixed version. If you toggled a feature flag, turn it back on. If you scaled up, scale back down after confirming the fix works.
Step 5: Learn
The post-incident review (often called a "postmortem" or "retrospective") is the step most teams skip. It is also where the most long-term value lives.
Post-incident review template
Timeline:
- When did the incident start?
- When was it detected? (MTTD)
- When was it mitigated?
- When was it fully resolved? (MTTR)
Impact:
- How many users were affected?
- How many requests failed?
- Was there revenue impact?
Root cause:
- What was the technical root cause?
- What was the process root cause? (Why was this possible?)
What went well:
- What worked in the response?
- What tools or processes helped?
What went poorly:
- Where did the response break down?
- What information was missing?
Action items:
- Concrete actions with owners and due dates.
- Examples: "Add connection pool monitoring (Alice, by Friday)," "Add circuit breaker for payment service (Bob, next sprint)."
Critical rule: Post-incident reviews are blameless. The goal is to improve systems and processes, not to assign blame. If your culture punishes people for incidents, people will hide incidents. That makes everything worse.
Reducing MTTD: The Highest-Leverage Improvement
If you take one thing from this article, let it be this: invest in reducing MTTD. Everything else in the lifecycle — triage, mitigate, resolve, learn — depends on knowing there is a problem in the first place.
For API teams running Next.js applications, Nurbak Watch reduces MTTD to seconds. It monitors every API route from inside your server using the instrumentation.ts hook — five lines of code — and sends alerts via Slack, email, or WhatsApp in under 10 seconds when error rates spike or response times degrade.
```typescript
// instrumentation.ts
import { initWatch } from '@nurbak/watch'

export function register() {
  initWatch({
    apiKey: process.env.NURBAK_WATCH_KEY,
  })
}
```

$29/month flat, free during beta. The difference between finding out about an outage from a customer tweet at 4:18 PM and an automated alert at 3:47 PM is the difference between 76 minutes of impact and 45 minutes of impact. That is 340 fewer failed requests and zero angry tweets.

