Your /api/checkout started returning 500 errors at 2:13 AM. Your external monitor pinged the health check at 2:15 AM — it returned 200 (the health check doesn't test checkout). Your on-call engineer got paged at 3:00 AM when a customer tweeted about failed payments.

MTTD for that incident: 47 minutes. Nearly an hour of lost revenue before anyone knew.

Mean Time to Detect (MTTD) is the most underrated metric in incident management. Everyone focuses on MTTR (time to fix). But you can't fix what you don't know is broken.

What is MTTD?

MTTD = time when incident is detected − time when incident began

It measures the gap between "something broke" and "someone knows." This gap is pure waste — every second of undetected downtime is a second of user impact with zero response activity.
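Because MTTD is a mean, it is normally computed over many incidents rather than one. The following is a minimal sketch of that calculation. The `Incident` shape, the function names, the calendar dates, and the second incident are all assumptions made for illustration; only the 2:13 AM to 3:00 AM gap comes from the example above.

```typescript
// Hypothetical incident record; field names are illustrative, not from any tool.
interface Incident {
  startedAt: Date;  // when the failure actually began
  detectedAt: Date; // when an alert or a human first noticed it
}

// Time to detect for a single incident, in minutes.
function timeToDetectMinutes(incident: Incident): number {
  return (incident.detectedAt.getTime() - incident.startedAt.getTime()) / 60000;
}

// MTTD is the arithmetic mean of the per-incident detection times.
function meanTimeToDetectMinutes(incidents: Incident[]): number {
  if (incidents.length === 0) {
    throw new Error("MTTD is undefined without at least one incident");
  }
  const total = incidents.reduce((sum, i) => sum + timeToDetectMinutes(i), 0);
  return total / incidents.length;
}

// The checkout incident (dates are made up) plus one invented 3-minute detection.
const incidents: Incident[] = [
  { startedAt: new Date("2024-01-01T02:13:00Z"), detectedAt: new Date("2024-01-01T03:00:00Z") },
  { startedAt: new Date("2024-01-08T14:00:00Z"), detectedAt: new Date("2024-01-08T14:03:00Z") },
];

console.log(timeToDetectMinutes(incidents[0]));  // 47
console.log(meanTimeToDetectMinutes(incidents)); // 25
```

A single slow detection dominates the mean, which is one reason some teams also track the median or worst case alongside it.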

    Incident timeline:
    ─────────────────────────────────────────────────
    2:13 AM  │ API starts returning 500s
             │
             │  ← MTTD: 47 minutes (nobody knows yet)
             │
    3:00 AM  │ Customer tweets, engineer paged
             │
             │  ← MTTR: 23 minutes (fixing the issue)
             │
    3:23 AM  │ Fix deployed, API recovered
    ─────────────────────────────────────────────────
    Total incident: 70 minutes
    Detection took: 67% of total time

In this example, detection was 67% of the total incident duration. Cutting MTTD from 47 minutes to 1 minute would have reduced total impact from 70 minutes to 24 minutes — without changing how fast the team fixes things.
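The arithmetic behind those figures can be checked with a short sketch. The `breakdown` helper and its field names are hypothetical; the inputs (47, 23, and 1 minute) are the ones used in this example.

```typescript
// Split an incident into its detection and repair phases (all values in minutes).
interface IncidentBreakdown {
  detectMinutes: number;
  repairMinutes: number;
  totalMinutes: number;
  detectionShare: number; // fraction of the incident spent undetected
}

function breakdown(detectMinutes: number, repairMinutes: number): IncidentBreakdown {
  const totalMinutes = detectMinutes + repairMinutes;
  // Guard against dividing by zero for a zero-length incident.
  const detectionShare = totalMinutes === 0 ? 0 : detectMinutes / totalMinutes;
  return { detectMinutes, repairMinutes, totalMinutes, detectionShare };
}

const actual = breakdown(47, 23);  // what happened
const improved = breakdown(1, 23); // same repair speed, 1-minute detection

console.log(actual.totalMinutes);                      // 70
console.log(Math.round(actual.detectionShare * 100));  // 67
console.log(improved.totalMinutes);                    // 24
```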

MTTD vs MTTR vs MTTA vs MTBF

Metric                  | Measures                           | You control it by
MTTD (Detect)           | Time to discover the incident      | Better monitoring, faster alerts
MTTA (Acknowledge)      | Time from alert to human response  | On-call processes, alert routing
MTTR (Recovery)         | Time to fix and restore service    | Runbooks, rollback automation
MTBF (Between Failures) | Time between incidents             | Code quality, testing, reliability

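Total incident impact = MTTD + MTTA + MTTR, where MTTR counts only the repair time after acknowledgement. In the timeline above, the page and the engineer's response coincide, so MTTA is effectively zero and the total is 47 + 23 = 70 minutes.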

MTTD is the easiest to reduce because it's purely a monitoring problem — no human decision-making required.

MTTD Benchmarks by Monitoring Approach

Approach                              | Typical MTTD   | Why
Customer reports (Twitter, email)     | 30-120 minutes | You wait for users to tell you
External pings (UptimeRobot, Pingdom) | 1-5 minutes    | Check intervals + alert delay
APM tools (Datadog, New Relic)        | 1-3 minutes    | Sampling + evaluation windows
Internal SDK (Nurbak Watch)           | < 10 seconds   | Every request monitored, instant alerts
Elite SRE teams target                | < 60 seconds   | Multi-layer monitoring + automation

Why MTTD Is High (and How to Fix It)

1. Health checks don't test what breaks

Your /api/health returns 200 while /api/checkout returns 500. The health check tests database connectivity — it doesn't test business logic. Fix: monitor every endpoint, not just the health check.
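One way to picture "monitor every endpoint" is a probe that checks each route separately and reports them one by one. This is a rough sketch, not a production monitor: `probeEndpoints`, `Fetcher`, and `ProbeResult` are hypothetical names, and it assumes each listed path answers a plain GET with a 2xx status when it is working. A real checkout route would likely need a safe, side-effect-free test request instead.

```typescript
// Minimal shape of a fetch-like function, so the probe can run against a stub.
type Fetcher = (url: string) => Promise<{ status: number }>;

interface ProbeResult {
  path: string;
  status: number | null; // null when the request itself failed (timeout, refused, DNS)
  healthy: boolean;
}

// Probe every listed path, not just the health check, and report each one separately.
async function probeEndpoints(
  baseUrl: string,
  paths: string[],
  fetcher: Fetcher,
): Promise<ProbeResult[]> {
  return Promise.all(
    paths.map(async (path): Promise<ProbeResult> => {
      try {
        const res = await fetcher(baseUrl + path);
        // Treat only 2xx responses as healthy.
        return { path, status: res.status, healthy: res.status >= 200 && res.status < 300 };
      } catch {
        // A network-level failure counts as unhealthy too.
        return { path, status: null, healthy: false };
      }
    }),
  );
}
```

On Node 18 or later, the built-in `fetch` can be passed as the fetcher, for example `probeEndpoints("https://example.com", ["/api/health", "/api/checkout"], (url) => fetch(url))`. With this shape, a 200 from `/api/health` no longer hides a 500 from `/api/checkout`.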

2. Check intervals are too long

A 5-minute check interval means up to 5 minutes of undetected downtime. A 1-minute interval is better but still misses sub-minute outages. Fix: use real-traffic monitoring instead of interval-based pings.
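The effect of the check interval can be estimated with simple arithmetic. The sketch below assumes an outage begins at a uniformly random moment between two checks, which is a simplification; the function names are hypothetical.

```typescript
// Chance that at least one periodic check lands inside an outage of the given
// length, assuming the outage starts at a uniformly random point between checks.
function catchProbability(outageSeconds: number, intervalSeconds: number): number {
  if (intervalSeconds <= 0) throw new Error("interval must be positive");
  return Math.min(1, Math.max(0, outageSeconds) / intervalSeconds);
}

// Average wait until the next check sees a failure that persists.
function averageDelaySeconds(intervalSeconds: number): number {
  return intervalSeconds / 2;
}

console.log(catchProbability(30, 60));  // 0.5  (a 30 s outage vs. 1-minute checks)
console.log(catchProbability(30, 300)); // 0.1  (the same outage vs. 5-minute checks)
console.log(averageDelaySeconds(300));  // 150  (worst case is the full 300 s)
```

Under that assumption, a 30-second outage is missed entirely half the time even with 1-minute checks.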

3. Alert evaluation windows add delay

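Most tools require "condition met for X minutes" before alerting (to avoid false positives). With a 1-minute check interval and a 2-minute evaluation window, the first failed check can arrive up to 1 minute after the outage begins and the alert fires 2 minutes after that, so MTTD is 2 to 3 minutes before notification delivery is even counted. Fix: use instant alerting with deduplication instead of evaluation windows.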
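Instant alerting with deduplication can be sketched as "send the first alert right away, then stay quiet for a cooldown period". The class below is an illustrative assumption, not how any particular tool implements it; `AlertDeduplicator`, the alert key format, and the 5-minute cooldown are all made up.

```typescript
// Sends the first alert for a key immediately, then suppresses repeats until
// the cooldown has elapsed. Timestamps are passed in so the logic is testable.
class AlertDeduplicator {
  private lastSent = new Map<string, number>();
  private cooldownMs: number;

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true when the caller should actually send a notification.
  shouldSend(key: string, nowMs: number = Date.now()): boolean {
    const last = this.lastSent.get(key);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false; // duplicate inside the cooldown: stay quiet
    }
    this.lastSent.set(key, nowMs);
    return true;
  }
}

const dedup = new AlertDeduplicator(5 * 60 * 1000); // hypothetical 5-minute cooldown
console.log(dedup.shouldSend("POST /api/checkout 500", 0));      // true  (first failure alerts at once)
console.log(dedup.shouldSend("POST /api/checkout 500", 10000));  // false (repeat 10 s later is suppressed)
console.log(dedup.shouldSend("POST /api/checkout 500", 300000)); // true  (cooldown has elapsed)
```

The trade-off is that a single transient error now pages someone, so this approach tends to suit endpoints where even one failure matters.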

4. Alerts go to the wrong channel

An email alert sent at 3 AM can sit unread for 5+ hours, until someone wakes up and checks their inbox. A Slack notification during a busy day gets buried. Fix: use WhatsApp or phone calls for critical alerts, since these channels can break through Do Not Disturb (DND).
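A routing policy like this can be written down explicitly so that severity, not habit, decides the channel. The mapping below is only an illustrative assumption; the severity levels and the `channelsFor` function are hypothetical, and the channel names are the ones mentioned in this article.

```typescript
type Severity = "critical" | "warning" | "info";
type Channel = "phone" | "whatsapp" | "slack" | "email";

// Hypothetical policy: the more urgent the alert, the more interrupting the channel.
function channelsFor(severity: Severity): Channel[] {
  switch (severity) {
    case "critical":
      return ["phone", "whatsapp"]; // must reach a human even at 3 AM
    case "warning":
      return ["slack"]; // seen during working hours
    case "info":
      return ["email"]; // can wait for the next inbox check
  }
}

console.log(channelsFor("critical")); // [ 'phone', 'whatsapp' ]
```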

5. Partial failures are invisible

Suppose 3% of requests to /api/payments fail due to a race condition. External monitoring is unlikely to catch it, because any single synthetic ping succeeds 97% of the time. Fix: monitor error rates from real traffic, not synthetic checks.
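Error-rate monitoring on real traffic can be approximated with a sliding window over recent responses. The sketch below is an assumption about how such a monitor might look, not a description of any product; `ErrorRateMonitor`, the 60-second window, the 2% threshold, and the 50-request minimum are all hypothetical values.

```typescript
// Tracks request outcomes over a sliding time window and flags a high error rate.
class ErrorRateMonitor {
  private samples: { atMs: number; failed: boolean }[] = [];
  private windowMs: number;
  private threshold: number;
  private minRequests: number;

  constructor(windowMs: number, threshold: number, minRequests: number) {
    this.windowMs = windowMs;       // how far back to look
    this.threshold = threshold;     // error rate that triggers an alert (0.02 = 2%)
    this.minRequests = minRequests; // avoid alerting on a tiny sample
  }

  // Record one real response; 5xx status codes count as failures.
  record(statusCode: number, nowMs: number = Date.now()): void {
    this.samples.push({ atMs: nowMs, failed: statusCode >= 500 });
    // Drop samples that have fallen out of the window.
    while (this.samples.length > 0 && this.samples[0].atMs <= nowMs - this.windowMs) {
      this.samples.shift();
    }
  }

  errorRate(): number {
    if (this.samples.length === 0) return 0;
    const failures = this.samples.filter((s) => s.failed).length;
    return failures / this.samples.length;
  }

  isAlerting(): boolean {
    return this.samples.length >= this.minRequests && this.errorRate() >= this.threshold;
  }
}

// Simulate 100 requests over 10 seconds, 3 of which fail.
const monitor = new ErrorRateMonitor(60000, 0.02, 50);
for (let i = 0; i < 100; i++) {
  monitor.record(i % 33 === 32 ? 500 : 200, i * 100);
}
console.log(monitor.errorRate());  // 0.03
console.log(monitor.isAlerting()); // true
```

A synthetic ping would see this endpoint as healthy on most attempts, while the window over real traffic surfaces the 3% failure rate directly.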

How to Achieve Sub-10-Second MTTD

Nurbak Watch monitors every API route from inside your Next.js server. It sees every request — not samples, not pings — and alerts within 10 seconds of detecting an anomaly.

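    // instrumentation.ts
    import { initWatch } from '@nurbak/watch'

    // Next.js calls register() once when the server starts.
    export function register() {
      initWatch({
        // API key read from the environment rather than hard-coded.
        apiKey: process.env.NURBAK_WATCH_KEY,
      })
    }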

What makes MTTD under 10 seconds possible:

  • Every request monitored: no sampling, no intervals. The first failed request triggers detection.
  • Instant alerting: Slack, email, or WhatsApp in under 10 seconds. No evaluation windows.
  • Per-endpoint monitoring: a failure in /api/checkout is detected even while /api/health keeps returning 200.
  • WhatsApp alerts: bypass DND at 3 AM. 98% read rate vs 20% for email.

Free during beta. 5 lines of code. MTTD measured in seconds, not minutes.
