What is MTTD (Mean Time to Detect)?

MTTD (Mean Time to Detect) is the average time between when an incident begins and when your team becomes aware of it. For example, if your API starts returning 500 errors at 2:00 AM and your monitoring alert fires at 2:12 AM, the MTTD for that incident is 12 minutes. Lower MTTD means you catch problems faster, reducing the total impact on users.

Industry benchmarks for MTTD vary by monitoring approach. External ping monitoring (UptimeRobot, Pingdom) typically has a 1-5 minute MTTD because of check intervals. APM tools (Datadog, New Relic) achieve a 1-3 minute MTTD once sampling and alert evaluation windows are accounted for. Internal SDK monitoring (Nurbak Watch) can achieve an MTTD under 10 seconds by detecting issues from real traffic in real time. Elite DevOps teams target an MTTD under 60 seconds for critical endpoints.

What is the difference between MTTD and MTTR?

MTTD (Mean Time to Detect) measures how long it takes to discover an incident. MTTR (Mean Time to Recovery) measures how long it takes to restore service after detection. When the time to acknowledge the alert is counted as part of MTTR, total incident duration = MTTD + MTTR. Reducing MTTD is often easier and more impactful than reducing MTTR, because detection can be automated while fixes usually require human intervention.

MTTD Explained: Measure & Reduce Mean Time to Detect (2026)

Your /api/checkout started returning 500 errors at 2:13 AM. Your external monitor pinged the health check at 2:15 AM — it returned 200 (the health check doesn't test checkout). Your on-call engineer got paged at 3:00 AM when a customer tweeted about failed payments.

MTTD for that incident: 47 minutes. Nearly an hour of lost revenue before anyone knew.

Mean Time to Detect (MTTD) is the most underrated metric in incident management. Everyone focuses on MTTR (time to fix). But you can't fix what you don't know is broken.

What is MTTD?

MTTD = time when incident is detected − time when incident began

It measures the gap between "something broke" and "someone knows." This gap is pure waste — every second of undetected downtime is a second of user impact with zero response activity.

    Incident timeline:
    ─────────────────────────────────────────────────
    2:13 AM  │ API starts returning 500s
             │
             │  ← MTTD: 47 minutes (nobody knows yet)
             │
    3:00 AM  │ Customer tweets, engineer paged
             │
             │  ← MTTR: 23 minutes (fixing the issue)
             │
    3:23 AM  │ Fix deployed, API recovered
    ─────────────────────────────────────────────────
    Total incident: 70 minutes
    Detection took: 67% of total time

In this example, detection was 67% of the total incident duration. Cutting MTTD from 47 minutes to 1 minute would have reduced total impact from 70 minutes to 24 minutes — without changing how fast the team fixes things.
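The arithmetic above can be reproduced in a few lines. The following is a minimal sketch, assuming a hypothetical `Incident` record with three timestamps; the field names and the calendar date are illustrative and do not come from any particular tool.

```typescript
// Hypothetical incident record: the field names are illustrative only.
interface Incident {
  startedAt: Date;  // when the failure began
  detectedAt: Date; // when the team became aware of it
  resolvedAt: Date; // when service was restored
}

const MS_PER_MINUTE = 60 * 1000;

// Minutes elapsed between two timestamps.
function minutesBetween(from: Date, to: Date): number {
  return (to.getTime() - from.getTime()) / MS_PER_MINUTE;
}

// Mean Time to Detect: average of (detectedAt - startedAt) across incidents.
function meanTimeToDetect(incidents: Incident[]): number {
  if (incidents.length === 0) {
    throw new Error("need at least one incident");
  }
  const total = incidents.reduce(
    (sum, incident) => sum + minutesBetween(incident.startedAt, incident.detectedAt),
    0
  );
  return total / incidents.length;
}

// Fraction of an incident's total duration that was spent undetected.
function detectionShare(incident: Incident): number {
  return (
    minutesBetween(incident.startedAt, incident.detectedAt) /
    minutesBetween(incident.startedAt, incident.resolvedAt)
  );
}

// The checkout incident from the timeline (the date itself is arbitrary).
const checkout: Incident = {
  startedAt: new Date("2026-01-01T02:13:00Z"),
  detectedAt: new Date("2026-01-01T03:00:00Z"),
  resolvedAt: new Date("2026-01-01T03:23:00Z"),
};

console.log(meanTimeToDetect([checkout]));               // 47
console.log(Math.round(detectionShare(checkout) * 100)); // 67
```

With more than one incident in the array, `meanTimeToDetect` returns the average, which is the "mean" in MTTD.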

MTTD vs MTTR vs MTTA vs MTBF

Metric	Measures	You control it by
MTTD (Detect)	Time to discover the incident	Better monitoring, faster alerts
MTTA (Acknowledge)	Time from alert to human response	On-call processes, alert routing
MTTR (Recovery)	Time to fix and restore service	Runbooks, rollback automation
MTBF (Between Failures)	Time between incidents	Code quality, testing, reliability

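Total incident impact = MTTD + MTTA + MTTR (with MTTR counted from acknowledgment to recovery)

MTTD is usually the easiest of the three to reduce because it is largely a monitoring problem: detection can be automated, while acknowledgment and repair depend on people.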

MTTD Benchmarks by Monitoring Approach

Approach	Typical MTTD	Why
Customer reports (Twitter, email)	30-120 minutes	You wait for users to tell you
External pings (UptimeRobot, Pingdom)	1-5 minutes	Check intervals + alert delay
APM tools (Datadog, New Relic)	1-3 minutes	Sampling + evaluation windows
Internal SDK (Nurbak Watch)	< 10 seconds	Every request monitored, instant alerts
Elite SRE teams target	< 60 seconds	Multi-layer monitoring + automation

Why MTTD Is High (and How to Fix It)

1. Health checks don't test what breaks

Your /api/health returns 200 while /api/checkout returns 500. The health check tests database connectivity — it doesn't test business logic. Fix: monitor every endpoint, not just the health check.
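To make the gap concrete, here is a small sketch that compares checking only the health route with checking every route. The `Probe` type and `findFailingEndpoints` function are hypothetical names, and the probe is a synchronous stand-in so the example runs without a network; a real probe would issue HTTP requests.

```typescript
// Result of probing one endpoint. `status` is the HTTP status code returned.
interface ProbeResult {
  path: string;
  status: number;
}

// A probe is any function that returns the status code for a path.
// It is injected so the sketch can run without a network.
type Probe = (path: string) => number;

// Probe every listed path and return the ones that answered with a 5xx.
function findFailingEndpoints(paths: string[], probe: Probe): ProbeResult[] {
  return paths
    .map((path) => ({ path, status: probe(path) }))
    .filter((result) => result.status >= 500);
}

// Stand-in for the scenario described here: health is fine, checkout is broken.
const fakeProbe: Probe = (path) => (path === "/api/checkout" ? 500 : 200);

const healthOnly = findFailingEndpoints(["/api/health"], fakeProbe);
const allRoutes = findFailingEndpoints(
  ["/api/health", "/api/checkout", "/api/payments"],
  fakeProbe
);

console.log(healthOnly.length); // 0: the outage is invisible
console.log(allRoutes);         // one entry, for /api/checkout with status 500
```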

2. Check intervals are too long

A 5-minute check interval means up to 5 minutes of undetected downtime. A 1-minute interval is better but still misses sub-minute outages. Fix: use real-traffic monitoring instead of interval-based pings.
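The effect of the interval can be estimated with simple arithmetic. This sketch assumes checks are instantaneous and that a failure is equally likely to start at any point between two checks; the function names are illustrative.

```typescript
// Worst-case delay before an interval-based check sees a lasting failure:
// the failure starts just after a check, so it waits one full interval.
function worstCaseDelaySeconds(intervalSeconds: number): number {
  return intervalSeconds;
}

// Average delay, assuming the failure starts at a uniformly random point
// between two checks.
function averageDelaySeconds(intervalSeconds: number): number {
  return intervalSeconds / 2;
}

// Probability that an outage of the given length falls entirely between two
// checks and is never observed (same uniform-start assumption).
function missProbability(outageSeconds: number, intervalSeconds: number): number {
  if (outageSeconds >= intervalSeconds) return 0;
  return 1 - outageSeconds / intervalSeconds;
}

console.log(worstCaseDelaySeconds(300)); // 300 (5 minutes)
console.log(averageDelaySeconds(60));    // 30
console.log(missProbability(30, 60));    // 0.5: half of 30-second outages go unseen
```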

3. Alert evaluation windows add delay

Most tools require "condition met for X minutes" before alerting, to avoid false positives. With a 1-minute check interval and a 2-minute evaluation window, the alert fires 2 to 3 minutes after the failure begins: up to 1 minute until the first failing check, then 2 minutes for the window. Fix: use instant alerting with deduplication instead of evaluation windows.
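One way to alert instantly without flooding the on-call engineer is to send on the first failure and suppress repeats for a cooldown period. The class below is a minimal sketch of that idea, not the implementation of any specific product; `DedupAlerter`, its parameters, and the 5-minute cooldown are assumptions chosen for illustration.

```typescript
// Sends an alert on the first failure for a key, then suppresses repeats
// for `cooldownMs`. Timestamps are passed in so the logic is easy to test.
class DedupAlerter {
  private lastSentAt = new Map<string, number>();
  private cooldownMs: number;
  private send: (key: string) => void;

  constructor(cooldownMs: number, send: (key: string) => void) {
    this.cooldownMs = cooldownMs;
    this.send = send;
  }

  // Returns true when an alert was actually sent.
  reportFailure(key: string, nowMs: number): boolean {
    const last = this.lastSentAt.get(key);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false; // duplicate within the cooldown: stay quiet
    }
    this.lastSentAt.set(key, nowMs);
    this.send(key);
    return true;
  }
}

const sent: string[] = [];
const alerter = new DedupAlerter(5 * 60 * 1000, (key) => {
  sent.push(key);
});

alerter.reportFailure("/api/checkout", 0);      // first failure: sent immediately
alerter.reportFailure("/api/checkout", 10000);  // 10 s later: suppressed
alerter.reportFailure("/api/payments", 10000);  // different key: sent
alerter.reportFailure("/api/checkout", 300000); // cooldown elapsed: sent again

console.log(sent); // ["/api/checkout", "/api/payments", "/api/checkout"]
```

The trade-off is that a single transient error now pages someone, so this approach tends to suit critical endpoints where a false positive is cheaper than a slow detection.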

4. Alerts go to the wrong channel

An email alert at 3 AM can sit unread for hours, until the on-call engineer wakes up. A Slack notification during a busy day gets buried. Fix: use WhatsApp or phone calls for critical alerts, on channels configured to bypass Do Not Disturb (DND).
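A routing rule like this can be written down as a small table. The severities, channel names, and mapping below are hypothetical and only illustrate the idea of sending urgent alerts to interruptive channels.

```typescript
type Severity = "critical" | "warning" | "info";
type Channel = "phone" | "whatsapp" | "slack" | "email";

// Hypothetical routing table: urgent alerts go to interruptive channels,
// everything else to channels that can wait until working hours.
const routes: Record<Severity, Channel[]> = {
  critical: ["phone", "whatsapp"],
  warning: ["slack"],
  info: ["email"],
};

// Look up the channels an alert of the given severity should be sent to.
function channelsFor(severity: Severity): Channel[] {
  return routes[severity];
}

console.log(channelsFor("critical")); // ["phone", "whatsapp"]
console.log(channelsFor("info"));     // ["email"]
```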

5. Partial failures are invisible

3% of requests to /api/payments fail due to a race condition. External monitoring rarely catches it, because a synthetic ping against that endpoint succeeds almost every time (roughly 97% of checks, if failures are spread evenly). Fix: monitor error rates from real traffic, not synthetic checks.
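Tracking the error rate over recent real requests makes a partial failure visible. The sketch below keeps a fixed-size window of outcomes; the class name, the window size of 100, and the 1% threshold are assumptions for illustration.

```typescript
// Tracks the outcome of the most recent real requests and reports the
// share of them that failed.
class ErrorRateWindow {
  private outcomes: boolean[] = []; // true = request failed
  private size: number;

  constructor(size: number) {
    this.size = size;
  }

  // Record one request outcome, dropping the oldest once the window is full.
  record(failed: boolean): void {
    this.outcomes.push(failed);
    if (this.outcomes.length > this.size) {
      this.outcomes.shift();
    }
  }

  // Fraction of requests in the window that failed (0 when empty).
  errorRate(): number {
    if (this.outcomes.length === 0) return 0;
    const failures = this.outcomes.filter((failed) => failed).length;
    return failures / this.outcomes.length;
  }
}

// Simulate 100 requests where requests 33, 66 and 99 fail (3 of 100).
const recent = new ErrorRateWindow(100);
for (let i = 1; i <= 100; i++) {
  recent.record(i % 33 === 0);
}

const threshold = 0.01; // assumed alert threshold of 1%
console.log(recent.errorRate());             // 0.03
console.log(recent.errorRate() > threshold); // true
```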

How to Achieve Sub-10-Second MTTD

Nurbak Watch monitors every API route from inside your Next.js server. It sees every request — not samples, not pings — and alerts within 10 seconds of detecting an anomaly.

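    // instrumentation.ts
    import { initWatch } from '@nurbak/watch'

    // Next.js calls register() once when the server instance starts.
    export function register() {
      initWatch({
        // API key read from the environment rather than hard-coded.
        apiKey: process.env.NURBAK_WATCH_KEY,
      })
    }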

What makes MTTD under 10 seconds possible:

Every request monitored: no sampling, no intervals. The first failed request triggers detection.
Instant alerting: Slack, email, or WhatsApp in under 10 seconds. No evaluation windows.
Per-endpoint monitoring: a failure in /api/checkout is detected even while /api/health still returns 200.
WhatsApp alerts: bypass DND at 3 AM. 98% read rate vs 20% for email.

Free during beta. 5 lines of code. MTTD measured in seconds, not minutes.

Fabian Delgado
