Your REST API is "up." Congratulations. That tells you almost nothing.

Uptime means the server responds. It doesn't tell you that /api/checkout is taking 4 seconds instead of 400 milliseconds. It doesn't tell you that 3% of requests to /api/users are returning 500 errors. It doesn't tell you that your most critical endpoint is 10x slower during peak hours.

Uptime is a binary metric. Real API monitoring is about understanding how well your API is performing — across every endpoint, every minute, every percentile.

This guide covers the 5 metrics every REST API should track, how to measure each one, and which tools to use depending on your team size and stack.

Metric 1: Uptime — But Measured Correctly

Uptime is the most basic metric, but most teams measure it wrong.

What most teams do: An external service pings /api/health every 60 seconds. If it returns 200, the API is "up." This gives you a number like 99.9% — which sounds great until you realize it means 8.7 hours of downtime per year, measured in 60-second increments that miss everything in between.

What you should do: Calculate uptime from real request data. If you served 1,000,000 requests this month and 2,000 returned 5xx errors, your effective uptime is 99.8% — regardless of what the health check says.

// Real uptime calculation
const totalRequests = 1_000_000
const serverErrors = 2_000  // 5xx responses only
const effectiveUptime = ((totalRequests - serverErrors) / totalRequests) * 100
// 99.8% — more accurate than any ping-based check

Targets by tier:

| SLA | Allowed downtime/year | Typical for |
|--------|-----------------------|-------------|
| 99.0%  | 3.65 days   | Internal tools, staging |
| 99.9%  | 8.7 hours   | Most SaaS products |
| 99.95% | 4.4 hours   | Payment / auth APIs |
| 99.99% | 52 minutes  | Infrastructure APIs (AWS, Stripe) |
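The downtime figures in the table follow directly from the SLA percentage. A minimal sketch in plain TypeScript (no library assumed):

```typescript
// Convert an SLA percentage into allowed downtime per year.
const MINUTES_PER_YEAR = 365 * 24 * 60 // 525,600

function allowedDowntimeMinutes(slaPercent: number): number {
  return MINUTES_PER_YEAR * (1 - slaPercent / 100)
}

allowedDowntimeMinutes(99.9)  // ≈ 525.6 minutes ≈ 8.76 hours
allowedDowntimeMinutes(99.99) // ≈ 52.6 minutes
```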

Metric 2: Latency Percentiles — P50, P95, P99

Average response time is a lie. If 99 requests take 50ms and 1 request takes 10 seconds, the average is 149ms. That single number hides the fact that 1% of your users are having a terrible experience.

Percentiles tell the real story:

  • P50 (median) — The typical experience. 50% of requests are faster than this. If your P50 is 80ms, most users are happy.
  • P95 — The experience of your slowest 5% of users. This catches slow database queries, cold starts, and n+1 problems. If your P95 is 2 seconds, 1 in 20 requests is painfully slow.
  • P99 — The worst 1%. This catches connection pool exhaustion, garbage collection pauses, and timeout cascades. If your P99 is 8 seconds, your most active users (who make the most requests) will hit this regularly.
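One straightforward way to compute these numbers from raw latency samples is the nearest-rank method. A sketch, not tied to any particular monitoring library (production systems typically use streaming estimators instead of sorting every sample):

```typescript
// Nearest-rank percentile: sort the samples, then pick the value
// at rank ceil(p/100 * n).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}

// Ten latency samples in ms — one slow outlier dominates P95/P99
// while leaving P50 untouched.
const latencies = [42, 45, 48, 51, 55, 60, 75, 120, 340, 2100]
percentile(latencies, 50) // P50: the typical request
percentile(latencies, 95) // P95: the slow tail
percentile(latencies, 99) // P99: the worst 1%
```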

Why P99 matters more than you think: A user who makes 100 API calls per session has a 63% chance of experiencing the P99 latency at least once. For your power users, P99 is the experience.

// The math: probability of NOT hitting P99 in N requests
// P(never hitting P99) = 0.99^N
// For N=100: 0.99^100 = 0.366 → 63.4% chance of hitting P99 at least once

// Latency percentiles per endpoint — what you should see in your dashboard
// GET /api/users     → P50: 45ms  | P95: 120ms  | P99: 340ms   ✅
// GET /api/products  → P50: 80ms  | P95: 450ms  | P99: 2100ms  ⚠️
// POST /api/checkout → P50: 200ms | P95: 1800ms | P99: 8500ms  🔴

Targets: P50 under 100ms, P95 under 500ms, P99 under 2 seconds. Anything above P99 of 5 seconds needs immediate investigation.

Metric 3: Error Rate by Endpoint

A global error rate of 0.5% feels fine. But what if all those errors come from one endpoint?

// Global view: 0.5% error rate — looks fine
// Per-endpoint view:
// GET  /api/users     → 0.01% errors  ✅
// GET  /api/products  → 0.02% errors  ✅
// POST /api/checkout  → 12.4% errors  🔴  ← This is where all the errors are
// GET  /api/analytics → 0.00% errors  ✅

Per-endpoint error rates reveal problems that global metrics hide entirely. Your checkout endpoint could be failing for 1 in 8 users while your overall error rate looks healthy.

What to track:

  • 4xx rate — Client errors. A sudden spike in 400s or 422s often means a frontend deployment broke request payloads. A spike in 401s means auth is broken.
  • 5xx rate — Server errors. These are always your fault. Any sustained 5xx rate above 0.1% on a critical endpoint needs investigation.
  • Error budget — If your SLA allows 0.1% errors, and you've used 80% of your monthly budget by the 15th, slow down deployments and focus on stability.

Targets: 5xx rate below 0.1% per endpoint. 4xx rate tracked for anomalies (no fixed target since some 4xx is normal).

Metric 4: Throughput — Requests per Minute

Throughput tells you how much traffic each endpoint handles. By itself it's informational, but combined with latency and error rates, it becomes diagnostic:

  • Throughput up + latency up = You're approaching capacity limits. Scale horizontally or optimize.
  • Throughput up + errors up = You're past capacity. Something is rejecting requests under load.
  • Throughput down + latency up = A dependency is slow and requests are queuing. Database or external API issue.
  • Throughput down + errors same = Traffic dropped. Could be normal (off-peak) or a problem (DNS, CDN, frontend broken).

// Throughput patterns to watch
// Normal day:
// 09:00 → 1,200 rpm → P95: 120ms → Errors: 0.02%
// 12:00 → 2,800 rpm → P95: 135ms → Errors: 0.03%  ← Peak, handling it fine
// 18:00 → 1,500 rpm → P95: 115ms → Errors: 0.01%

// Problem day:
// 09:00 → 1,200 rpm → P95: 120ms → Errors: 0.02%
// 12:00 → 2,800 rpm → P95: 890ms → Errors: 2.10%  ← Can't handle peak load
// 12:15 → 1,100 rpm → P95: 3200ms → Errors: 8.40% ← Cascading failure

Targets: No fixed target — track the baseline and alert on deviations (±30% from typical for that time of day).
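The ±30% rule can be expressed directly. A sketch, assuming you keep a baseline of typical requests per minute for each time of day (the numbers below reuse the example traffic above):

```typescript
// Alert when throughput deviates more than ±30% from the baseline
// for the same time of day.
function throughputDeviates(
  currentRpm: number,
  baselineRpm: number,
  tolerance = 0.3
): boolean {
  if (baselineRpm === 0) return currentRpm > 0
  return Math.abs(currentRpm - baselineRpm) / baselineRpm > tolerance
}

throughputDeviates(2800, 2700) // false — within 30% of the noon baseline
throughputDeviates(1100, 2700) // true  — ~59% drop, worth an alert
```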

Metric 5: Slow Endpoint Detection

Most monitoring tools let you set static thresholds: "alert if response time exceeds 2 seconds." This works until you have 30 endpoints with different normal ranges.

Slow endpoint detection means automatically identifying which routes are degrading relative to their own baseline:

| Endpoint | Normal P95 | Current P95 | Change | Status |
|----------|------------|-------------|--------|--------|
| GET /api/users     | 120ms | 125ms  | +4%    | Normal |
| GET /api/products  | 80ms  | 340ms  | +325%  | Degraded |
| POST /api/checkout | 200ms | 210ms  | +5%    | Normal |
| GET /api/search    | 150ms | 4200ms | +2700% | Critical |
A 2-second static threshold would miss /api/products at 340ms (it's under the threshold but 4x its normal speed). And /api/search at 4.2 seconds is obviously broken, but you'd want to know about the products endpoint too.
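Baseline-relative detection is just a comparison against each route's own history instead of a global threshold. A minimal sketch (the 2x multiplier is a reasonable default, not a rule from any particular tool):

```typescript
interface RouteLatency {
  route: string
  baselineP95: number // ms, learned from history
  currentP95: number  // ms, current window
}

// Flag a route when its current P95 exceeds its own baseline by more
// than the given multiplier, regardless of any absolute threshold.
function degradedRoutes(routes: RouteLatency[], multiplier = 2): RouteLatency[] {
  return routes.filter((r) => r.currentP95 > r.baselineP95 * multiplier)
}

degradedRoutes([
  { route: 'GET /api/users',    baselineP95: 120, currentP95: 125 },
  { route: 'GET /api/products', baselineP95: 80,  currentP95: 340 },
  { route: 'GET /api/search',   baselineP95: 150, currentP95: 4200 },
])
// → flags /api/products and /api/search, even though 340ms is
//   well under a 2-second static threshold
```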

Monitoring Tools Compared

Three common approaches for REST API monitoring, depending on your team and budget:

Datadog

  • What it is: Full observability platform — APM, logs, infrastructure, synthetic checks
  • How it works: Agent daemon (300-500MB RAM) + language library (dd-trace)
  • Cost: $71/host/month (APM) + $15/host/month (infrastructure). A team with 3 servers: ~$258/month minimum
  • Setup time: 2-4 hours. 10+ environment variables, YAML config, agent installation
  • Best for: Large teams with dedicated DevOps, running Kubernetes with 50+ services
  • Limitation for Next.js: The Datadog agent can't run on Vercel serverless. You get degraded "agentless" mode with higher latency and sampling

New Relic

  • What it is: Full-stack observability with APM, browser monitoring, and AI ops
  • How it works: Language agent (newrelic npm package) + cloud collector
  • Cost: Free tier (100GB data/month), then $49+/host/month. Data ingestion charges can spike unexpectedly
  • Setup time: 1-2 hours. Simpler than Datadog but still requires config file and multiple env vars
  • Best for: Mid-size teams that want full observability without Datadog's price tag
  • Limitation for Next.js: The Node.js agent adds 200-400ms to cold starts via monkey-patching. Partial serverless support

Nurbak Watch

  • What it is: Lightweight API monitoring SDK built for Next.js
  • How it works: Uses the Next.js instrumentation.ts hook — runs inside your server process, no agent
  • Cost: Free during beta. Pro plan: $29/month (no per-host pricing)
  • Setup time: 5 minutes. One npm install, 5 lines of code, 1 environment variable
  • Best for: Solo developers and small teams (1-15) running Next.js on Vercel or Node.js
  • Tracks: P50/P95/P99 latency, error rates, throughput, cold starts — all per endpoint, automatically

| | Datadog | New Relic | Nurbak Watch |
|---|---------|-----------|--------------|
| Monthly cost (small team) | $258+ | $147+ | $0 (beta) / $29 |
| Setup time | 2-4 hours | 1-2 hours | 5 minutes |
| Lines of code | 50-100+ | 20-50 | 5 |
| Cold start impact | +200-800ms | +200-400ms | +5-15ms |
| Works on Vercel serverless | Partially | Partially | Fully |
| Auto-discovers API routes | Yes (with agent) | Yes (with agent) | Yes (native) |
| Per-endpoint P95/P99 | Yes | Yes | Yes |
| WhatsApp alerts | No | No | Yes |

Why Internal Monitoring Wins for REST APIs

External monitoring (pinging your API from outside) has fundamental blind spots for REST APIs:

  • It samples. A ping every 60 seconds tests 1 request per minute. Your API handles 2,000. That's 0.05% coverage.
  • It tests one endpoint. You have 20 routes. External monitors charge per endpoint, so most teams only monitor 2-3.
  • It can't see error rates. An external ping hits /api/health and gets 200. Meanwhile, /api/payments is returning 500 for 8% of real users.
  • It measures network + server time. A 200ms response from Virginia might be 50ms of server time and 150ms of network. You're optimizing the wrong thing.

Internal monitoring runs inside your server and sees every request. No sampling, no blind spots, real server-side timing. This is the difference between knowing your API is "reachable" and knowing it's actually working well.

Setup: Full REST API Monitoring in 5 Minutes

If you're running Next.js, here's how to go from zero to full monitoring with Nurbak Watch:

Step 1: Install

npm install @nurbak/watch

Step 2: Add instrumentation

// instrumentation.ts
import { initWatch } from '@nurbak/watch'

export function register() {
  initWatch({
    apiKey: process.env.NURBAK_WATCH_KEY,
  })
}

Step 3: Add your API key

# .env.local (or Vercel dashboard)
NURBAK_WATCH_KEY=your_key_here

Step 4: Deploy

Within 60 seconds of your first request, you'll see every API route in the dashboard with:

  • P50, P95, P99 latency per endpoint
  • Error rate (4xx/5xx) per endpoint
  • Throughput (requests per minute)
  • Automatic slow endpoint flagging
  • Real uptime calculated from actual request data

Alerts go to Slack, email, or WhatsApp within 10 seconds of an incident. One message per incident, not one per failed request.

What to Do After Setup

Once monitoring is running, here's the playbook:

  1. Week 1: Observe. Don't set alert thresholds yet. Let the tool establish baselines for each endpoint.
  2. Week 2: Set P95 thresholds per endpoint based on observed baselines (2x the baseline is a good starting point).
  3. Week 3: Set error rate thresholds. 0.5% for critical endpoints (checkout, auth), 2% for everything else.
  4. Ongoing: Review weekly. Look for slow trends — a P95 that increases 10% per week will be a problem in a month even if it's fine today.

Get Started — Free During Beta

Nurbak Watch is in beta and completely free during launch. All 5 metrics covered in this guide — latency percentiles, error rates, throughput, uptime, and slow endpoint detection — tracked automatically for every API route.

One npm install. Five lines of code. Every metric, every endpoint, every request.
