Your REST API is "up." Congratulations. That tells you almost nothing.

Uptime means the server responds. It doesn't tell you that /api/checkout is taking 4 seconds instead of 400 milliseconds. It doesn't tell you that 3% of requests to /api/users are returning 500 errors. It doesn't tell you that your most critical endpoint is 10x slower during peak hours.

Uptime is a binary metric. Real API monitoring is about understanding how well your API is performing — across every endpoint, every minute, every percentile.

This guide covers the 5 metrics every REST API should track, how to measure each one, and which tools to use depending on your team size and stack.

Metric 1: Uptime — But Measured Correctly

Uptime is the most basic metric, but most teams measure it wrong.

What most teams do: An external service pings /api/health every 60 seconds. If it returns 200, the API is "up." This gives you a number like 99.9% — which sounds great until you realize it means 8.7 hours of downtime per year, measured in 60-second increments that miss everything in between.

What you should do: Calculate uptime from real request data. If you served 1,000,000 requests this month and 2,000 returned 5xx errors, your effective uptime is 99.8% — regardless of what the health check says.

// Real uptime calculation
const totalRequests = 1_000_000
const serverErrors = 2_000  // 5xx responses only
const effectiveUptime = ((totalRequests - serverErrors) / totalRequests) * 100
// 99.8% — more accurate than any ping-based check

Targets by tier:

| SLA | Allowed downtime/year | Typical for |
|--------|-----------------------|-------------|
| 99.0%  | 3.65 days   | Internal tools, staging |
| 99.9%  | 8.7 hours   | Most SaaS products |
| 99.95% | 4.4 hours   | Payment / auth APIs |
| 99.99% | 52 minutes  | Infrastructure APIs (AWS, Stripe) |
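The downtime figures in the table follow directly from the SLA percentage. A minimal sketch in plain TypeScript (no library assumed):

```typescript
// Convert an SLA percentage into allowed downtime per year.
const MINUTES_PER_YEAR = 365 * 24 * 60 // 525,600

function allowedDowntimeMinutes(slaPercent: number): number {
  return MINUTES_PER_YEAR * (1 - slaPercent / 100)
}

allowedDowntimeMinutes(99.9)  // ≈ 525.6 minutes ≈ 8.76 hours
allowedDowntimeMinutes(99.99) // ≈ 52.6 minutes
```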

Metric 2: Latency Percentiles — P50, P95, P99

Average response time is a lie. If 99 requests take 50ms and 1 request takes 10 seconds, the average is 149ms. That single number hides the fact that 1% of your users are having a terrible experience.

Percentiles tell the real story:

  • P50 (median) — The typical experience. 50% of requests are faster than this. If your P50 is 80ms, most users are happy.
  • P95 — The experience of your slowest 5% of users. This catches slow database queries, cold starts, and n+1 problems. If your P95 is 2 seconds, 1 in 20 requests is painfully slow.
  • P99 — The worst 1%. This catches connection pool exhaustion, garbage collection pauses, and timeout cascades. If your P99 is 8 seconds, your most active users (who make the most requests) will hit this regularly.
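One straightforward way to compute these numbers from raw latency samples is the nearest-rank method. A sketch, not tied to any particular monitoring library (production systems typically use streaming estimators instead of sorting every sample):

```typescript
// Nearest-rank percentile: sort the samples, then pick the value
// at rank ceil(p/100 * n).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}

// Ten latency samples in ms — one slow outlier dominates P95/P99
// while leaving P50 untouched.
const latencies = [42, 45, 48, 51, 55, 60, 75, 120, 340, 2100]
percentile(latencies, 50) // P50: the typical request
percentile(latencies, 95) // P95: the slow tail
percentile(latencies, 99) // P99: the worst 1%
```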

Why P99 matters more than you think: A user who makes 100 API calls per session has a 63% chance of experiencing the P99 latency at least once. For your power users, P99 is the experience.

// The math: probability of NOT hitting P99 in N requests
// P(never hitting P99) = 0.99^N
// For N=100: 0.99^100 = 0.366 → 63.4% chance of hitting P99 at least once

// Latency percentiles per endpoint — what you should see in your dashboard
// GET /api/users     → P50: 45ms  | P95: 120ms  | P99: 340ms   ✅
// GET /api/products  → P50: 80ms  | P95: 450ms  | P99: 2100ms  ⚠️
// POST /api/checkout → P50: 200ms | P95: 1800ms | P99: 8500ms  🔴

Targets: P50 under 100ms, P95 under 500ms, P99 under 2 seconds. Anything above P99 of 5 seconds needs immediate investigation.

Metric 3: Error Rate by Endpoint

A global error rate of 0.5% feels fine. But what if all those errors come from one endpoint?

// Global view: 0.5% error rate — looks fine
// Per-endpoint view:
// GET  /api/users     → 0.01% errors  ✅
// GET  /api/products  → 0.02% errors  ✅
// POST /api/checkout  → 12.4% errors  🔴  ← This is where all the errors are
// GET  /api/analytics → 0.00% errors  ✅

Per-endpoint error rates reveal problems that global metrics hide entirely. Your checkout endpoint could be failing for 1 in 8 users while your overall error rate looks healthy.

What to track:

  • 4xx rate — Client errors. A sudden spike in 400s or 422s often means a frontend deployment broke request payloads. A spike in 401s means auth is broken.
  • 5xx rate — Server errors. These are always your fault. Any sustained 5xx rate above 0.1% on a critical endpoint needs investigation.
  • Error budget — If your SLA allows 0.1% errors, and you've used 80% of your monthly budget by the 15th, slow down deployments and focus on stability.

Targets: 5xx rate below 0.1% per endpoint. 4xx rate tracked for anomalies (no fixed target since some 4xx is normal).

Metric 4: Throughput — Requests per Minute

Throughput tells you how much traffic each endpoint handles. By itself it's informational, but combined with latency and error rates, it becomes diagnostic:

  • Throughput up + latency up = You're approaching capacity limits. Scale horizontally or optimize.
  • Throughput up + errors up = You're past capacity. Something is rejecting requests under load.
  • Throughput down + latency up = A dependency is slow and requests are queuing. Database or external API issue.
  • Throughput down + errors same = Traffic dropped. Could be normal (off-peak) or a problem (DNS, CDN, frontend broken).

// Throughput patterns to watch
// Normal day:
// 09:00 → 1,200 rpm → P95: 120ms → Errors: 0.02%
// 12:00 → 2,800 rpm → P95: 135ms → Errors: 0.03%  ← Peak, handling it fine
// 18:00 → 1,500 rpm → P95: 115ms → Errors: 0.01%

// Problem day:
// 09:00 → 1,200 rpm → P95: 120ms → Errors: 0.02%
// 12:00 → 2,800 rpm → P95: 890ms → Errors: 2.10%  ← Can't handle peak load
// 12:15 → 1,100 rpm → P95: 3200ms → Errors: 8.40% ← Cascading failure

Targets: No fixed target — track the baseline and alert on deviations (±30% from typical for that time of day).
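The ±30% rule can be expressed directly. A sketch, assuming you keep a baseline of typical requests per minute for each time of day (the numbers below reuse the example traffic above):

```typescript
// Alert when throughput deviates more than ±30% from the baseline
// for the same time of day.
function throughputDeviates(
  currentRpm: number,
  baselineRpm: number,
  tolerance = 0.3
): boolean {
  if (baselineRpm === 0) return currentRpm > 0
  return Math.abs(currentRpm - baselineRpm) / baselineRpm > tolerance
}

throughputDeviates(2800, 2700) // false — within 30% of the noon baseline
throughputDeviates(1100, 2700) // true  — ~59% drop, worth an alert
```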

Metric 5: Slow Endpoint Detection

Most monitoring tools let you set static thresholds: "alert if response time exceeds 2 seconds." This works until you have 30 endpoints with different normal ranges.

Slow endpoint detection means automatically identifying which routes are degrading relative to their own baseline:

| Endpoint | Normal P95 | Current P95 | Change | Status |
|----------|------------|-------------|--------|--------|
| GET /api/users     | 120ms | 125ms  | +4%    | Normal |
| GET /api/products  | 80ms  | 340ms  | +325%  | Degraded |
| POST /api/checkout | 200ms | 210ms  | +5%    | Normal |
| GET /api/search    | 150ms | 4200ms | +2700% | Critical |
A 2-second static threshold would miss /api/products at 340ms (it's under the threshold but 4x its normal speed). And /api/search at 4.2 seconds is obviously broken, but you'd want to know about the products endpoint too.
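Baseline-relative detection is just a comparison against each route's own history instead of a global threshold. A minimal sketch (the 2x multiplier is a reasonable default, not a rule from any particular tool):

```typescript
interface RouteLatency {
  route: string
  baselineP95: number // ms, learned from history
  currentP95: number  // ms, current window
}

// Flag a route when its current P95 exceeds its own baseline by more
// than the given multiplier, regardless of any absolute threshold.
function degradedRoutes(routes: RouteLatency[], multiplier = 2): RouteLatency[] {
  return routes.filter((r) => r.currentP95 > r.baselineP95 * multiplier)
}

degradedRoutes([
  { route: 'GET /api/users',    baselineP95: 120, currentP95: 125 },
  { route: 'GET /api/products', baselineP95: 80,  currentP95: 340 },
  { route: 'GET /api/search',   baselineP95: 150, currentP95: 4200 },
])
// → flags /api/products and /api/search, even though 340ms is
//   well under a 2-second static threshold
```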

Monitoring Tools Compared

Three common approaches for REST API monitoring, depending on your team and budget:

Datadog

  • What it is: Full observability platform — APM, logs, infrastructure, synthetic checks
  • How it works: Agent daemon (300-500MB RAM) + language library (dd-trace)
  • Cost: $71/host/month (APM) + $15/host/month (infrastructure). A team with 3 servers: ~$258/month minimum
  • Setup time: 2-4 hours. 10+ environment variables, YAML config, agent installation
  • Best for: Large teams with dedicated DevOps, running Kubernetes with 50+ services
  • Limitation for Next.js: The Datadog agent can't run on Vercel serverless. You get degraded "agentless" mode with higher latency and sampling

New Relic

  • What it is: Full-stack observability with APM, browser monitoring, and AI ops
  • How it works: Language agent (newrelic npm package) + cloud collector
  • Cost: Free tier (100GB data/month), then $49+/host/month. Data ingestion charges can spike unexpectedly
  • Setup time: 1-2 hours. Simpler than Datadog but still requires config file and multiple env vars
  • Best for: Mid-size teams that want full observability without Datadog's price tag
  • Limitation for Next.js: The Node.js agent adds 200-400ms to cold starts via monkey-patching. Partial serverless support

Nurbak Watch

  • What it is: Lightweight API monitoring SDK built for Next.js
  • How it works: Uses the Next.js instrumentation.ts hook — runs inside your server process, no agent
  • Cost: Free during beta. Pro plan: $29/month (no per-host pricing)
  • Setup time: 5 minutes. One npm install, 5 lines of code, 1 environment variable
  • Best for: Solo developers and small teams (1-15) running Next.js on Vercel or Node.js
  • Tracks: P50/P95/P99 latency, error rates, throughput, cold starts — all per endpoint, automatically

| | Datadog | New Relic | Nurbak Watch |
|---|---------|-----------|--------------|
| Monthly cost (small team) | $258+ | $147+ | $0 (beta) / $29 |
| Setup time | 2-4 hours | 1-2 hours | 5 minutes |
| Lines of code | 50-100+ | 20-50 | 5 |
| Cold start impact | +200-800ms | +200-400ms | +5-15ms |
| Works on Vercel serverless | Partially | Partially | Fully |
| Auto-discovers API routes | Yes (with agent) | Yes (with agent) | Yes (native) |
| Per-endpoint P95/P99 | Yes | Yes | Yes |
| WhatsApp alerts | No | No | Yes |

Why Internal Monitoring Wins for REST APIs

External monitoring (pinging your API from outside) has fundamental blind spots for REST APIs:

  • It samples. A ping every 60 seconds tests 1 request per minute. Your API handles 2,000. That's 0.05% coverage.
  • It tests one endpoint. You have 20 routes. External monitors charge per endpoint, so most teams only monitor 2-3.
  • It can't see error rates. An external ping hits /api/health and gets 200. Meanwhile, /api/payments is returning 500 for 8% of real users.
  • It measures network + server time. A 200ms response from Virginia might be 50ms of server time and 150ms of network. You're optimizing the wrong thing.

Internal monitoring runs inside your server and sees every request. No sampling, no blind spots, real server-side timing. This is the difference between knowing your API is "reachable" and knowing it's actually working well.

Setup: Full REST API Monitoring in 5 Minutes

If you're running Next.js, here's how to go from zero to full monitoring with Nurbak Watch:

Step 1: Install

npm install @nurbak/watch

Step 2: Add instrumentation

// instrumentation.ts
import { initWatch } from '@nurbak/watch'

export function register() {
  initWatch({
    apiKey: process.env.NURBAK_WATCH_KEY,
  })
}

Step 3: Add your API key

# .env.local (or Vercel dashboard)
NURBAK_WATCH_KEY=your_key_here

Step 4: Deploy

Within 60 seconds of your first request, you'll see every API route in the dashboard with:

  • P50, P95, P99 latency per endpoint
  • Error rate (4xx/5xx) per endpoint
  • Throughput (requests per minute)
  • Automatic slow endpoint flagging
  • Real uptime calculated from actual request data

Alerts go to Slack, email, or WhatsApp within 10 seconds of an incident. One message per incident, not one per failed request.

What to Do After Setup

Once monitoring is running, here's the playbook:

  1. Week 1: Observe. Don't set alert thresholds yet. Let the tool establish baselines for each endpoint.
  2. Week 2: Set P95 thresholds per endpoint based on observed baselines (2x the baseline is a good starting point).
  3. Week 3: Set error rate thresholds. 0.5% for critical endpoints (checkout, auth), 2% for everything else.
  4. Ongoing: Review weekly. Look for slow trends — a P95 that increases 10% per week will be a problem in a month even if it's fine today.

Get Started — Free During Beta

Nurbak Watch is in beta and completely free during launch. All 5 metrics covered in this guide — latency percentiles, error rates, throughput, uptime, and slow endpoint detection — tracked automatically for every API route.

One npm install. Five lines of code. Every metric, every endpoint, every request.
