APIs · Performance · Architecture

REST API Rate Limiting: Patterns That Actually Scale

Kiran Mayee · August 15, 2025 · 9 min read

Rate limiting is one of those API features that looks simple from the outside ("just reject requests after X per minute") and reveals significant depth once you implement it. Choose the wrong algorithm and you block legitimate users during burst traffic. Choose the wrong storage layer and your limiter doesn't work under horizontal scaling.

This guide covers the common rate limiting algorithms, when to use each, and how to implement them in a way that survives production traffic.

Why Rate Limiting Matters

  • Cost protection — a misconfigured client or a malicious actor can exhaust database connections, third-party API credits, or worker processes in minutes.
  • Fairness — without limits, one power user can degrade the experience for everyone else.
  • Security — rate limits are the first line of defence against credential stuffing, enumeration, and scraping attacks.

Algorithm 1: Fixed Window

The simplest algorithm: count requests in a fixed time window (e.g., 100 requests per 60-second window). When the counter hits 100, reject until the window resets.

Problem: A client can send 100 requests at second 59 of one window, then 100 more at second 61 — just after the reset — for 200 requests in two seconds without technically violating the limit. This is the window-boundary burst problem.

Algorithm 2: Sliding Window Log

Store a timestamp for every request. When a new request arrives, discard timestamps older than the window, count remaining entries, and allow if under the limit. Accurate but memory-intensive — each request stores a timestamp.

Algorithm 3: Sliding Window Counter (Recommended)

The best practical choice. Combine the current window's count and a weighted portion of the previous window's count:

rate = prev_window_count × (1 - elapsed_in_current_window/window_size)
     + current_window_count

This approximates a true sliding window with only two stored counters. Accurate, lightweight, and race-condition-safe when implemented in Redis with atomic operations.
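The weighted formula translates directly into a small single-process sketch (Python; names are illustrative — the Redis version would keep the two counters in the shared store instead):

```python
class SlidingWindowCounter:
    """Approximates a sliding window with two counters per key."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        # key -> (window_id, current_count, previous_count)
        self.state: dict[str, tuple[int, int, int]] = {}

    def allow(self, key: str, now: float) -> bool:
        window_id = int(now // self.window)
        wid, current, previous = self.state.get(key, (window_id, 0, 0))
        if window_id == wid + 1:        # rolled into the next window
            wid, current, previous = window_id, 0, current
        elif window_id > wid + 1:       # idle for more than a full window
            wid, current, previous = window_id, 0, 0
        elapsed = now - wid * self.window
        weight = 1 - elapsed / self.window
        estimated = previous * weight + current
        if estimated >= self.limit:
            return False
        self.state[key] = (wid, current + 1, previous)
        return True
```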

Algorithm 4: Token Bucket

Tokens accumulate at a fixed rate (e.g., 10 tokens/second) up to a maximum bucket capacity. Each request consumes one token. Empty bucket = rejected request. This naturally handles burst traffic — if a user hasn't made requests for 10 seconds, they've banked 100 tokens for a quick burst.

-- Redis Lua script for token bucket (atomic)
-- KEYS[1] = token-count key, KEYS[2] = last-refill-timestamp key
-- ARGV[1] = now (seconds), ARGV[2] = refill rate (tokens/sec), ARGV[3] = capacity
local now = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local capacity = tonumber(ARGV[3])

-- An unseen bucket starts full, with its refill clock set to now
local tokens = tonumber(redis.call("GET", KEYS[1])) or capacity
local last_refill = tonumber(redis.call("GET", KEYS[2])) or now

local new_tokens = math.min(capacity, tokens + (now - last_refill) * refill_rate)
if new_tokens >= 1 then
  redis.call("SET", KEYS[1], new_tokens - 1)
  redis.call("SET", KEYS[2], now)
  return 1  -- allowed
else
  return 0  -- rejected; state is untouched, so tokens keep accruing from last_refill
end

Algorithm 5: Leaky Bucket

Requests enter a queue (the bucket). They're processed at a fixed rate and overflow if the bucket is full. Unlike token bucket, leaky bucket enforces a perfectly constant outflow rate — useful for protecting a downstream service that can't handle bursts even for a moment.
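A sketch of the "meter" variant of leaky bucket (Python; names are illustrative). Instead of an actual queue, it tracks a water level that drains at the fixed rate:

```python
class LeakyBucket:
    """Meter variant: the bucket drains at a fixed rate; overflow is rejected."""

    def __init__(self, capacity: float, leak_rate: float):
        self.capacity = capacity    # maximum queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.level = 0.0
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Drain whatever leaked out since the last request
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 > self.capacity:
            return False            # bucket would overflow
        self.level += 1
        return True
```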

What to Rate Limit On

  • IP address — cheap to implement but easily bypassed with rotating proxies. Useful as a fallback layer for unauthenticated traffic.
  • API key / user ID — the right granularity for authenticated APIs. Ties limits to accountable identities.
  • Endpoint — expensive endpoints (AI generation, report exports) deserve tighter limits than cheap reads.
  • Composite — combine user + endpoint: 1000 reads/hour but only 100 writes/hour per user.
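Composite limits usually come down to how you build the counter key in the shared store. A sketch of one possible key layout (the `ratelimit:` prefix and field order are illustrative, not a convention from any particular library):

```python
def rate_limit_key(user_id: str, endpoint: str, window_id: int) -> str:
    """Compose a per-user, per-endpoint counter key for the shared store."""
    return f"ratelimit:{user_id}:{endpoint}:{window_id}"

# Separate budgets for reads and writes by the same user:
read_key = rate_limit_key("user_42", "GET:/articles", 911234)
write_key = rate_limit_key("user_42", "POST:/articles", 911234)
```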

Returning Useful Headers

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1724025600
Retry-After: 47
Content-Type: application/json

{"error": "rate_limit_exceeded", "message": "Try again in 47 seconds"}

Well-formed rate limit responses let API clients implement automatic backoff correctly instead of hammering your endpoint in a retry loop.
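On the client side, honouring those headers is a few lines. A sketch using only the Python standard library (`retry_delay` and `get_with_backoff` are illustrative names, not from any framework):

```python
import time
import urllib.request
from urllib.error import HTTPError

def retry_delay(headers, attempt: int) -> float:
    """Prefer the server's Retry-After hint; otherwise back off exponentially."""
    value = headers.get("Retry-After")
    return float(value) if value is not None else float(2 ** attempt)

def get_with_backoff(url: str, max_attempts: int = 5) -> bytes:
    """Fetch a URL, sleeping and retrying whenever the server returns 429."""
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except HTTPError as err:
            if err.code != 429:
                raise
            time.sleep(retry_delay(err.headers, attempt))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```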

Distributed Rate Limiting

A counter stored in one server's memory is useless when you have ten servers behind a load balancer. Use Redis (or any shared store) with Lua scripts or WATCH/MULTI transactions to make counter increments atomic across processes. For Cloudflare Workers or edge functions, Durable Objects provide the same guarantee at the edge.

Testing Your Rate Limiter

Use moqapi.dev's chaos testing feature to simulate 429 responses from upstream APIs while testing your client's retry logic. Configure error code 429 at 20% injection rate and verify that your retry handler correctly reads Retry-After headers and backs off appropriately.


About the Author

Kiran Mayee

Founder and sole developer of moqapi.dev. Full-stack engineer with deep experience in API platforms, serverless runtimes, and developer tooling. Built moqapi to solve the mock data and deployment friction she experienced firsthand building production APIs.

Ready to build?

Start deploying serverless functions in under a minute.

Get Started Free