
Rate Limiting: Token Bucket, Leaky Bucket, Sliding Window, and Fixed Counter

Compare the four main rate limiting algorithms -- Token Bucket, Leaky Bucket, Sliding Window Log, and Fixed Window Counter. Includes Redis Lua implementations, nginx configuration, distributed rate limiting patterns, and managed service pricing.

Abhishek Patel · 12 min read


Why Every Production API Needs Rate Limiting

Your API is live, traffic is growing, and then one client starts hammering your endpoint at 10,000 requests per second. Without rate limiting, that single client degrades the experience for every other user. Rate limiting isn't optional for production APIs -- it's the difference between a graceful "slow down" response and a crashed server. The question isn't whether to implement it, but which algorithm fits your traffic pattern.

Four algorithms dominate production rate limiting: Fixed Window Counter, Sliding Window Log, Leaky Bucket, and Token Bucket. Each makes different trade-offs between memory usage, accuracy, and burst tolerance. I'll walk through each one, show you real implementations in Redis and nginx, and explain when distributed rate limiting becomes necessary.

What Is Rate Limiting?

Definition: Rate limiting is a technique that controls the number of requests a client can make to an API within a specified time window. It protects backend services from overload, ensures fair resource distribution among clients, and defends against abuse, denial-of-service attacks, and runaway automation scripts.

Fixed Window Counter

The simplest algorithm. Divide time into fixed windows (e.g., 1-minute intervals), count requests in the current window, and reject requests once the count exceeds the limit.

How It Works

  1. Define a window size (e.g., 60 seconds) and a request limit (e.g., 100 requests).
  2. When a request arrives, determine the current window based on the timestamp (e.g., floor to the nearest minute).
  3. Increment the counter for that window.
  4. If the counter exceeds the limit, reject the request with HTTP 429.
  5. When a new window starts, the counter resets to zero.
# Fixed Window Counter -- Python + Redis
import redis
import time

r = redis.Redis()

def is_allowed(client_id: str, limit: int = 100, window: int = 60) -> bool:
    current_window = int(time.time() // window)
    key = f"rate:{client_id}:{current_window}"

    current = r.incr(key)
    if current == 1:
        r.expire(key, window)

    return current <= limit

Pros: Simple, low memory (one counter per client per window), O(1) operations.

Cons: The boundary problem. A client can send 100 requests at 0:59 and 100 more at 1:01, effectively doubling the rate at the window boundary. This burst at the boundary is the algorithm's biggest weakness.
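The boundary burst is easy to reproduce with a stripped-down, in-memory version of the same counting logic (a hypothetical FixedWindow class for illustration; single process, no Redis):

```python
class FixedWindow:
    """In-memory fixed window counter, enough to show the boundary burst."""

    def __init__(self, limit: int = 100, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.counts = {}  # window index -> request count

    def is_allowed(self, now: float) -> bool:
        w = int(now // self.window)
        count = self.counts.get(w, 0)
        if count >= self.limit:
            return False
        self.counts[w] = count + 1
        return True

fw = FixedWindow(limit=100, window=60.0)
# 100 requests at t=59.9 land in window 0 and are all accepted...
assert all(fw.is_allowed(59.9) for _ in range(100))
# ...and 100 more at t=60.1 land in window 1: 200 accepted in 0.2 seconds.
assert all(fw.is_allowed(60.1) for _ in range(100))
# Only the 201st request in window 1 is rejected.
assert not fw.is_allowed(60.2)
```

The sustained rate is still 100 per minute, but an attacker who times requests around the boundary gets double that for a brief burst.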

Sliding Window Log

Fixes the boundary problem by tracking the timestamp of every request. Instead of counting within fixed intervals, you look at a rolling window of time.

How It Works

  1. When a request arrives, record its timestamp in a sorted set.
  2. Remove all entries older than the window size.
  3. Count remaining entries.
  4. If the count exceeds the limit, reject the request.
# Sliding Window Log -- Python + Redis
import redis
import time
import uuid

r = redis.Redis()

def is_allowed(client_id: str, limit: int = 100, window: int = 60) -> bool:
    now = time.time()
    key = f"rate:{client_id}"

    # Members must be unique: two requests with the same timestamp would
    # otherwise collapse into one sorted-set entry and undercount.
    member = f"{now}:{uuid.uuid4()}"

    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window)  # Remove old entries
    pipe.zadd(key, {member: now})                # Add current request
    pipe.zcard(key)                              # Count entries
    pipe.expire(key, window)                     # TTL cleanup
    results = pipe.execute()

    count = results[2]
    return count <= limit

Pros: No boundary problem -- the window slides smoothly. Perfectly accurate rate counting.

Cons: Memory-heavy. Every request stores a timestamp. At 1,000 requests per second per client, that's 60,000 entries in the sorted set for a 60-second window. Not viable for high-throughput APIs.

Pro tip: The Sliding Window Counter is a hybrid that combines Fixed Window and Sliding Window Log. It uses counters from the current and previous windows, weighted by how far into the current window you are. This gives you near-perfect accuracy with the memory efficiency of fixed windows. It's what most production rate limiters actually use.
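As a rough illustration, here's a minimal in-memory sketch of that hybrid (a hypothetical SlidingWindowCounter class; a production version would keep the two counters in a Redis hash and update them atomically):

```python
class SlidingWindowCounter:
    """Sliding window counter hybrid: fixed window counters, with the
    previous window weighted by how much of it still overlaps the
    rolling window."""

    def __init__(self, limit: int = 100, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.counts = {}  # window index -> request count

    def is_allowed(self, now: float) -> bool:
        current = int(now // self.window)
        fraction_elapsed = (now % self.window) / self.window

        prev_count = self.counts.get(current - 1, 0)
        curr_count = self.counts.get(current, 0)

        # The previous window contributes the share of it that still
        # falls inside the rolling window.
        estimated = prev_count * (1 - fraction_elapsed) + curr_count
        if estimated + 1 > self.limit:
            return False

        self.counts[current] = curr_count + 1
        # Windows older than the previous one no longer matter.
        self.counts = {w: c for w, c in self.counts.items() if w >= current - 1}
        return True
```

For example, with a limit of 100 and 80 requests in the previous window, a request arriving 15 seconds into the current 60-second window is checked against an estimate of 80 × 45/60 = 60 plus the current window's count.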

Leaky Bucket

Models rate limiting as a bucket that leaks at a constant rate. Requests fill the bucket; if it overflows, requests are rejected. The leak rate determines your sustained throughput.

How It Works

  1. Define a bucket capacity (maximum burst size) and a leak rate (requests per second).
  2. When a request arrives, calculate how much the bucket has drained since the last request.
  3. Subtract the drained amount from the current water level.
  4. If adding the new request would overflow the bucket, reject it.
  5. Otherwise, add the request to the bucket.
# Leaky Bucket -- Python + Redis with Lua script
import redis
import time

r = redis.Redis()

LEAKY_BUCKET_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local leak_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call('hmget', key, 'water', 'last_leak')
local water = tonumber(bucket[1]) or 0
local last_leak = tonumber(bucket[2]) or now

-- Calculate leaked amount
local elapsed = now - last_leak
local leaked = elapsed * leak_rate
water = math.max(0, water - leaked)

-- Check if request fits
if water + 1 > capacity then
    redis.call('hmset', key, 'water', water, 'last_leak', now)
    redis.call('expire', key, math.ceil(capacity / leak_rate) + 1)
    return 0  -- rejected
end

-- Accept request
water = water + 1
redis.call('hmset', key, 'water', water, 'last_leak', now)
redis.call('expire', key, math.ceil(capacity / leak_rate) + 1)
return 1  -- allowed
"""

leaky_script = r.register_script(LEAKY_BUCKET_SCRIPT)

def is_allowed(client_id: str, capacity: int = 10, leak_rate: float = 1.0) -> bool:
    key = f"rate:{client_id}"
    result = leaky_script(keys=[key], args=[capacity, leak_rate, time.time()])
    return bool(result)

Pros: Smooths out bursts into a constant output rate. Prevents the thundering herd problem. Memory-efficient -- two values per client.

Cons: Not great when you want to allow legitimate bursts. A client that's been idle can't burst even though they haven't used their quota.

Token Bucket

The most widely used rate limiting algorithm. Tokens are added to a bucket at a fixed rate. Each request consumes a token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, which determines the maximum burst size.

How It Works

  1. Define a bucket capacity (max tokens) and a refill rate (tokens per second).
  2. When a request arrives, calculate how many tokens have been added since the last request.
  3. Add those tokens to the bucket (up to the capacity).
  4. If there's at least one token, consume it and allow the request.
  5. If the bucket is empty, reject the request.
# Token Bucket -- Python + Redis with Lua script
import redis
import time

r = redis.Redis()

TOKEN_BUCKET_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call('hmget', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens
local elapsed = now - last_refill
local new_tokens = elapsed * refill_rate
tokens = math.min(capacity, tokens + new_tokens)

-- Check if request can proceed
if tokens < 1 then
    redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
    redis.call('expire', key, math.ceil(capacity / refill_rate) + 1)
    return 0  -- rejected
end

-- Consume token
tokens = tokens - 1
redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
redis.call('expire', key, math.ceil(capacity / refill_rate) + 1)
return 1  -- allowed
"""

token_script = r.register_script(TOKEN_BUCKET_SCRIPT)

def is_allowed(client_id: str, capacity: int = 100, refill_rate: float = 10.0) -> bool:
    key = f"rate:{client_id}"
    result = token_script(keys=[key], args=[capacity, refill_rate, time.time()])
    return bool(result)

Pros: Allows bursts up to the bucket capacity while maintaining an average rate. Memory-efficient. The most intuitive model for API rate limits.

Cons: A client can consume their entire burst allowance instantly and then face a drought. You need to tune both capacity and refill rate.
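The burst-then-drought behavior is easy to see in a single-process sketch that mirrors the refill logic of the Lua script above (the explicit `now` parameter is illustrative, for deterministic testing):

```python
class TokenBucket:
    """In-memory token bucket mirroring the refill-then-consume logic."""

    def __init__(self, capacity: int = 100, refill_rate: float = 10.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # bucket starts full
        self.last_refill = None

    def is_allowed(self, now: float) -> bool:
        if self.last_refill is None:
            self.last_refill = now
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens < 1:
            return False
        self.tokens -= 1
        return True

bucket = TokenBucket(capacity=10, refill_rate=1.0)
# An idle client can burst through the full capacity at once...
assert all(bucket.is_allowed(0.0) for _ in range(10))
# ...then faces a drought until tokens drip back in.
assert not bucket.is_allowed(0.0)
# Five seconds later, five tokens have been refilled.
assert all(bucket.is_allowed(5.0) for _ in range(5))
assert not bucket.is_allowed(5.0)
```

Tuning is a two-knob problem: capacity sets the worst-case burst your backend must absorb, refill rate sets the sustained throughput.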

Algorithm Comparison

| Algorithm | Memory Per Client | Burst Handling | Boundary Problem | Complexity | Best For |
|---|---|---|---|---|---|
| Fixed Window | O(1) -- one counter | Allows 2x burst at boundary | Yes | Low | Simple rate limits, internal APIs |
| Sliding Window Log | O(N) -- one entry per request | No bursts beyond limit | No | Medium | Low-volume, accuracy-critical |
| Leaky Bucket | O(1) -- two values | Smooths all bursts | No | Medium | Traffic shaping, constant output |
| Token Bucket | O(1) -- two values | Allows controlled bursts | No | Medium | API rate limiting, most use cases |

Watch out: Any Redis-based rate limiter that reads state, checks it, and then writes it back -- like the bucket algorithms above -- must do all three steps in a Lua script. If you split the read-check-write into separate Redis commands, concurrent requests race: two requests both read the counter, both see room under the limit, and both write. A Lua script executes atomically in Redis -- no other command can interleave. (Single-command designs like the fixed window's INCR are already atomic on their own.)

nginx Rate Limiting with limit_req

nginx has built-in rate limiting using the leaky bucket algorithm. It's the fastest option for edge rate limiting because it operates at the reverse proxy layer before your application code runs.

# nginx.conf -- rate limiting configuration
http {
    # Define a rate limit zone
    # $binary_remote_addr = client IP (compact binary form)
    # zone=api:10m = 10MB shared memory zone named "api"
    # rate=10r/s = 10 requests per second
    limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

    # Custom 429 response
    limit_req_status 429;

    server {
        listen 80;

        location /api/ {
            # burst=20 allows 20 requests to queue beyond the rate
            # nodelay processes queued requests immediately
            limit_req zone=api burst=20 nodelay;

            proxy_pass http://backend;
        }
    }
}

The burst parameter is the bucket capacity. Without nodelay, excess requests are queued and processed at the leak rate. With nodelay, queued requests are processed immediately but still count against the burst limit.

Per-API-Key Rate Limiting in nginx

# Rate limit by API key instead of IP
map $http_x_api_key $api_key {
    default $http_x_api_key;
    ""      $binary_remote_addr;
}

limit_req_zone $api_key zone=api_key:10m rate=100r/s;

location /api/ {
    limit_req zone=api_key burst=50 nodelay;
    proxy_pass http://backend;
}

Distributed Rate Limiting

Single-node rate limiting breaks when you have multiple API servers behind a load balancer. A client with a 100 req/s limit can send 100 req/s to each of your five servers, effectively getting 500 req/s.

Solutions for Distributed Rate Limiting

  1. Centralized Redis -- all servers check the same Redis instance. This is the most common approach. The Lua scripts above work unchanged in a distributed setup.
  2. API Gateway -- offload rate limiting to your API gateway (Kong, AWS API Gateway, Envoy). The gateway is a single choke point, so rate limits are inherently global.
  3. Local + Global hybrid -- each server enforces a local rate limit (e.g., limit / num_servers) and periodically syncs with a global counter. Reduces Redis round trips at the cost of some inaccuracy.
  4. Consistent hashing -- route all requests from a given client to the same server. The rate limit is then local. This breaks if servers are added or removed.
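As a sketch of option 3's local half (the periodic sync with the shared global counter is stubbed out here; class and method names are illustrative):

```python
class LocalShareLimiter:
    """Each server enforces its share of the global limit per fixed
    window; a background job would periodically reconcile local counts
    with a shared global counter."""

    def __init__(self, global_limit: int, num_servers: int, window: float = 60.0):
        self.local_limit = max(1, global_limit // num_servers)
        self.window = window
        self.counts = {}  # (client_id, window index) -> count

    def is_allowed(self, client_id: str, now: float) -> bool:
        key = (client_id, int(now // self.window))
        count = self.counts.get(key, 0)
        if count >= self.local_limit:
            return False
        self.counts[key] = count + 1
        return True

    def sync_with_global(self):
        # Stub: push local counts to Redis and pull fleet totals here.
        pass
```

With a 100 req/s limit split across 5 servers, each server admits 20 req/s locally. The approximation only holds if the load balancer spreads a client's traffic evenly -- which is exactly the inaccuracy the hybrid trades for fewer Redis round trips.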

Rate Limiting Headers and HTTP 429

Good rate limiting is transparent. Clients should know their limits, current usage, and when they can retry. Use these standard headers:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1712534400

{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. Try again in 30 seconds.",
    "retryAfter": 30
  }
}
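A response like this can be assembled in a framework-agnostic way. Here's a sketch (the function name and signature are illustrative, not from any particular library):

```python
import json
import math
import time

def rate_limited_response(limit: int, reset_epoch: float, now: float = None):
    """Build a (status, headers, body) triple for a 429 response with
    the standard rate limit headers."""
    now = time.time() if now is None else now
    retry_after = max(0, math.ceil(reset_epoch - now))
    headers = {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    body = json.dumps({
        "error": {
            "code": "RATE_LIMIT_EXCEEDED",
            "message": f"Rate limit exceeded. Try again in {retry_after} seconds.",
            "retryAfter": retry_after,
        }
    })
    return 429, headers, body
```

Whatever framework you use, derive Retry-After from the same state the limiter itself holds (next refill time, window reset) so the header never disagrees with the limiter's actual behavior.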

Pro tip: Always include the Retry-After header in 429 responses. Well-behaved clients use it to back off automatically. Without it, clients often retry immediately in a tight loop, making the overload worse. The IETF draft RateLimit-* headers are becoming the standard -- adopt them early.

Managed Rate Limiting Services

If you don't want to build rate limiting yourself, several managed services handle it at the edge:

| Service | Rate Limiting Approach | Estimated Cost | Best For |
|---|---|---|---|
| AWS API Gateway | Token bucket per API key/stage | Included in API Gateway pricing | AWS-native APIs |
| Cloudflare Rate Limiting | Fixed window + IP/header rules | $0.05 per 10K good requests | Edge-level protection |
| Kong Gateway | Fixed window, sliding window, Redis-backed | $150+/month (Cloud) | Multi-protocol gateways |
| Redis Enterprise | Your algorithm + Redis infrastructure | $65+/month | Custom rate limiting logic |
| Upstash Redis | Serverless Redis for rate limiting | Pay per command ($0.2/100K) | Serverless, low-volume APIs |

Frequently Asked Questions

Which rate limiting algorithm should I use?

Token Bucket for most API rate limiting -- it allows controlled bursts while maintaining an average rate. Leaky Bucket if you need strict traffic shaping with constant output. Fixed Window Counter if simplicity matters more than accuracy. Sliding Window Log only for low-volume, accuracy-critical scenarios.

What is the boundary problem in fixed window rate limiting?

A client can make the maximum number of requests at the end of one window and the same number at the start of the next, effectively doubling their rate for a brief period at the boundary. The Sliding Window Counter solves this by weighting requests from the previous window based on overlap.

Why use Redis Lua scripts for rate limiting?

Redis Lua scripts execute atomically -- no other commands can interleave during execution. Without atomic operations, concurrent requests create race conditions where two requests both read the counter, both see room under the limit, and both increment. Lua scripts eliminate this class of bugs entirely.

How does nginx rate limiting work?

nginx uses the Leaky Bucket algorithm via the limit_req directive. You define a rate (requests per second) and a burst size. Requests within the rate pass through immediately. Burst requests are queued unless nodelay is specified. Requests beyond the burst are rejected with 503 (or a custom status code).

What HTTP status code should I return for rate-limited requests?

HTTP 429 (Too Many Requests) is the correct status code, defined in RFC 6585. Always include a Retry-After header indicating when the client can retry. Some APIs also include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers so clients can self-regulate.

How do I rate limit by API key instead of IP address?

Use the API key as the rate limiting key instead of the client IP. In Redis-based implementations, replace the IP in the key string with the API key. In nginx, use a map directive to extract the key from a header. This lets you assign different rate limits to different API key tiers.

What is the difference between rate limiting and throttling?

Rate limiting rejects requests that exceed a defined threshold. Throttling slows down (queues or delays) requests rather than rejecting them. The Leaky Bucket algorithm naturally throttles by queuing excess requests. In practice, the terms are often used interchangeably, but the distinction matters when designing client-facing behavior.

Start Simple, Then Tune

Don't over-engineer your first rate limiter. A Fixed Window Counter in Redis handles most cases and takes 15 minutes to implement. Once you see traffic patterns that expose its boundary problem, upgrade to a Sliding Window Counter or Token Bucket. Instrument your rate limiting with metrics -- track rejection rates, which clients hit limits, and whether your limits are too tight or too loose. The best rate limiter is the one you can adjust based on real data.

Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
