
Rate Limiting: Token Bucket, Leaky Bucket, Sliding Window, and Fixed Counter

Compare the four main rate limiting algorithms -- Token Bucket, Leaky Bucket, Sliding Window Log, and Fixed Window Counter. Includes Redis Lua implementations, nginx configuration, distributed rate limiting patterns, and managed service pricing.

Abhishek Patel · 12 min read


Why Every Production API Needs Rate Limiting

Your API is live, traffic is growing, and then one client starts hammering your endpoint at 10,000 requests per second. Without rate limiting, that single client degrades the experience for every other user. Rate limiting isn't optional for production APIs -- it's the difference between a graceful "slow down" response and a crashed server. The question isn't whether to implement it, but which algorithm fits your traffic pattern.

Four algorithms dominate production rate limiting: Fixed Window Counter, Sliding Window Log, Leaky Bucket, and Token Bucket. Each makes different trade-offs between memory usage, accuracy, and burst tolerance. I'll walk through each one, show you real implementations in Redis and nginx, and explain when distributed rate limiting becomes necessary.

What Is Rate Limiting?

Definition: Rate limiting is a technique that controls the number of requests a client can make to an API within a specified time window. It protects backend services from overload, ensures fair resource distribution among clients, and defends against abuse, denial-of-service attacks, and runaway automation scripts.

Fixed Window Counter

The simplest algorithm. Divide time into fixed windows (e.g., 1-minute intervals), count requests in the current window, and reject requests once the count exceeds the limit.

How It Works

  1. Define a window size (e.g., 60 seconds) and a request limit (e.g., 100 requests).
  2. When a request arrives, determine the current window based on the timestamp (e.g., floor to the nearest minute).
  3. Increment the counter for that window.
  4. If the counter exceeds the limit, reject the request with HTTP 429.
  5. When a new window starts, the counter resets to zero.
# Fixed Window Counter -- Python + Redis
import redis
import time

r = redis.Redis()

def is_allowed(client_id: str, limit: int = 100, window: int = 60) -> bool:
    current_window = int(time.time() // window)
    key = f"rate:{client_id}:{current_window}"

    current = r.incr(key)
    if current == 1:
        r.expire(key, window)

    return current <= limit

Pros: Simple, low memory (one counter per client per window), O(1) operations.

Cons: The boundary problem. A client can send 100 requests at 0:59 and 100 more at 1:01, effectively doubling the rate at the window boundary. This burst at the boundary is the algorithm's biggest weakness.
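The boundary burst is easy to reproduce with a stripped-down, in-memory version of the same counting logic (a hypothetical FixedWindow class for illustration; single process, no Redis):

```python
class FixedWindow:
    """In-memory fixed window counter, enough to show the boundary burst."""

    def __init__(self, limit: int = 100, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.counts = {}  # window index -> request count

    def is_allowed(self, now: float) -> bool:
        w = int(now // self.window)
        count = self.counts.get(w, 0)
        if count >= self.limit:
            return False
        self.counts[w] = count + 1
        return True

fw = FixedWindow(limit=100, window=60.0)
# 100 requests at t=59.9 land in window 0 and are all accepted...
assert all(fw.is_allowed(59.9) for _ in range(100))
# ...and 100 more at t=60.1 land in window 1: 200 accepted in 0.2 seconds.
assert all(fw.is_allowed(60.1) for _ in range(100))
# Only the 201st request in window 1 is rejected.
assert not fw.is_allowed(60.2)
```

The sustained rate is still 100 per minute, but an attacker who times requests around the boundary gets double that for a brief burst.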

Sliding Window Log

Fixes the boundary problem by tracking the timestamp of every request. Instead of counting within fixed intervals, you look at a rolling window of time.

How It Works

  1. When a request arrives, record its timestamp in a sorted set.
  2. Remove all entries older than the window size.
  3. Count remaining entries.
  4. If the count exceeds the limit, reject the request.
# Sliding Window Log -- Python + Redis
import redis
import time
import uuid

r = redis.Redis()

def is_allowed(client_id: str, limit: int = 100, window: int = 60) -> bool:
    now = time.time()
    key = f"rate:{client_id}"

    # Members must be unique: two requests with the same timestamp would
    # otherwise collapse into one sorted-set entry and undercount.
    member = f"{now}:{uuid.uuid4()}"

    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window)  # Remove old entries
    pipe.zadd(key, {member: now})                # Add current request
    pipe.zcard(key)                              # Count entries
    pipe.expire(key, window)                     # TTL cleanup
    results = pipe.execute()

    count = results[2]
    return count <= limit

Pros: No boundary problem -- the window slides smoothly. Perfectly accurate rate counting.

Cons: Memory-heavy. Every request stores a timestamp. At 1,000 requests per second per client, that's 60,000 entries in the sorted set for a 60-second window. Not viable for high-throughput APIs.

Pro tip: The Sliding Window Counter is a hybrid that combines Fixed Window and Sliding Window Log. It uses counters from the current and previous windows, weighted by how far into the current window you are. This gives you near-perfect accuracy with the memory efficiency of fixed windows. It's what most production rate limiters actually use.
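As a rough illustration, here's a minimal in-memory sketch of that hybrid (a hypothetical SlidingWindowCounter class; a production version would keep the two counters in a Redis hash and update them atomically):

```python
class SlidingWindowCounter:
    """Sliding window counter hybrid: fixed window counters, with the
    previous window weighted by how much of it still overlaps the
    rolling window."""

    def __init__(self, limit: int = 100, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.counts = {}  # window index -> request count

    def is_allowed(self, now: float) -> bool:
        current = int(now // self.window)
        fraction_elapsed = (now % self.window) / self.window

        prev_count = self.counts.get(current - 1, 0)
        curr_count = self.counts.get(current, 0)

        # The previous window contributes the share of it that still
        # falls inside the rolling window.
        estimated = prev_count * (1 - fraction_elapsed) + curr_count
        if estimated + 1 > self.limit:
            return False

        self.counts[current] = curr_count + 1
        # Windows older than the previous one no longer matter.
        self.counts = {w: c for w, c in self.counts.items() if w >= current - 1}
        return True
```

For example, with a limit of 100 and 80 requests in the previous window, a request arriving 15 seconds into the current 60-second window is checked against an estimate of 80 × 45/60 = 60 plus the current window's count.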

Leaky Bucket

Models rate limiting as a bucket that leaks at a constant rate. Requests fill the bucket; if it overflows, requests are rejected. The leak rate determines your sustained throughput.

How It Works

  1. Define a bucket capacity (maximum burst size) and a leak rate (requests per second).
  2. When a request arrives, calculate how much the bucket has drained since the last request.
  3. Subtract the drained amount from the current water level.
  4. If adding the new request would overflow the bucket, reject it.
  5. Otherwise, add the request to the bucket.
# Leaky Bucket -- Python + Redis with Lua script
import redis
import time

r = redis.Redis()

LEAKY_BUCKET_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local leak_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call('hmget', key, 'water', 'last_leak')
local water = tonumber(bucket[1]) or 0
local last_leak = tonumber(bucket[2]) or now

-- Calculate leaked amount
local elapsed = now - last_leak
local leaked = elapsed * leak_rate
water = math.max(0, water - leaked)

-- Check if request fits
if water + 1 > capacity then
    redis.call('hmset', key, 'water', water, 'last_leak', now)
    redis.call('expire', key, math.ceil(capacity / leak_rate) + 1)
    return 0  -- rejected
end

-- Accept request
water = water + 1
redis.call('hmset', key, 'water', water, 'last_leak', now)
redis.call('expire', key, math.ceil(capacity / leak_rate) + 1)
return 1  -- allowed
"""

leaky_script = r.register_script(LEAKY_BUCKET_SCRIPT)

def is_allowed(client_id: str, capacity: int = 10, leak_rate: float = 1.0) -> bool:
    key = f"rate:{client_id}"
    result = leaky_script(keys=[key], args=[capacity, leak_rate, time.time()])
    return bool(result)

Pros: Smooths out bursts into a constant output rate. Prevents the thundering herd problem. Memory-efficient -- two values per client.

Cons: Not great when you want to allow legitimate bursts. A client that's been idle can't burst even though they haven't used their quota.

Token Bucket

The most widely used rate limiting algorithm. Tokens are added to a bucket at a fixed rate. Each request consumes a token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, which determines the maximum burst size.

How It Works

  1. Define a bucket capacity (max tokens) and a refill rate (tokens per second).
  2. When a request arrives, calculate how many tokens have been added since the last request.
  3. Add those tokens to the bucket (up to the capacity).
  4. If there's at least one token, consume it and allow the request.
  5. If the bucket is empty, reject the request.
# Token Bucket -- Python + Redis with Lua script
import redis
import time

r = redis.Redis()

TOKEN_BUCKET_SCRIPT = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local bucket = redis.call('hmget', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens
local elapsed = now - last_refill
local new_tokens = elapsed * refill_rate
tokens = math.min(capacity, tokens + new_tokens)

-- Check if request can proceed
if tokens < 1 then
    redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
    redis.call('expire', key, math.ceil(capacity / refill_rate) + 1)
    return 0  -- rejected
end

-- Consume token
tokens = tokens - 1
redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
redis.call('expire', key, math.ceil(capacity / refill_rate) + 1)
return 1  -- allowed
"""

token_script = r.register_script(TOKEN_BUCKET_SCRIPT)

def is_allowed(client_id: str, capacity: int = 100, refill_rate: float = 10.0) -> bool:
    key = f"rate:{client_id}"
    result = token_script(keys=[key], args=[capacity, refill_rate, time.time()])
    return bool(result)

Pros: Allows bursts up to the bucket capacity while maintaining an average rate. Memory-efficient. The most intuitive model for API rate limits.

Cons: A client can consume their entire burst allowance instantly and then face a drought. You need to tune both capacity and refill rate.
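The burst-then-drought behavior is easy to see in a single-process sketch that mirrors the refill logic of the Lua script above (the explicit `now` parameter is illustrative, for deterministic testing):

```python
class TokenBucket:
    """In-memory token bucket mirroring the refill-then-consume logic."""

    def __init__(self, capacity: int = 100, refill_rate: float = 10.0):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # bucket starts full
        self.last_refill = None

    def is_allowed(self, now: float) -> bool:
        if self.last_refill is None:
            self.last_refill = now
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens < 1:
            return False
        self.tokens -= 1
        return True

bucket = TokenBucket(capacity=10, refill_rate=1.0)
# An idle client can burst through the full capacity at once...
assert all(bucket.is_allowed(0.0) for _ in range(10))
# ...then faces a drought until tokens drip back in.
assert not bucket.is_allowed(0.0)
# Five seconds later, five tokens have been refilled.
assert all(bucket.is_allowed(5.0) for _ in range(5))
assert not bucket.is_allowed(5.0)
```

Tuning is a two-knob problem: capacity sets the worst-case burst your backend must absorb, refill rate sets the sustained throughput.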

Algorithm Comparison

| Algorithm | Memory Per Client | Burst Handling | Boundary Problem | Complexity | Best For |
|---|---|---|---|---|---|
| Fixed Window | O(1) -- one counter | Allows 2x burst at boundary | Yes | Low | Simple rate limits, internal APIs |
| Sliding Window Log | O(N) -- one entry per request | No bursts beyond limit | No | Medium | Low-volume, accuracy-critical |
| Leaky Bucket | O(1) -- two values | Smooths all bursts | No | Medium | Traffic shaping, constant output |
| Token Bucket | O(1) -- two values | Allows controlled bursts | No | Medium | API rate limiting, most use cases |

Watch out: Any Redis-based rate limiter that reads state, checks it, and then writes it back -- like the bucket algorithms above -- must do all three steps in a Lua script. If you split the read-check-write into separate Redis commands, concurrent requests race: two requests both read the counter, both see room under the limit, and both write. A Lua script executes atomically in Redis -- no other command can interleave. (Single-command designs like the fixed window's INCR are already atomic on their own.)

nginx Rate Limiting with limit_req

nginx has built-in rate limiting using the leaky bucket algorithm. It's the fastest option for edge rate limiting because it operates at the reverse proxy layer before your application code runs.

# nginx.conf -- rate limiting configuration
http {
    # Define a rate limit zone
    # $binary_remote_addr = client IP (compact binary form)
    # zone=api:10m = 10MB shared memory zone named "api"
    # rate=10r/s = 10 requests per second
    limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

    # Custom 429 response
    limit_req_status 429;

    server {
        listen 80;

        location /api/ {
            # burst=20 allows 20 requests to queue beyond the rate
            # nodelay processes queued requests immediately
            limit_req zone=api burst=20 nodelay;

            proxy_pass http://backend;
        }
    }
}

The burst parameter is the bucket capacity. Without nodelay, excess requests are queued and processed at the leak rate. With nodelay, queued requests are processed immediately but still count against the burst limit.

Per-API-Key Rate Limiting in nginx

# Rate limit by API key instead of IP
map $http_x_api_key $api_key {
    default $http_x_api_key;
    ""      $binary_remote_addr;
}

limit_req_zone $api_key zone=api_key:10m rate=100r/s;

location /api/ {
    limit_req zone=api_key burst=50 nodelay;
    proxy_pass http://backend;
}

Distributed Rate Limiting

Single-node rate limiting breaks when you have multiple API servers behind a load balancer. A client with a 100 req/s limit can send 100 req/s to each of your five servers, effectively getting 500 req/s.

Solutions for Distributed Rate Limiting

  1. Centralized Redis -- all servers check the same Redis instance. This is the most common approach. The Lua scripts above work unchanged in a distributed setup.
  2. API Gateway -- offload rate limiting to your API gateway (Kong, AWS API Gateway, Envoy). The gateway is a single choke point, so rate limits are inherently global.
  3. Local + Global hybrid -- each server enforces a local rate limit (e.g., limit / num_servers) and periodically syncs with a global counter. Reduces Redis round trips at the cost of some inaccuracy.
  4. Consistent hashing -- route all requests from a given client to the same server. The rate limit is then local. This breaks if servers are added or removed.
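As a sketch of option 3's local half (the periodic sync with the shared global counter is stubbed out here; class and method names are illustrative):

```python
class LocalShareLimiter:
    """Each server enforces its share of the global limit per fixed
    window; a background job would periodically reconcile local counts
    with a shared global counter."""

    def __init__(self, global_limit: int, num_servers: int, window: float = 60.0):
        self.local_limit = max(1, global_limit // num_servers)
        self.window = window
        self.counts = {}  # (client_id, window index) -> count

    def is_allowed(self, client_id: str, now: float) -> bool:
        key = (client_id, int(now // self.window))
        count = self.counts.get(key, 0)
        if count >= self.local_limit:
            return False
        self.counts[key] = count + 1
        return True

    def sync_with_global(self):
        # Stub: push local counts to Redis and pull fleet totals here.
        pass
```

With a 100 req/s limit split across 5 servers, each server admits 20 req/s locally. The approximation only holds if the load balancer spreads a client's traffic evenly -- which is exactly the inaccuracy the hybrid trades for fewer Redis round trips.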

Rate Limiting Headers and HTTP 429

Good rate limiting is transparent. Clients should know their limits, current usage, and when they can retry. Use these standard headers:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1712534400

{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. Try again in 30 seconds.",
    "retryAfter": 30
  }
}
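A response like this can be assembled in a framework-agnostic way. Here's a sketch (the function name and signature are illustrative, not from any particular library):

```python
import json
import math
import time

def rate_limited_response(limit: int, reset_epoch: float, now: float = None):
    """Build a (status, headers, body) triple for a 429 response with
    the standard rate limit headers."""
    now = time.time() if now is None else now
    retry_after = max(0, math.ceil(reset_epoch - now))
    headers = {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    body = json.dumps({
        "error": {
            "code": "RATE_LIMIT_EXCEEDED",
            "message": f"Rate limit exceeded. Try again in {retry_after} seconds.",
            "retryAfter": retry_after,
        }
    })
    return 429, headers, body
```

Whatever framework you use, derive Retry-After from the same state the limiter itself holds (next refill time, window reset) so the header never disagrees with the limiter's actual behavior.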

Pro tip: Always include the Retry-After header in 429 responses. Well-behaved clients use it to back off automatically. Without it, clients often retry immediately in a tight loop, making the overload worse. The IETF draft RateLimit-* headers are becoming the standard -- adopt them early.

Managed Rate Limiting Services

If you don't want to build rate limiting yourself, several managed services handle it at the edge:

| Service | Rate Limiting Approach | Estimated Cost | Best For |
|---|---|---|---|
| AWS API Gateway | Token bucket per API key/stage | Included in API Gateway pricing | AWS-native APIs |
| Cloudflare Rate Limiting | Fixed window + IP/header rules | $0.05 per 10K good requests | Edge-level protection |
| Kong Gateway | Fixed window, sliding window, Redis-backed | $150+/month (Cloud) | Multi-protocol gateways |
| Redis Enterprise | Your algorithm + Redis infrastructure | $65+/month | Custom rate limiting logic |
| Upstash Redis | Serverless Redis for rate limiting | Pay per command ($0.2/100K) | Serverless, low-volume APIs |

Frequently Asked Questions

Which rate limiting algorithm should I use?

Token Bucket for most API rate limiting -- it allows controlled bursts while maintaining an average rate. Leaky Bucket if you need strict traffic shaping with constant output. Fixed Window Counter if simplicity matters more than accuracy. Sliding Window Log only for low-volume, accuracy-critical scenarios.

What is the boundary problem in fixed window rate limiting?

A client can make the maximum number of requests at the end of one window and the same number at the start of the next, effectively doubling their rate for a brief period at the boundary. The Sliding Window Counter solves this by weighting requests from the previous window based on overlap.

Why use Redis Lua scripts for rate limiting?

Redis Lua scripts execute atomically -- no other commands can interleave during execution. Without atomic operations, concurrent requests create race conditions where two requests both read the counter, both see room under the limit, and both increment. Lua scripts eliminate this class of bugs entirely.

How does nginx rate limiting work?

nginx uses the Leaky Bucket algorithm via the limit_req directive. You define a rate (requests per second) and a burst size. Requests within the rate pass through immediately. Burst requests are queued unless nodelay is specified. Requests beyond the burst are rejected with 503 (or a custom status code).

What HTTP status code should I return for rate-limited requests?

HTTP 429 (Too Many Requests) is the correct status code, defined in RFC 6585. Always include a Retry-After header indicating when the client can retry. Some APIs also include X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers so clients can self-regulate.

How do I rate limit by API key instead of IP address?

Use the API key as the rate limiting key instead of the client IP. In Redis-based implementations, replace the IP in the key string with the API key. In nginx, use a map directive to extract the key from a header. This lets you assign different rate limits to different API key tiers.

What is the difference between rate limiting and throttling?

Rate limiting rejects requests that exceed a defined threshold. Throttling slows down (queues or delays) requests rather than rejecting them. The Leaky Bucket algorithm naturally throttles by queuing excess requests. In practice, the terms are often used interchangeably, but the distinction matters when designing client-facing behavior.

Start Simple, Then Tune

Don't over-engineer your first rate limiter. A Fixed Window Counter in Redis handles most cases and takes 15 minutes to implement. Once you see traffic patterns that expose its boundary problem, upgrade to a Sliding Window Counter or Token Bucket. Instrument your rate limiting with metrics -- track rejection rates, which clients hit limits, and whether your limits are too tight or too loose. The best rate limiter is the one you can adjust based on real data.

Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
