Rate Limiting Is Not a Feature

Rate limiting gets added to systems in a specific way. Something bad happens. Either a client hammers an endpoint until the database falls over, or a bug in a consumer creates a retry storm, or a single large customer accidentally uses the product in a way that degrades the experience for everyone else. An incident happens. Someone says “we need rate limiting.” Rate limiting gets added. The specific incident that prompted it does not happen again.

The next incident involves a different failure mode that the rate limiting does not cover, because the rate limiting was designed to prevent the specific thing that just happened rather than to protect the system against the class of things that can go wrong.

This is the pattern: rate limiting added reactively, to specific endpoints, with limits chosen based on what the recent incident looked like rather than based on what the system can actually sustain. The result is a patchwork of limits that are arbitrarily different from each other, that protect some parts of the system while leaving others exposed, and that neither developers nor users understand well enough to work with confidently.

Building rate limiting that actually works requires thinking about it differently. Not as a feature that gets added to protect specific endpoints. As a systematic answer to the question of what the system can sustain and how to ensure it is never asked to do more than that.

What rate limiting is actually protecting

The framing of rate limiting as protection against abusive clients is both common and incomplete. It captures one real use case and misses several others.

Abusive clients are real. Scrapers, bots, misconfigured integrations, developers testing against production because staging does not have real data. Limiting these clients protects the system from being overwhelmed by a single source. This is the use case that most rate limiting implementations address.

But the more important protection is against the system itself.

A system under heavy legitimate load can fail in ways that a system under moderate load does not. Queries that are fast with low concurrency become slow with high concurrency. Connection pools exhaust. Memory pressure increases. Latency spikes. Other services that depend on the loaded service start failing. The failure cascades in ways that are hard to predict and hard to stop once they start.

Rate limiting prevents this not by assuming the load is malicious but by acknowledging that every system has a capacity ceiling and that exceeding it reliably produces bad outcomes for everyone, including the clients that caused the overload.

The third use case is fairness. A multi-tenant system where one large customer can consume so much capacity that other customers experience degraded service is a system with a fairness problem. Rate limiting is how you enforce the implicit contract that every customer gets a reasonable share of the system’s capacity regardless of whether they are the largest customer.

These three use cases, protection from abuse, protection from self, and fairness, have different shapes and require different approaches. A system that only implements the first is missing most of what rate limiting is for.
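The fairness case can be made concrete with max-min fair sharing, sketched below under stated assumptions. The function name `fair_shares`, the tenant names, and the numbers are hypothetical and do not come from any particular system. Small demands are met in full, and whatever is left is split evenly among the tenants that asked for more.

```python
def fair_shares(capacity: float, demands: dict[str, float]) -> dict[str, float]:
    """Max-min fair allocation of `capacity` across tenant demands."""
    shares: dict[str, float] = {}
    remaining = capacity
    # Handle tenants from smallest to largest demand.
    pending = sorted(demands.items(), key=lambda item: item[1])
    while pending:
        # What each remaining tenant would get from an even split.
        equal_split = remaining / len(pending)
        tenant, demand = pending.pop(0)
        granted = min(demand, equal_split)
        shares[tenant] = granted
        remaining -= granted
    return shares


# Hypothetical demands in cost units per minute against 1000 units of capacity.
shares = fair_shares(
    capacity=1000.0,
    demands={"small": 100.0, "medium": 300.0, "large": 5000.0},
)
# The large tenant is capped at what is left; the others are untouched.
print(shares)
```

The largest customer still gets the largest share, but it cannot push the smaller tenants below what they asked for.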

The limit that is the wrong limit

The most common rate limit in production systems is requests per minute per IP address. It is common because it is easy to implement and because it addresses the most obvious failure mode: a single source sending many requests.

It is also the wrong limit for most of what you are actually protecting.

Requests per minute measures the rate of requests. It does not measure the cost of requests. A system where one request does a simple cache lookup and another request runs a complex aggregation query across millions of rows is a system where counting requests treats fundamentally different costs as equivalent. Limiting to one hundred requests per minute means something very different depending on which of those two request types is being sent.

# The common implementation: counts requests, ignores their cost
from collections import defaultdict
import time


class NaiveRateLimiter:
    def __init__(self, requests_per_minute: int):
        self.limit = requests_per_minute
        self.window_seconds = 60
        self.counts: dict[str, list[float]] = defaultdict(list)

    def is_allowed(self, client_id: str) -> bool:
        now = time.time()
        window_start = now - self.window_seconds

        self.counts[client_id] = [
            ts for ts in self.counts[client_id]
            if ts > window_start
        ]

        if len(self.counts[client_id]) >= self.limit:
            return False

        self.counts[client_id].append(now)
        return True


# The problem: every request costs the same.
# A request that queries the search index for 2 seconds
# counts the same as a request that returns a cached value in 2ms.
# The rate limit protects against request volume
# but not against the capacity consumption that actually causes failures.

The better framing is cost-based rate limiting. Each operation has a cost, expressed in abstract units, that reflects its actual resource consumption. A cheap operation costs one unit. An expensive operation costs fifty. The rate limit is expressed in units per time window rather than requests per time window. Clients exhaust their budget based on what they do, not just how often they do it.

from dataclasses import dataclass
from typing import Optional
import time
import redis


@dataclass
class OperationCost:
    """
    Defines the cost of an operation in abstract units.
    These should be calibrated against actual resource consumption.
    Start rough. Refine as you observe real usage patterns.
    """
    simple_read: int = 1          # Cache hit or simple indexed query
    complex_read: int = 10        # Multi-table join or aggregation
    search: int = 25              # Full-text search or vector similarity
    write: int = 5                # Single record create or update
    bulk_operation: int = 50      # Batch import or mass update
    ai_inference_small: int = 100 # Short LLM completion
    ai_inference_large: int = 500 # Long LLM completion or embedding batch


class CostBasedRateLimiter:
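    """
    Token bucket rate limiter with cost-based token consumption.
    Uses Redis for distributed state so limits apply across
    multiple instances of the service.

    Note: the read and the write below are separate round trips, so two
    instances can race and both spend the same tokens. For strict
    enforcement, run the whole check inside a single Lua script.
    """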

    def __init__(
        self,
        redis_client: redis.Redis,
        bucket_capacity: int = 1000,
        refill_rate_per_second: float = 16.67,  # 1000 units per minute
    ):
        self.redis = redis_client
        self.bucket_capacity = bucket_capacity
        self.refill_rate = refill_rate_per_second

    def _bucket_key(self, client_id: str) -> str:
        return f"ratelimit:bucket:{client_id}"

    def _last_refill_key(self, client_id: str) -> str:
        return f"ratelimit:refill:{client_id}"

    def check_and_consume(
        self,
        client_id: str,
        cost: int,
    ) -> tuple[bool, dict]:
        """
        Returns (allowed, metadata) where metadata contains
        information useful for the Retry-After and X-RateLimit headers.
        """
        now = time.time()
        bucket_key = self._bucket_key(client_id)
        refill_key = self._last_refill_key(client_id)

        pipe = self.redis.pipeline()
        pipe.get(bucket_key)
        pipe.get(refill_key)
        current_tokens_raw, last_refill_raw = pipe.execute()

        current_tokens = float(current_tokens_raw or self.bucket_capacity)
        last_refill = float(last_refill_raw or now)

        elapsed = now - last_refill
        refill_amount = elapsed * self.refill_rate
        current_tokens = min(
            self.bucket_capacity,
            current_tokens + refill_amount,
        )

        if current_tokens < cost:
            tokens_needed = cost - current_tokens
            seconds_until_available = tokens_needed / self.refill_rate

            return False, {
                "allowed": False,
                "current_tokens": int(current_tokens),
                "cost": cost,
                "retry_after_seconds": int(seconds_until_available) + 1,
                "bucket_capacity": self.bucket_capacity,
            }

        new_token_count = current_tokens - cost

        pipe = self.redis.pipeline()
        pipe.set(bucket_key, new_token_count, ex=3600)
        pipe.set(refill_key, now, ex=3600)
        pipe.execute()

        return True, {
            "allowed": True,
            "current_tokens": int(new_token_count),
            "cost": cost,
            "bucket_capacity": self.bucket_capacity,
        }

The bucket capacity and refill rate are the levers. The bucket capacity determines how much burst traffic is allowed before limiting kicks in. The refill rate determines the sustained throughput over time. Setting these correctly requires knowing what the system can actually sustain, which is a load testing problem rather than a configuration problem.
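To see the two levers separately, here is a minimal in-memory sketch of the same token bucket idea with the clock passed in explicitly. `InMemoryTokenBucket` and the numbers are illustrative assumptions, not part of the Redis-backed limiter above.

```python
class InMemoryTokenBucket:
    """Single-process token bucket with an explicit clock, for illustration."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity        # maximum burst, in cost units
        self.refill_rate = refill_rate  # sustained units per second
        self.tokens = capacity
        self.last_refill = now

    def try_consume(self, cost: float, now: float) -> bool:
        # Credit tokens for the time elapsed, never exceeding capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


bucket = InMemoryTokenBucket(capacity=100, refill_rate=10)

# Burst: six 25-unit requests arrive at t=0. Capacity covers four of them.
burst_allowed = sum(bucket.try_consume(25, now=0.0) for _ in range(6))

# One second later only 10 units have refilled, so a 25-unit request waits.
after_one_second = bucket.try_consume(25, now=1.0)

# By t=3 the bucket holds 30 units and the request goes through.
after_three_seconds = bucket.try_consume(25, now=3.0)
```

Capacity decided how many requests got through at once. The refill rate decided how long the next one had to wait.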

The response that tells the client nothing useful

When a rate limit is hit, most implementations return a 429 status code and a short error message. The client knows they have been rate limited. They do not know when they can try again, how far over the limit they were, or what they should do differently to avoid hitting the limit.

This produces a specific pattern in client behavior: the client retries immediately, hits the rate limit again, retries again, and continues until either the limit resets or the client gives up. The retry storm is not malicious. It is the predictable response to a system that says “no” without saying “not yet” or “not like that.”

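The response headers that prevent this are simple. Retry-After is a standard HTTP header. The X-RateLimit-* headers are a widely used convention rather than a formal standard, but most clients and SDKs recognize them.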

from fastapi import Request, Response
from fastapi.responses import JSONResponse
import logging
import time

log = logging.getLogger(__name__)


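class RateLimitMiddleware:
    """
    Callable with the (request, call_next) signature that FastAPI's
    @app.middleware("http") decorator expects, so one way to register it is
    app.middleware("http")(RateLimitMiddleware(app, limiter, costs)).
    """
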
    def __init__(self, app, limiter: CostBasedRateLimiter, costs: OperationCost):
        self.app = app
        self.limiter = limiter
        self.costs = costs

    def get_client_id(self, request: Request) -> str:
        # Prefer authenticated user ID over IP address.
        # IP-based limiting is easily bypassed and punishes shared NAT.
        user = getattr(request.state, "user", None)
        if user and user.id:
            return f"user:{user.id}"

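        # For unauthenticated requests, fall back to IP.
        # Consider that many users may share an IP (corporate NAT, etc.)
        # Only trust X-Forwarded-For when a proxy you control sets it;
        # otherwise any client can forge the header and dodge the limit.
        forwarded_for = request.headers.get("X-Forwarded-For")
        if forwarded_for:
            return f"ip:{forwarded_for.split(',')[0].strip()}"

        # request.client can be None, so guard before reading the host.
        host = request.client.host if request.client else "unknown"
        return f"ip:{host}"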

    def get_request_cost(self, request: Request) -> int:
        path = request.url.path
        method = request.method

        # Cost routing based on operation type.
        # These mappings should reflect actual resource cost.
        if "/search" in path:
            return self.costs.search
        if "/ai/" in path or "/generate/" in path:
            return self.costs.ai_inference_large
        if "/bulk/" in path or "/batch/" in path:
            return self.costs.bulk_operation
        if method in ("POST", "PUT", "PATCH", "DELETE"):
            return self.costs.write
        if "/aggregate/" in path or "/report/" in path:
            return self.costs.complex_read

        return self.costs.simple_read

    async def __call__(self, request: Request, call_next):
        client_id = self.get_client_id(request)
        cost = self.get_request_cost(request)

        allowed, metadata = self.limiter.check_and_consume(client_id, cost)

        if not allowed:
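            # Standard-library logging rejects arbitrary keyword arguments;
            # structured fields go through `extra`.
            log.warning(
                "rate_limit.exceeded",
                extra={
                    "client_id": client_id,
                    "path": request.url.path,
                    "cost": cost,
                    "current_tokens": metadata["current_tokens"],
                    "retry_after": metadata["retry_after_seconds"],
                },
            )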

            return JSONResponse(
                status_code=429,
                content={
                    "error": {
                        "code": "rate_limit_exceeded",
                        "message": "You have exceeded the rate limit for this API.",
                        "retry_after_seconds": metadata["retry_after_seconds"],
                        "current_tokens": metadata["current_tokens"],
                        "request_cost": cost,
                        "docs_url": "https://docs.example.com/rate-limiting",
                    }
                },
                headers={
                    # Retry-After is standard HTTP; X-RateLimit-* is a
                    # common convention that many clients understand.
                    "Retry-After": str(metadata["retry_after_seconds"]),
                    "X-RateLimit-Limit": str(metadata["bucket_capacity"]),
                    "X-RateLimit-Remaining": str(metadata["current_tokens"]),
                    "X-RateLimit-Reset": str(
                        int(time.time()) + metadata["retry_after_seconds"]
                    ),
                },
            )

        response = await call_next(request)

        # Include rate limit headers on successful responses too.
        # Clients can use these to throttle proactively rather than reactively.
        response.headers["X-RateLimit-Limit"] = str(metadata["bucket_capacity"])
        response.headers["X-RateLimit-Remaining"] = str(metadata["current_tokens"])
        response.headers["X-RateLimit-Cost"] = str(cost)

        return response

The headers on successful responses are as important as the headers on rejected ones. A client that can see their remaining budget on every response can throttle proactively rather than waiting to be rejected. This reduces the number of 429 responses, which reduces the noise in your logs and the friction for your clients.
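On the client side, proactive throttling can be as small as one function. The sketch below is an assumption about how a well-behaved client might read the headers emitted above. `next_delay_seconds`, the 10% threshold, the one-second pause, and the 60-second backoff cap are all hypothetical choices. It handles only the delay-seconds form of Retry-After, not the HTTP-date form, and it expects header names in the exact casing shown.

```python
import random


def next_delay_seconds(status_code: int, headers: dict[str, str], attempt: int) -> float:
    """Decide how long a client should wait before sending its next request."""
    if status_code == 429:
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            # The server said when to come back. Add up to one second of
            # jitter so many clients do not all retry at the same instant.
            return int(retry_after) + random.uniform(0, 1)
        # No hint from the server: exponential backoff with full jitter.
        return random.uniform(0, min(60.0, 2.0 ** attempt))

    limit = int(headers.get("X-RateLimit-Limit", "0"))
    remaining = int(headers.get("X-RateLimit-Remaining", "0"))
    if limit > 0 and remaining / limit < 0.10:
        # Budget is nearly spent: slow down before the server says no.
        return 1.0
    return 0.0


rejected = next_delay_seconds(429, {"Retry-After": "5"}, attempt=0)
nearly_empty = next_delay_seconds(
    200, {"X-RateLimit-Limit": "1000", "X-RateLimit-Remaining": "50"}, attempt=0
)
plenty_left = next_delay_seconds(
    200, {"X-RateLimit-Limit": "1000", "X-RateLimit-Remaining": "500"}, attempt=0
)
```

A client that sleeps for the returned delay never retries immediately after a 429, which is what removes the retry storm.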

The limit that punishes the wrong person

IP-based rate limiting has a failure mode that teams discover embarrassingly late: it punishes everyone behind a shared IP address.

Corporate networks often route all outbound traffic through a small number of IP addresses. A company with five hundred developers using your API might appear to come from three IP addresses. A university with ten thousand students might appear to come from one. Rate limiting by IP address treats all of these users as a single client. When the limit is reached, it is reached for everyone behind that IP, not just for the client that was making the most requests.

The developer at the company who is doing something heavy gets their colleagues locked out. The student running a project gets their entire campus blocked.
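The arithmetic is worth doing once. The sketch below assumes a hypothetical limit of 200 units per minute per IP and uses the company and campus figures from the paragraph above. `per_user_budget` is an illustrative helper, not part of any library.

```python
def per_user_budget(limit_per_ip: int, users: int, egress_ips: int) -> float:
    """Average units per minute each user gets when limits are keyed by IP."""
    return limit_per_ip * egress_ips / users


# 500 developers behind 3 addresses, 200 units per minute per address.
company = per_user_budget(limit_per_ip=200, users=500, egress_ips=3)

# 10,000 students behind a single address.
campus = per_user_budget(limit_per_ip=200, users=10_000, egress_ips=1)

print(f"company: {company:.2f} units/min per developer")
print(f"campus:  {campus:.2f} units/min per student")
```

A limit that looks generous for one client becomes almost nothing once it is divided among everyone who shares the address.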

# A tiered client identification strategy that reduces the shared-IP problem

def get_client_id_tiered(request: Request) -> tuple[str, str]:
    """
    Returns (client_id, client_type) where client_type describes
    how the client was identified. Used for observability and debugging.

    Priority order:
    1. Authenticated API key with known client identity
    2. Authenticated session with known user identity
    3. IP address (fallback, least precise)
    """

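    api_key = request.headers.get("X-API-Key")
    if api_key:
        # validate_api_key is assumed to be defined elsewhere: it returns
        # a client record for a valid key and None otherwise.
        client = validate_api_key(api_key)
        if client:
            return f"apikey:{client.id}", "api_key"

    user = getattr(request.state, "user", None)
    if user and user.id:
        return f"user:{user.id}", "authenticated_user"

    # Only trust X-Forwarded-For when a proxy you control sets it;
    # clients can forge the header otherwise.
    forwarded_for = request.headers.get("X-Forwarded-For", "")
    ip = forwarded_for.split(",")[0].strip() if forwarded_for else request.client.host
    return f"ip:{ip}", "ip_address"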


# API key clients can have higher limits than anonymous IP clients.
# The limit reflects the trust level, not just the identity.

CLIENT_TYPE_LIMITS = {
    "api_key": {
        "bucket_capacity": 10000,
        "refill_rate_per_second": 166.7,  # 10000 per minute
    },
    "authenticated_user": {
        "bucket_capacity": 2000,
        "refill_rate_per_second": 33.3,   # 2000 per minute
    },
    "ip_address": {
        "bucket_capacity": 200,
        "refill_rate_per_second": 3.3,    # 200 per minute
    },
}

Setting different limits for different client types is not just about being generous to known clients. It is about calibrating the limit to the precision of the identity. An IP address is a poor proxy for a client. A user ID is a better proxy. An API key tied to a specific integration is the best proxy. The limit should reflect how precisely you know who you are talking to.

The rate limit nobody sets

There is a category of limit that almost no implementation includes and that is responsible for some of the most expensive production incidents in systems that have good API rate limiting in place.

Internal rate limits.

Most rate limiting is applied at the edge. Requests come in from outside the system and are rate limited before they reach the internal services. The internal services themselves can call each other without limits, because the assumption is that internal traffic is controlled and trustworthy.

This assumption fails in specific ways. A bug in a service that causes it to call a downstream service in a loop. A new feature that triggers a fan-out, one incoming request becoming a hundred internal requests to a shared dependency. A batch job that runs concurrently with user-facing traffic and consumes the database connection pool.

The failure mode is the same as uncontrolled external traffic, with one addition: the source is inside the trust boundary, so the external rate limiting never sees it.

# Internal service client with built-in rate limiting and circuit breaking

import asyncio
import logging
import time
from dataclasses import dataclass
from typing import Optional

import httpx

log = logging.getLogger(__name__)


@dataclass
class CircuitBreakerState:
    failures: int = 0
    last_failure_time: Optional[float] = None
    is_open: bool = False
    failure_threshold: int = 5
    recovery_timeout_seconds: float = 30.0

    def record_failure(self) -> None:
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.is_open = True
            # Standard-library logging takes structured fields via `extra`.
            log.error(
                "circuit_breaker.opened",
                extra={
                    "failures": self.failures,
                    "threshold": self.failure_threshold,
                },
            )

    def record_success(self) -> None:
        self.failures = 0
        self.is_open = False

    def should_attempt(self) -> bool:
        if not self.is_open:
            return True
        if self.last_failure_time is None:
            return True
        elapsed = time.time() - self.last_failure_time
        if elapsed > self.recovery_timeout_seconds:
            # Let trial requests through; a failure restarts the timeout.
            log.info("circuit_breaker.attempting_recovery")
            return True
        return False


class RateLimitedInternalClient:
    """
    Internal service client that rate limits its own outbound calls.
    Prevents a single service from overwhelming its dependencies
    regardless of how many requests it receives.
    """

    def __init__(
        self,
        base_url: str,
        service_name: str,
        max_concurrent_requests: int = 20,
        requests_per_second: float = 50.0,
    ):
        self.base_url = base_url
        self.service_name = service_name
        self.semaphore = asyncio.Semaphore(max_concurrent_requests)
        self.min_interval = 1.0 / requests_per_second
        self.next_send_time: float = 0.0
        self.circuit_breaker = CircuitBreakerState()

    def _reserve_send_delay(self) -> float:
        # Reserve the next send slot. There is no await between the read
        # and the write, so concurrent coroutines on one event loop cannot
        # claim the same slot and fire together.
        now = time.monotonic()
        send_at = max(now, self.next_send_time)
        self.next_send_time = send_at + self.min_interval
        return send_at - now

    async def get(self, path: str, **kwargs) -> httpx.Response:
        if not self.circuit_breaker.should_attempt():
            raise ServiceUnavailableError(
                f"{self.service_name} circuit breaker is open. "
                f"Service appears to be degraded. Retry after "
                f"{self.circuit_breaker.recovery_timeout_seconds}s."
            )

        async with self.semaphore:
            # Space requests at least min_interval apart.
            await asyncio.sleep(self._reserve_send_delay())

            try:
                async with httpx.AsyncClient(timeout=10.0) as client:
                    response = await client.get(
                        f"{self.base_url}{path}",
                        **kwargs,
                    )
                    response.raise_for_status()
                    self.circuit_breaker.record_success()
                    return response

            except (httpx.HTTPError, httpx.TimeoutException) as e:
                self.circuit_breaker.record_failure()
                log.error(
                    "internal_client.request_failed",
                    extra={
                        "service": self.service_name,
                        "path": path,
                        "error": str(e),
                        "circuit_breaker_failures": self.circuit_breaker.failures,
                    },
                )
                raise


class ServiceUnavailableError(Exception):
    pass

The semaphore limits concurrency. The minimum interval limits throughput. Together they ensure that no matter how much load a service receives, its outbound calls to dependencies are bounded. The circuit breaker prevents a failing dependency from consuming all the retry capacity of the calling service while it recovers.

These patterns together mean that a bug that causes runaway internal calls hits the rate limiter and circuit breaker before it hits the downstream service. The damage is contained rather than cascaded.
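The containment is easy to demonstrate without a network. The sketch below assumes a fake dependency call (a short sleep) and a hypothetical `bounded_fan_out` helper. It launches one hundred calls at once, as a runaway fan-out would, and records how many are ever in flight together.

```python
import asyncio


async def bounded_fan_out(n_calls: int, max_concurrent: int) -> int:
    """Run n_calls fake dependency calls; return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def call_dependency() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a network round trip
            in_flight -= 1

    # All calls are created at once; the semaphore decides how many run.
    await asyncio.gather(*(call_dependency() for _ in range(n_calls)))
    return peak


peak = asyncio.run(bounded_fan_out(n_calls=100, max_concurrent=20))
print(f"peak concurrent calls: {peak}")
```

The caller asked for a hundred concurrent calls. The dependency never saw more than twenty.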

Setting the limits correctly

The hardest part of rate limiting is not the implementation. It is knowing what numbers to use.

Limits set too low create friction for legitimate clients. Limits set too high do not protect the system. The right limits reflect the actual capacity of the system, which requires measurement rather than estimation.

The process is straightforward even if it takes time:

Load test the system at increasing request rates until it starts degrading. Note the throughput at which latency starts climbing. Note the throughput at which errors start appearing. The sustainable throughput is somewhere below the degradation point, with enough headroom that normal variation does not push the system into degradation.

Set the rate limit at roughly seventy to eighty percent of sustainable throughput. This gives legitimate clients significant capacity while keeping the system well away from its limits under normal operation. The remaining twenty to thirty percent is headroom for spikes, for operational overhead, and for the reality that load testing never perfectly represents production traffic patterns.
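As a worked example of that arithmetic, the sketch below turns a measured sustainable throughput into token bucket settings. `derive_limits`, the 75% utilization default, and the idea of sizing the bucket as ten seconds of burst are assumptions made for illustration. The result is a system-wide budget that still has to be divided among clients.

```python
def derive_limits(
    sustainable_units_per_second: float,
    utilization: float = 0.75,    # the 70-80% target, taken at its midpoint
    burst_seconds: float = 10.0,  # assumed: how long a full-rate burst may last
) -> dict:
    """Convert measured capacity into token bucket settings."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    refill_rate = sustainable_units_per_second * utilization
    return {
        "refill_rate_per_second": refill_rate,
        "bucket_capacity": int(refill_rate * burst_seconds),
        "headroom_units_per_second": sustainable_units_per_second - refill_rate,
    }


# Hypothetical measurement: the system sustains 400 cost units per second.
limits = derive_limits(sustainable_units_per_second=400.0)
print(limits)
```

With these numbers the limiter admits 300 units per second and leaves 100 units per second of headroom for spikes and for everything the load test did not capture.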

# A simple load test to establish baseline capacity.
# Run this before setting limits, not after the first incident.

import asyncio
import httpx
import statistics
import time


async def measure_capacity(
    endpoint: str,
    target_rps: float,
    duration_seconds: int = 60,
) -> dict:
    """
    Sends requests at target_rps for duration_seconds.
    Returns latency percentiles and error rate.
    Use results to calibrate rate limits.

    Requests are fired on a fixed schedule without waiting for earlier
    responses, so a slow server cannot quietly lower the offered load.
    """
    latencies: list[float] = []
    errors = 0

    async def send_one(client: httpx.AsyncClient) -> None:
        nonlocal errors
        start = time.perf_counter()
        try:
            response = await client.get(endpoint)
            if response.status_code >= 500:
                errors += 1
            else:
                latencies.append((time.perf_counter() - start) * 1000)
        except Exception:
            errors += 1

    interval = 1.0 / target_rps
    tasks: list[asyncio.Task] = []
    started = time.perf_counter()

    async with httpx.AsyncClient(timeout=30.0) as client:
        while time.perf_counter() - started < duration_seconds:
            tasks.append(asyncio.create_task(send_one(client)))
            # Schedule against the start time so drift does not accumulate.
            next_send = started + len(tasks) * interval
            await asyncio.sleep(max(0.0, next_send - time.perf_counter()))
        await asyncio.gather(*tasks)

    total = len(tasks)
    elapsed = time.perf_counter() - started

    if not latencies:
        return {"error": "all requests failed"}

    latencies.sort()

    return {
        "target_rps": target_rps,
        "total_requests": total,
        "error_rate": errors / total,
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[int(len(latencies) * 0.95)],
        "p99_ms": latencies[int(len(latencies) * 0.99)],
        # Successful responses per second over the whole run.
        "throughput_rps": len(latencies) / elapsed,
    }


# Run at increasing RPS until error rate exceeds 1% or p99 exceeds SLA.
# The highest RPS below those thresholds is your sustainable capacity.
# Set rate limits at 70-80% of that number.
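The ramp described in those comments can be reduced to a small selection step over the collected results. This is a sketch under assumptions: `find_sustainable_rps` is a hypothetical helper, the 500 ms p99 target stands in for whatever your SLA says, and the sample results are invented to show the shape of the data, not measured from any system.

```python
from typing import Optional


def find_sustainable_rps(
    results: list[dict],
    max_error_rate: float = 0.01,
    p99_sla_ms: float = 500.0,  # assumed SLA; substitute your own
) -> Optional[float]:
    """Highest tested RPS before the first run that breaks a threshold."""
    sustainable = None
    for result in sorted(results, key=lambda r: r["target_rps"]):
        if result["error_rate"] > max_error_rate or result["p99_ms"] > p99_sla_ms:
            break  # degradation starts here; ignore anything above it
        sustainable = result["target_rps"]
    return sustainable


# Invented results in the shape returned by measure_capacity.
sample_results = [
    {"target_rps": 50, "error_rate": 0.000, "p99_ms": 120.0},
    {"target_rps": 100, "error_rate": 0.002, "p99_ms": 180.0},
    {"target_rps": 200, "error_rate": 0.004, "p99_ms": 420.0},
    {"target_rps": 400, "error_rate": 0.030, "p99_ms": 2100.0},
]

sustainable = find_sustainable_rps(sample_results)
limit_low, limit_high = sustainable * 0.7, sustainable * 0.8
```

Stopping at the first failing run matters: a higher rate that happens to pass after a lower one failed is noise, not capacity.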

What rate limiting cannot do

Rate limiting is good at preventing a known type of load from exceeding known thresholds. It is not good at handling traffic patterns it was not designed for.

A rate limit based on requests per minute does not help when the problem is request size rather than request count. A rate limit designed for human users does not help when a well-behaved automation client sends legitimate traffic that sits exactly at the limit, all day, without ever crossing it.

Rate limiting is one layer of protection among several. It works alongside timeouts, which prevent individual requests from consuming resources indefinitely. It works alongside circuit breakers, which prevent cascading failures when downstream services degrade. It works alongside load shedding, which deliberately drops lower-priority requests when the system is at capacity rather than accepting them and serving all requests slowly.
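Load shedding is the least familiar of these layers, so here is a minimal sketch of the idea. `LoadShedder`, the two-level priority split, and the 80% cutoff are hypothetical choices for illustration. Low-priority requests are refused once the system is mostly full, which keeps the last slots free for high-priority work.

```python
class LoadShedder:
    """Admission control that sheds low-priority work before high-priority work."""

    def __init__(self, max_in_flight: int, shed_low_priority_at: float = 0.8):
        self.max_in_flight = max_in_flight
        # Low-priority requests are refused once this many are in flight.
        self.low_priority_cutoff = int(max_in_flight * shed_low_priority_at)
        self.in_flight = 0

    def try_admit(self, high_priority: bool) -> bool:
        limit = self.max_in_flight if high_priority else self.low_priority_cutoff
        if self.in_flight >= limit:
            return False  # shed: reject now rather than serve everyone slowly
        self.in_flight += 1
        return True

    def release(self) -> None:
        # Call when an admitted request finishes.
        self.in_flight = max(0, self.in_flight - 1)


shedder = LoadShedder(max_in_flight=10)

# Ten low-priority requests arrive; only the first eight are admitted.
admitted_low = sum(shedder.try_admit(high_priority=False) for _ in range(10))

# High-priority requests can still use the two reserved slots.
admitted_high = sum(shedder.try_admit(high_priority=True) for _ in range(5))
```

Unlike a rate limit, this decision depends on the current state of the system rather than on who is asking, which is why the two mechanisms complement each other.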

No single mechanism makes a system robust. A system that has rate limiting but no timeouts will have requests pile up behind slow downstream dependencies regardless of the rate limit. A system with rate limiting and timeouts but no observability cannot tell whether the rate limiting is working as intended or whether it is the wrong shape for the actual traffic.

The rate limit is the boundary you draw around what the system can sustain. Drawing it thoughtfully, enforcing it correctly, communicating it clearly to clients, and measuring whether it is doing what it is supposed to do: these are the things that make it useful.

Adding a number to a config file and calling it a rate limit is not the same thing.

The system does not care about the number in the config file. It cares about whether the load it receives matches what it was built to handle.

Rate limiting done right is how you ensure it does.