ArchitectureDevOpsDev Tooling

Software Doesn't Have to Get Slower

slow software

There is a phenomenon so common in software that it has become background noise: systems get slower over time. Not because the hardware degrades. Not because the data grows in ways nobody expected. Because of a long sequence of small decisions, each individually reasonable, that accumulate into something that would have been unacceptable if it had arrived all at once.

A feature ships. It adds a database query to a page load. The query is fast. Nobody notices. Another feature ships. Another query. Still fast. The product grows. The team grows. Features keep shipping. Queries keep accumulating. At some point the page load that used to take two hundred milliseconds takes two seconds. The team investigates. They find dozens of queries, none of them obviously the problem, all of them slightly worse than they should be, adding up to something that is clearly the problem.

This is the normal trajectory of a software system that is not actively maintained for performance. Slowdown is the default. Speed requires deliberate work to maintain.

Most teams treat performance as something to fix periodically rather than something to maintain continuously. The system becomes slow, an incident or a complaint surfaces the slowness, someone goes in and finds the worst offenders and fixes them, the system is fast again for a while, the accumulation resumes. The cycle repeats because the conditions that produced the slowdown were not changed. Only the symptoms were addressed.

Why slowdown is invisible until it is not

Performance degradation is invisible in the same way that technical debt is invisible: the unit of change is too small to notice, and the cumulative effect only becomes visible after so many small changes have compounded that the outcome is dramatically different from where things started.

A query that was fifty milliseconds six months ago and is now two hundred milliseconds did not degrade in a single moment. It degraded through a series of changes: a new index was added that helped write performance but slightly hurt read performance, a JOIN was added when a new feature needed related data, the table grew from a hundred thousand rows to five million. Each change moved the needle a little. No single change was the problem.

The team that notices this is the team that is measuring the query’s performance continuously and therefore has a chart that shows the trend. The team that does not notice until it is a problem is the team that has no historical performance data and therefore has no way to see the gradual change, only the current state compared to a vague memory of when things felt faster.

# The performance regression that appears invisible without measurement.
# Each version is a real commit. Each change seemed reasonable at the time.

# Version 1: The original query. Fast.
# SELECT id, name, email FROM users WHERE id = $1
# Execution time: 1.2ms average, 3.1ms p99

# Version 2: Added last login for security dashboard.
# SELECT id, name, email, last_login FROM users WHERE id = $1
# Execution time: 1.3ms average, 3.4ms p99

# Version 3: Added subscription status for billing feature.
# SELECT u.id, u.name, u.email, u.last_login, s.status, s.plan
# FROM users u LEFT JOIN subscriptions s ON s.user_id = u.id
# WHERE u.id = $1
# Execution time: 2.1ms average, 8.2ms p99

# Version 4: Added preference count for personalisation feature.
# SELECT u.id, u.name, u.email, u.last_login, s.status, s.plan,
#        COUNT(p.id) as preference_count
# FROM users u
# LEFT JOIN subscriptions s ON s.user_id = u.id
# LEFT JOIN user_preferences p ON p.user_id = u.id
# WHERE u.id = $1
# GROUP BY u.id, u.name, u.email, u.last_login, s.status, s.plan
# Execution time: 12.4ms average, 45ms p99

# Version 5: Added organisation data for team feature.
# Same query, plus another JOIN.
# Execution time: 38ms average, 120ms p99
# Table now has 2.8 million rows.

# This query is called on every authenticated page load.
# At 1000 requests per second, version 5 is spending 38 seconds
# of database time per second on a single query.
# The application is still "working". It is much slower.
# No single commit introduced an obvious problem.

The team that caught this would have caught it at version three, when the execution time jumped from one millisecond to two. That jump is significant. It is the point where a JOIN with no index on the join condition, or with a larger-than-expected table on the other side, is starting to cost something. Catching it at version three means a focused investigation with two recent changes to examine. Catching it at version five means sifting through months of changes with a query that has become genuinely complex.

The N+1 that everyone knows about and nobody catches

The N+1 query problem is one of the most well-known performance antipatterns in web development. It is also consistently the most common source of performance incidents in production systems. The fact that every engineer knows about it and it still happens constantly is worth examining.

It happens because it is invisible at the code level. The code that produces an N+1 looks correct. It reads naturally. It does what it says it does. The performance consequence is only visible at the database level, and only when the value of N is large enough that the cumulative cost becomes obvious.

# The code that looks correct and produces N+1 queries.
# In a framework like Django or SQLAlchemy this is the default
# behavior if you forget to eager-load relationships.

async def get_order_list(user_id: str) -> list[dict]:
    # Query 1: fetch orders
    orders = await db.fetch_all(
        "SELECT id, status, total, created_at FROM orders WHERE user_id = $1",
        user_id,
    )

    result = []
    for order in orders:
        # Query 2 through N+1: fetch line items for each order
        # If the user has 200 orders, this is 201 database queries.
        items = await db.fetch_all(
            "SELECT product_id, quantity, unit_price FROM order_items WHERE order_id = $1",
            order["id"],
        )
        result.append({
            "id": order["id"],
            "status": order["status"],
            "total": order["total"],
            "items": [dict(item) for item in items],
        })

    return result


# The correct version: two queries regardless of order count.
async def get_order_list_efficient(user_id: str) -> list[dict]:
    orders = await db.fetch_all(
        "SELECT id, status, total, created_at FROM orders WHERE user_id = $1",
        user_id,
    )

    if not orders:
        return []

    order_ids = [o["id"] for o in orders]

    # One query to fetch all items for all orders at once.
    # The database is significantly better at this than Python is.
    placeholders = ", ".join(f"${i+1}" for i in range(len(order_ids)))
    items = await db.fetch_all(
        f"SELECT order_id, product_id, quantity, unit_price "
        f"FROM order_items WHERE order_id IN ({placeholders})",
        *order_ids,
    )

    items_by_order: dict[str, list] = {}
    for item in items:
        items_by_order.setdefault(item["order_id"], []).append(dict(item))

    return [
        {
            "id": order["id"],
            "status": order["status"],
            "total": order["total"],
            "items": items_by_order.get(order["id"], []),
        }
        for order in orders
    ]

The reason N+1 persists despite being well-known is that it requires seeing the database query log alongside the application code to detect it. Developers working on features see the application code. They do not routinely see the database query log. The two views of the system are separated by the abstraction layer, and the performance consequence lives in the layer that developers are not looking at.

The fix is to make the query log visible during development. Every development environment should have query logging enabled by default, surfacing the number of queries per request and flagging when a request produces more queries than a threshold. The developer who sees “this endpoint made 247 database queries” during testing will find the N+1. The developer who sees only the application code will not.

# Query counting middleware for development environments.
# Shows the number of database queries per request.
# Makes N+1 problems visible before they reach production.

import time
import logging
from contextlib import contextmanager
from typing import Generator

log = logging.getLogger(__name__)


class QueryCounter:
    def __init__(self):
        self.count = 0
        self.queries: list[dict] = []
        self.slow_query_threshold_ms = 50

    def record(self, query: str, duration_ms: float) -> None:
        self.count += 1
        self.queries.append({
            "query": query[:200],
            "duration_ms": round(duration_ms, 2),
            "slow": duration_ms > self.slow_query_threshold_ms,
        })


# Thread-local storage so each request has its own counter
import contextvars
_query_counter: contextvars.ContextVar[QueryCounter] = contextvars.ContextVar(
    "query_counter"
)


@contextmanager
def count_queries() -> Generator[QueryCounter, None, None]:
    counter = QueryCounter()
    token = _query_counter.set(counter)
    try:
        yield counter
    finally:
        _query_counter.reset(token)


class DevelopmentQueryMiddleware:
    N_PLUS_ONE_THRESHOLD = 10

    def __init__(self, app):
        self.app = app

    async def __call__(self, request, call_next):
        import os
        if os.getenv("ENVIRONMENT") != "development":
            return await call_next(request)

        with count_queries() as counter:
            start = time.perf_counter()
            response = await call_next(request)
            duration_ms = (time.perf_counter() - start) * 1000

        if counter.count >= self.N_PLUS_ONE_THRESHOLD:
            log.warning(
                "potential_n_plus_one_detected",
                path=request.url.path,
                query_count=counter.count,
                request_duration_ms=round(duration_ms, 2),
                slowest_queries=[
                    q for q in counter.queries if q["slow"]
                ][:5],
            )

        log.debug(
            "request.query.summary",
            path=request.url.path,
            query_count=counter.count,
            request_duration_ms=round(duration_ms, 2),
        )

        response.headers["X-Query-Count"] = str(counter.count)
        response.headers["X-Request-Duration-Ms"] = str(round(duration_ms, 2))

        return response

The query count in the response header is visible in the browser developer tools. A developer building a feature sees the query count change as they work. The feedback is immediate rather than delayed until someone files a performance complaint.

The cache that made everything worse

Caching is the instinctive response to performance problems. Something is slow. Cache the result. The something is now fast. The problem is solved.

This is true often enough that caching has become a reflex rather than a considered decision. And reflexive caching produces a specific category of problems that are harder to diagnose than the original performance issue.

A cache that is too aggressive returns stale data. A cache that has no eviction strategy grows until it consumes all available memory. A cache that is shared across processes without a locking strategy produces thundering herd problems where the cache expires and every process simultaneously attempts to regenerate it, producing exactly the database load the cache was supposed to prevent. A cache in front of a system that was not designed to be cached hides the symptoms of underlying performance problems rather than addressing them, which means the underlying problems grow unchecked until the cache layer fails.

# Caching done correctly for a result that is expensive to compute
# and acceptable to be slightly stale.

import asyncio
import hashlib
import json
import time
from typing import Any, Callable, Optional
import redis.asyncio as redis


class SmartCache:
    """
    Cache with:
    - TTL that varies with content age to prevent thundering herd
    - Background refresh to prevent cache-miss latency spikes
    - Circuit breaker to fail open when cache is unavailable
    - Metrics to make cache behavior observable
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        default_ttl_seconds: int = 300,
        jitter_fraction: float = 0.1,
    ):
        self.redis = redis_client
        self.default_ttl = default_ttl_seconds
        self.jitter_fraction = jitter_fraction
        self._refresh_tasks: set[asyncio.Task] = set()

    def _jittered_ttl(self, base_ttl: int) -> int:
        import random
        jitter = int(base_ttl * self.jitter_fraction * random.random())
        return base_ttl + jitter

    def _cache_key(self, namespace: str, params: dict) -> str:
        params_hash = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest()[:16]
        return f"cache:{namespace}:{params_hash}"

    async def get_or_compute(
        self,
        namespace: str,
        params: dict,
        compute_fn: Callable,
        ttl_seconds: Optional[int] = None,
    ) -> Any:
        ttl = ttl_seconds or self.default_ttl
        key = self._cache_key(namespace, params)

        try:
            cached = await self.redis.get(key)
            if cached is not None:
                data = json.loads(cached)

                remaining_ttl = await self.redis.ttl(key)
                if remaining_ttl < ttl * 0.2:
                    task = asyncio.create_task(
                        self._refresh_in_background(key, params, compute_fn, ttl)
                    )
                    self._refresh_tasks.add(task)
                    task.add_done_callback(self._refresh_tasks.discard)

                return data["value"]

        except (redis.RedisError, json.JSONDecodeError):
            # Cache unavailable or corrupt. Fail open: compute the value.
            pass

        value = await compute_fn(**params)

        try:
            await self.redis.setex(
                key,
                self._jittered_ttl(ttl),
                json.dumps({"value": value, "computed_at": time.time()}),
            )
        except redis.RedisError:
            # Could not write to cache. Return the value anyway.
            pass

        return value

    async def _refresh_in_background(
        self,
        key: str,
        params: dict,
        compute_fn: Callable,
        ttl: int,
    ) -> None:
        try:
            value = await compute_fn(**params)
            await self.redis.setex(
                key,
                self._jittered_ttl(ttl),
                json.dumps({"value": value, "computed_at": time.time()}),
            )
        except Exception as e:
            import logging
            logging.getLogger(__name__).error(
                "cache.background_refresh.failed",
                key=key,
                error=str(e),
            )

Three specific things in this implementation prevent the common caching failures. The jittered TTL prevents thundering herd: cache entries for the same type of data expire at slightly different times rather than all at once. The background refresh prevents cache-miss latency spikes: entries that are about to expire are refreshed before they expire, so clients never wait for a slow computation. The fail-open pattern prevents cache unavailability from becoming a system outage: if Redis is down, the system computes values directly rather than failing.

None of this is complicated. Most caching implementations omit all of it because the simple version works until the specific failure mode it does not handle appears in production.

Indexes are not free

Every database index speeds up reads and slows down writes. On a write-heavy system, too many indexes degrade write performance more than they improve read performance. On a read-heavy system, missing indexes on common query patterns are the most common source of unexpected slowness.

The failure mode in both directions comes from adding indexes without a clear understanding of the query patterns the system actually has. Index the wrong columns and the indexes do not help. Add too many indexes and the write overhead accumulates until the system cannot ingest data at the required rate.

-- Finding queries that would benefit from an index
-- but do not currently have one.
-- Run this monthly and investigate the top results.

SELECT
    query,
    calls,
    total_exec_time / calls AS avg_exec_time_ms,
    total_exec_time,
    rows / calls AS avg_rows,
    -- High sequential scans with many calls and long execution
    -- time are candidates for indexing
    (total_exec_time / calls) * calls AS total_time_cost
FROM pg_stat_statements
WHERE
    query NOT LIKE '%pg_%'
    AND query NOT LIKE '%information_schema%'
    AND calls > 100
    AND query ~* 'seq scan|sequential'
ORDER BY total_time_cost DESC
LIMIT 20;

-- Finding indexes that are not being used.
-- Indexes that have never been scanned are overhead
-- with no benefit.

SELECT
    schemaname,
    tablename,
    indexname,
    idx_scan AS number_of_times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
    AND indexname NOT LIKE '%pkey%'  -- Keep primary keys
ORDER BY pg_relation_size(indexrelid) DESC;

The unused index query is the one that produces the most actionable results in most production systems. Indexes accumulate over the lifetime of a system because adding an index is the instinctive response to a slow query and removing one requires confidence that it is not needed. Most teams add but rarely remove. The result is a table with thirty indexes, most of them doing nothing for reads while all of them adding overhead to every write.

Performance as a product requirement

The underlying issue in all of the above is that performance is treated as a technical concern rather than a product requirement. Features have acceptance criteria. Performance does not. A feature is done when it works correctly. There is no equivalent definition of when something is fast enough, and without that definition there is no mechanism to catch when something has become too slow.

The teams that maintain performance over time have defined what acceptable performance looks like, encoded that definition into automated checks, and treat a performance regression as a failed build rather than as a future investigation.

# Performance assertions as part of the test suite.
# Not a load test. A regression detector.
# Fails the build when a specific operation has gotten slower.

import pytest
import statistics
import time
from typing import Callable


def assert_p99_under(
    operation: Callable,
    threshold_ms: float,
    iterations: int = 100,
    warmup_iterations: int = 10,
) -> dict:
    """
    Measures the p99 latency of an operation and fails if it
    exceeds the threshold. Used as a performance regression gate.

    The threshold should be set based on measured current performance
    plus a comfortable margin. Not based on what would be nice.
    What is actually acceptable given the current system.
    """
    # Warmup to ensure JIT, connection pooling etc. are initialised
    for _ in range(warmup_iterations):
        operation()

    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        operation()
        latencies.append((time.perf_counter() - start) * 1000)

    latencies.sort()
    p50 = statistics.median(latencies)
    p95 = latencies[int(len(latencies) * 0.95)]
    p99 = latencies[int(len(latencies) * 0.99)]

    assert p99 <= threshold_ms, (
        f"Performance regression detected.\n"
        f"p99 latency: {p99:.1f}ms\n"
        f"Threshold:   {threshold_ms:.1f}ms\n"
        f"p50:         {p50:.1f}ms\n"
        f"p95:         {p95:.1f}ms\n"
        f"This operation has gotten slower. Investigate before merging."
    )

    return {"p50": p50, "p95": p95, "p99": p99}


# In the test suite:
class TestOrderServicePerformance:

    def test_get_user_orders_p99_under_50ms(self, db, test_user):
        # Seed realistic data volume
        seed_orders(db, user_id=test_user.id, count=500)

        def operation():
            return get_order_list_efficient(test_user.id)

        assert_p99_under(operation, threshold_ms=50.0)

    def test_product_search_p99_under_100ms(self, db):
        seed_products(db, count=50_000)

        def operation():
            return search_products(query="laptop", limit=20)

        assert_p99_under(operation, threshold_ms=100.0)

A test that fails when a database query has gotten fifty percent slower catches the regression at the point when it is introduced, when there are one or two recent changes to examine. A system without this test discovers the regression months later when it has compounded with other regressions into something that is visibly broken and difficult to trace back to its origin.

The threshold matters less than having one. A threshold of one hundred milliseconds when the query currently runs in five milliseconds is loose enough to be useless. A threshold of six milliseconds when the query currently runs in five milliseconds will produce false failures. The right threshold is the current p99 plus a margin that distinguishes measurement noise from real regression. Twenty to thirty percent above the current measurement is a reasonable starting point.

The compounding that goes the other way

Everything described above is about compounding in the wrong direction: small degradations accumulating into significant slowness. The same compounding works in the other direction for teams that treat performance as a continuous practice.

A team that catches performance regressions at introduction time has a system whose performance improves as features are added rather than degrading. Each new feature is built with the constraint that it must not make the system slower, which produces different architectural decisions than features built without that constraint.

The system does not automatically get faster just because regressions are prevented. But the budget that was previously being spent on periodic performance rescue efforts becomes available for deliberate improvement, because there is no rescue needed. The investment in measurement and gating pays for itself by eliminating the cyclical work of fixing slowness that should never have accumulated.

A system that was fast a year ago and is still fast today, despite significant new features, is not a lucky outcome. It is the output of a team that decided performance was something to maintain rather than something to periodically restore.

That decision does not require special expertise or significant tooling investment. It requires treating performance measurements as first-class artifacts alongside correctness tests, making slow queries visible during development, and establishing thresholds that turn regressions into feedback rather than future surprises.

The default trajectory of software is toward slowness. The teams that produce fast software have simply decided not to follow the default.