The GPU Is the New Database


In 2004, if you were running a web application at any meaningful scale, your biggest infrastructure problem was the database. Not the application servers: those were stateless, and you could always add more. The database was the single stateful thing everything depended on. It didn’t scale horizontally, it was expensive to run, and almost nobody knew how to operate it well.

Teams made every mistake. They put too much logic in the application and not enough in the database. They put too much in the database and not enough in the application. They didn’t index correctly. They didn’t cache correctly. They scaled vertically until they couldn’t, then scrambled to shard. They had no idea what their query plans looked like. They treated the database as a black box until it stopped working, then learned the hard way that it wasn’t.

Over the following decade, the patterns solidified. Connection pooling. Read replicas. Query analysis. Proper indexing strategy. Cache layers. The knowledge became common. The tools improved. Managed database services abstracted most of the complexity. Today a competent team can run a database at significant scale without extraordinary expertise.

We are now, in 2026, in the same position with GPU infrastructure. The GPU is the new database: the expensive, stateful, poorly understood bottleneck that everything in AI depends on. It doesn’t scale the way people expect, most teams running it operate it badly, and the patterns for running it well have not yet solidified.

The teams that figure this out first will have an infrastructure advantage that is very difficult to close. The teams that don’t will spend the next five years making the same mistakes everyone made with databases in 2004, just faster and more expensively.

Why the GPU is not just a fast CPU

The first mistake most teams make with GPU infrastructure is treating GPUs as very fast CPUs. They’re not. They’re a fundamentally different computational model, and the mismatch between that model and how most people use them is where most of the waste comes from.

A CPU is optimised for latency: completing a single complex task as quickly as possible. It has a small number of powerful cores, large caches, sophisticated branch prediction, and out-of-order execution. It’s good at sequential logic, conditional branching, and tasks where each step depends on the result of the previous one.

A GPU is optimised for throughput: completing an enormous number of simple tasks simultaneously. It has thousands of smaller, simpler cores. It’s good at the same operation applied in parallel to a large amount of data. It’s bad at anything sequential, anything with complex branching, and anything where you need to move data back to the CPU in the middle of computation.

The practical consequence: a GPU that is not batching work is a GPU that is mostly idle. The most common pattern for teams deploying AI inference in production (one request comes in, run the model, return the result, wait for the next request) uses a small fraction of the GPU’s actual capacity. The GPU’s utilisation number can still look reasonable, because it reports how often a kernel was running, not how much of the chip that kernel kept busy. The GPU’s actual computational throughput is terrible.

This is the equivalent of a database that opens a new connection for every query, executes it, and closes the connection. Technically functional. Completely missing how the system should be used.

import asyncio

# What most teams do: one request, one inference
# GPU utilisation looks like 20-40%, but throughput is poor
# (`model` stands in for whatever inference object you have loaded)

async def handle_inference_request(prompt: str) -> str:
    # Batch of one: most of the GPU's parallel capacity goes unused
    result = model.generate(prompt)
    return result


# What should be happening: dynamic batching
# Multiple requests grouped and processed together

class InferenceBatcher:
    def __init__(self, model, max_batch_size: int = 32, max_wait_ms: int = 50):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue: asyncio.Queue = asyncio.Queue()

    def start(self) -> asyncio.Task:
        # Call once from inside a running event loop and keep the returned task
        return asyncio.create_task(self._batch_worker())

    async def infer(self, prompt: str) -> str:
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, future))
        return await future

    async def _batch_worker(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives, then start the clock
            batch = [await self.queue.get()]
            deadline = loop.time() + (self.max_wait_ms / 1000)

            # Collect requests until batch is full or deadline passes
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    item = await asyncio.wait_for(
                        self.queue.get(),
                        timeout=timeout
                    )
                    batch.append(item)
                except asyncio.TimeoutError:
                    break

            prompts = [item[0] for item in batch]
            futures = [item[1] for item in batch]

            try:
                # Single GPU call processes all requests together. It runs in a
                # worker thread so the event loop keeps accepting new requests.
                results = await asyncio.to_thread(self.model.generate_batch, prompts)
            except Exception as exc:
                # Fail every waiting caller instead of leaving them hanging
                for future in futures:
                    if not future.done():
                        future.set_exception(exc)
                continue

            for future, result in zip(futures, results):
                if not future.done():
                    future.set_result(result)

Dynamic batching is the connection pooling of GPU inference. It is not optional if you care about cost or throughput. It is also not implemented by default in most hand-rolled inference deployments, for the same reason that early web applications didn’t implement connection pooling: teams didn’t know they needed it until they hit the wall.
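To make the payoff concrete, here is a toy cost model, not a benchmark. It assumes every forward pass pays a fixed overhead (reading weights, launching kernels) plus a small cost per request in the batch. Both numbers, `fixed_ms` and `per_item_ms`, are hypothetical placeholders that you would replace with your own measurements.

```python
def requests_per_second(batch_size: int, fixed_ms: float = 20.0, per_item_ms: float = 0.5) -> float:
    """Throughput under a toy model: pass time = fixed_ms + per_item_ms * batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pass_ms = fixed_ms + per_item_ms * batch_size
    return batch_size / (pass_ms / 1000.0)


# Compare a batch of one against progressively fuller batches
for size in (1, 8, 32):
    print(f"batch={size:>2}  ~{requests_per_second(size):.0f} req/s")
```

Under these assumed numbers the fixed overhead dominates a batch of one, which is why throughput climbs so steeply as the batch fills. Real models eventually become compute-bound, so the curve flattens at some batch size that only measurement will reveal.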

The memory hierarchy nobody teaches you

GPU memory is not like CPU memory. Understanding the difference is the difference between a system that works and one that doesn’t, and between inference costs that are manageable and ones that are not.

A GPU has its own on-device memory, VRAM. VRAM is fast, finite, and expensive. A GPU with 80GB of VRAM is a very expensive GPU. The model you’re running must fit in VRAM. If it doesn’t fit, you can use techniques like quantization to make it smaller, or you can distribute it across multiple GPUs, but you cannot simply overflow to system RAM without taking a catastrophic performance hit. The bandwidth between CPU RAM and GPU VRAM is orders of magnitude lower than the bandwidth of VRAM itself. When you hear about models being “quantized to 4-bit,” this is why: 4-bit quantization cuts the weight footprint to roughly a quarter of its 16-bit size (half of its 8-bit size), which can be the difference between fitting on one GPU and not fitting on one GPU.
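The arithmetic behind that is short enough to sketch. The function below is a rough estimate under a simple assumption: it counts weights only and ignores the KV cache, activations, and the small overhead of quantization metadata. The function name is illustrative.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# A 70B-parameter model at three common precisions
for bits in (16, 8, 4):
    print(f"70B parameters at {bits}-bit: ~{weight_footprint_gb(70e9, bits):.0f} GB")
```

At 16-bit the weights alone exceed a single 80GB card. At 4-bit they fit with room left over for the KV cache.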

Within the GPU itself, there is a memory hierarchy that determines how fast computation runs. The KV cache (the stored attention keys and values for the tokens already processed in a conversation) lives in VRAM and grows with sequence length. Managing KV cache is one of the most consequential performance decisions in LLM serving, and most teams don’t think about it at all until they start hitting out-of-memory errors on long contexts.
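How fast the KV cache grows is easy to estimate. The sketch below uses the standard accounting (one key tensor and one value tensor per layer) and an assumed model shape similar to an 8B Llama-style model with grouped-query attention. The layer count, head count, and head size are assumptions, so check your model’s config before relying on the numbers.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for `batch_size` sequences of `seq_len` tokens (FP16 by default)."""
    # The leading 2 accounts for storing both keys and values in every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len * batch_size


# Assumed shape: 32 layers, 8 KV heads, head dimension 128
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1)
one_sequence = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{one_sequence / 2**30:.1f} GiB for one 8192-token sequence")
```

Multiply the per-sequence figure by the number of concurrent requests and it becomes clear why long contexts, not model weights, are usually what triggers the out-of-memory errors.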

# KV cache management: what happens without it
# Each new token recomputes attention for the entire context
# Per-token cost is O(n²) in sequence length instead of O(n)

# What vLLM and similar systems do differently:
# PagedAttention manages KV cache in fixed-size blocks
# like virtual memory paging in an OS

# This allows:
# 1. Sharing KV cache between requests with the same prefix
# 2. Better memory utilisation (waste limited to each sequence's last block)
# 3. Handling variable-length sequences without pre-allocating
#    worst-case memory

from vllm import LLM, SamplingParams

# vLLM handles KV cache management automatically
# This is not a minor optimisation: it's a 2-4x throughput improvement
# on typical workloads versus naive implementations

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.90,    # Leave 10% headroom
    max_model_len=8192,             # Maximum sequence length
    enable_prefix_caching=True,     # Cache common prefixes (system prompts)
    tensor_parallel_size=1,         # Number of GPUs for this model
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=512,
)

# Placeholder input; in production these come from your request queue
prompts = ["Explain what a KV cache is in one sentence."]

# Prefix caching means your system prompt is computed once
# and cached for all subsequent requests, which is significant for
# long system prompts used with every inference
outputs = llm.generate(prompts, sampling_params)

Most teams serving LLMs in production are not using PagedAttention. They’re using naive inference implementations that lose fifty to seventy percent of their KV cache memory to fragmentation and over-reservation, and that recompute shared prefixes on every request. The cost difference is not marginal.
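A small sketch shows where that waste comes from. It compares two reservation schemes for the same in-flight requests: a naive one that reserves a worst-case slab per request, and a paged one that reserves fixed-size blocks as needed. This models reservation only, not vLLM’s actual block tables or prefix sharing. The sequence lengths and the block size of 16 tokens are assumptions for illustration.

```python
import math


def contiguous_slots(seq_lens: list[int], max_model_len: int) -> int:
    """Naive scheme: reserve a worst-case-length slab for every request."""
    return len(seq_lens) * max_model_len


def paged_slots(seq_lens: list[int], block_size: int = 16) -> int:
    """Paged scheme: reserve whole fixed-size blocks, only as many as each request needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [120, 900, 35, 2048, 410]  # hypothetical token counts currently in flight
used = sum(seq_lens)

for name, reserved in (("contiguous", contiguous_slots(seq_lens, 8192)),
                       ("paged", paged_slots(seq_lens))):
    print(f"{name:>10}: {reserved} token slots reserved, {used / reserved:.0%} in use")
```

With paging, the only waste is the unused tail of each request’s last block, so the memory that the naive scheme leaves reserved but empty becomes room for more concurrent requests.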

The scaling question everyone asks wrong

When a team’s AI infrastructure starts struggling under load, the first question is almost always: “should we add more GPUs?”

This is the wrong question asked at the wrong time, for the same reason that “should we add more database servers” was the wrong first question when a database was struggling in 2008. The right question is: “why are we using our current GPUs so inefficiently?”

GPU utilisation that is below sixty percent is almost always a batching problem. Requests are not being grouped efficiently before hitting the GPU. You can add more GPUs and halve your utilisation number, which means you now have twice the infrastructure running at thirty percent capacity instead of one set running at sixty. You’ve doubled your cost and solved nothing.

High GPU utilisation with latency that is still bad is almost always a model sizing problem. The model is too large for the request volume being served. A smaller or quantized model, or a different architecture, may meet your latency requirements at a fraction of the compute cost.
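These two rules of thumb can be written down as a first-pass triage function. This is a sketch of the heuristics as stated here, not a diagnostic tool: the sixty percent threshold comes from the text, while the function name, the choice of p95 latency, and the wording of the fallback case are assumptions.

```python
def triage(gpu_utilisation_pct: float, p95_latency_ms: float, latency_target_ms: float) -> str:
    """Suggest where to look first, before deciding to add GPUs."""
    if gpu_utilisation_pct < 60:
        # Low utilisation: requests are probably not being grouped well
        return "batching: group more requests per GPU call before adding hardware"
    if p95_latency_ms > latency_target_ms:
        # Busy GPU but slow responses: the model is likely too large for the load
        return "model sizing: try a smaller or quantized model"
    return "no obvious inefficiency: scaling out may be justified"


print(triage(gpu_utilisation_pct=35, p95_latency_ms=400, latency_target_ms=500))
print(triage(gpu_utilisation_pct=92, p95_latency_ms=1800, latency_target_ms=500))
```

The ordering matters: the batching check comes first because adding hardware or changing models will not fix a GPU that is being fed one request at a time.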

# Measuring what actually matters before deciding to scale

import time
from prometheus_client import Histogram, Gauge

# These metrics tell you where the problem actually is

GPU_UTILISATION = Gauge(
    'gpu_utilisation_percent',
    'GPU compute utilisation',
    ['device_id']
)

GPU_MEMORY_USED = Gauge(
    'gpu_memory_used_bytes',
    'GPU VRAM in use',
    ['device_id']
)

BATCH_SIZE = Histogram(
    'inference_batch_size',
    'Number of requests processed per batch',
    buckets=[1, 2, 4, 8, 16, 32, 64]
)

TOKENS_PER_SECOND = Histogram(
    'inference_tokens_per_second',
    'Throughput of inference in tokens per second',
    buckets=[10, 25, 50, 100, 200, 400, 800]
)

TIME_TO_FIRST_TOKEN = Histogram(
    'inference_ttft_seconds',
    'Time from request to first token generated',
    buckets=[.05, .1, .25, .5, 1, 2, 5]
)

REQUEST_QUEUE_DEPTH = Gauge(
    'inference_queue_depth',
    'Number of requests waiting for GPU'
)


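class InstrumentedInferenceServer:
    def __init__(self, queue, run_inference):
        # queue: the pending-request queue (anything exposing qsize())
        # run_inference: async callable taking a list of prompts and
        # returning a list of completions
        self.queue = queue
        self._run_inference = run_inference

    async def infer(self, prompts: list[str]) -> list[str]:
        BATCH_SIZE.observe(len(prompts))
        REQUEST_QUEUE_DEPTH.set(self.queue.qsize())

        start = time.perf_counter()
        results = await self._run_inference(prompts)
        duration = time.perf_counter() - start

        # Whitespace word count is only a rough proxy for tokens;
        # use the tokenizer's own count where it is available
        total_tokens = sum(len(r.split()) for r in results)
        TOKENS_PER_SECOND.observe(total_tokens / duration)

        return results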

When you can see batch sizes, queue depth, tokens per second, and time-to-first-token alongside GPU utilisation and VRAM usage, the question of “do we need more GPUs” almost answers itself. Usually the answer is “no, we need to batch better” or “no, we need to use a smaller model” and scaling turns out to be unnecessary.

The cold start problem nobody planned for

Databases take seconds to start. GPU inference servers take minutes.

A database that restarts unexpectedly is back within thirty seconds in most cases. An LLM inference server that restarts needs to load model weights from storage into VRAM before it can serve any requests. A 70B parameter model stored in 4-bit quantization is roughly 35GB. Loading 35GB from network storage into VRAM, at typical cloud storage bandwidth, takes several minutes under good conditions.
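The load time is simple division, and it is worth running with your own numbers. The bandwidth figures below are hypothetical examples, not measurements of any particular cloud, and the estimate covers only the transfer itself: image pulls, process start-up, and warm-up requests come on top.

```python
def load_seconds(model_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on time to move model weights from storage into VRAM."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return model_gb / bandwidth_gb_per_s


# 35 GB of weights at three assumed effective storage bandwidths
for bandwidth in (0.2, 1.0, 5.0):
    seconds = load_seconds(35, bandwidth)
    print(f"{bandwidth:>4} GB/s -> {seconds:6.0f} s ({seconds / 60:.1f} min)")
```

The spread between the slowest and fastest case is why some teams keep weights on local disk or in a node-level cache rather than pulling them from remote storage on every start.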

This changes incident dynamics entirely. A database blip is a brief interruption. A GPU server blip is a several-minute outage for every affected instance. Autoscaling, which works well for stateless application servers and adequately for databases, works badly for GPU inference because new instances take so long to become ready.

The teams that have worked this out run warm pools: GPU instances with models already loaded, sitting idle, waiting for traffic that hasn’t arrived yet. This feels wasteful. It’s the only way to handle traffic spikes without minutes-long latency blowouts.

# Kubernetes deployment with warm pool strategy
# Minimum replicas keep instances warm even at low traffic
# This costs money. The alternative is cold start latency.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 3  # Never scale below this. These are your warm pool.
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: inference-server
        image: myteam/inference:latest
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          # Model loading takes time. This probe must not pass
          # until the model is fully loaded in VRAM.
          initialDelaySeconds: 180   # 3 minutes minimum
          periodSeconds: 10
          failureThreshold: 30       # 5 more minutes of retries
        lifecycle:
          preStop:
            exec:
              # Drain in-flight requests before shutdown
              command: ["/bin/sh", "-c", "sleep 30"]

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 3    # Warm pool floor
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "5"  # Scale when queue exceeds 5 requests per replica
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 120   # Add 2 pods max every 2 minutes
                             # Fast enough to respond, slow enough
                             # to not over-provision during spikes
    scaleDown:
      stabilizationWindowSeconds: 600  # Wait 10 minutes before scaling down
                                       # Cold start cost makes yo-yo scaling
                                       # extremely expensive

The scaleDown stabilization window is long deliberately. The cold start cost is so high that scaling down and back up in response to a brief traffic dip is more expensive than just keeping the instances running. This is counterintuitive if you’re coming from stateless web services. It’s the operational reality of GPU infrastructure.

The cost model is upside down

Database costs scale with data volume and query complexity. You pay more as your data grows and your queries get more complex.

GPU costs scale with time. You pay for every second a GPU is provisioned, whether it’s serving requests or not. An idle GPU costs the same as a busy GPU.

This inverts the normal infrastructure economics. With stateless application servers, idle capacity is cheap: you can scale to zero when traffic drops and pay nothing. With GPU inference, scaling to zero means cold starts when traffic returns. The minimum viable capacity for a production inference service is not zero. It’s whatever your warm pool needs to be, which is determined by your acceptable cold start latency and your traffic spike patterns.
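One way to turn that into a number, offered as a rough sketch rather than a sizing method: since new replicas cannot help until a cold start completes, the warm pool has to absorb the highest request rate you expect to arrive within one cold-start window. The function name, the headroom factor, and the example rates are all assumptions.

```python
import math


def warm_pool_size(peak_rps: float, per_replica_rps: float, headroom: float = 0.2) -> int:
    """Replicas needed to absorb `peak_rps` without waiting on a cold start.

    peak_rps: highest request rate expected within one cold-start window
    per_replica_rps: measured sustainable throughput of one warm replica
    headroom: safety margin on top of the expected peak
    """
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    return max(1, math.ceil(peak_rps * (1 + headroom) / per_replica_rps))


# Hypothetical service: spikes to 40 req/s, each replica sustains 18 req/s
print(warm_pool_size(peak_rps=40, per_replica_rps=18))
```

Both inputs should come from measurement. The per-replica figure in particular depends heavily on how well requests are being batched.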

The teams that have made peace with this have stopped thinking about GPU cost as a variable cost that tracks usage and started thinking about it as a fixed cost that buys capacity. The question is not “how do we pay less for GPU when traffic is low?” The question is “what is the right amount of always-on capacity, and how do we make sure we use it efficiently?”

Efficient use means high batch fill rates, high token throughput per GPU-hour, low idle time. The metrics above are the inputs to this calculation. Without them you’re guessing about whether your infrastructure is sized correctly.
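The calculation itself is one line. The sketch below converts an hourly GPU price and a measured token throughput into cost per million tokens. The price and throughput figures are hypothetical and only there to show how strongly throughput drives unit cost.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Unit cost of generation for a GPU that is billed whether busy or idle."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Same assumed GPU price, two assumed levels of batching efficiency
for label, tps in (("poorly batched", 400), ("well batched", 2500)):
    print(f"{label:>14}: ${cost_per_million_tokens(2.50, tps):.2f} per million tokens")
```

Because the hourly price is fixed, every improvement in sustained tokens per second shows up directly as a lower unit cost, and every idle minute raises it.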

The pattern that’s emerging

The teams operating GPU infrastructure well in 2026 look, in their operational discipline, a lot like the teams that operated databases well in 2012 after enough people had been burned that the patterns were starting to solidify.

They treat GPU utilisation as a lagging indicator and token throughput as the leading one. They instrument everything: batch sizes, queue depth, time-to-first-token, VRAM usage, KV cache hit rates. They size their warm pools based on measured traffic patterns rather than intuition. They run the smallest model that meets their quality bar, not the largest model they can afford, because smaller models batched efficiently outperform larger models batched poorly on almost every practical metric.

They’ve also accepted something that takes a while to accept: that the right abstraction for GPU infrastructure is not “fast compute” but “throughput capacity.” The question is not “how fast can this machine process one request?” GPUs are fast at that regardless. The question is “how many requests per dollar can this infrastructure handle at acceptable latency?” That question requires different metrics, different architecture, and a different mental model than the one most teams bring from their experience with CPU infrastructure.

The database analogy runs deeper than it looks. In 2004, the teams that treated the database as a black box (put data in, get data out, add more RAM when it’s slow) eventually hit walls that their architecture couldn’t get past. The teams that understood what was happening inside the database (query plans, index usage, lock contention, buffer pool behaviour) built things that scaled.

The GPU is not a black box. It has a memory hierarchy, a batching model, a cost structure, and performance characteristics that reward understanding and punish ignorance in the same way the database did.

The patterns are forming. The teams learning them now will have the same advantage in five years that database-literate engineers had in 2015.

The mistakes are happening right now, at scale, expensively. Most of them are the same mistakes. Most of them are avoidable.