Kubernetes Is Not Your First Problem

Kubernetes is genuinely impressive technology. The people who built it solved hard problems elegantly. The ecosystem around it — Helm, the operators, the service meshes, the GitOps tooling — represents years of serious engineering thinking about how to run software reliably at scale.

None of that means you should be running it.

I’ve had a version of the same conversation probably a dozen times in the last year. A team is having reliability problems, or deployment problems, or they’re moving slowly and they can’t figure out why. We start talking through the stack. At some point it comes out: they’re on Kubernetes. They have two or three backend services, maybe a frontend, a database, a cache. Their largest customer has a few hundred users. They are using Kubernetes because it seemed like the right thing to do, because their previous job used it, because the job posting they want to be qualified for lists it, because it’s what serious engineering teams use.

The reliability problems, the deployment complexity, the reason they’re moving slowly — in most of these conversations, Kubernetes is a significant part of the answer. Not because Kubernetes is bad. Because they adopted it before they had the problems it solves.

What Kubernetes actually solves

To understand when you don’t need Kubernetes, you need to be clear about what it actually does.

Kubernetes is a container orchestration system. Its core job is to take a desired state (I want three replicas of this service running, with these resource limits, in these availability zones, with this update strategy) and maintain that state even when things go wrong. Node goes down: Kubernetes reschedules the pods. A rollout goes bad: the rolling update stops replacing healthy pods when the new ones never become ready, and rolling back is a single command. Traffic spikes: an autoscaler you have configured adds replicas. Nearly all of this happens automatically, based on configuration you’ve written ahead of time.

This is enormously valuable when you’re running services at scale, when the cost of a service being unavailable is high, when you have enough traffic that manual intervention can’t keep up with failures, and when you have enough services that managing them individually is genuinely burdensome.

None of these conditions hold for most teams at most stages of their product’s life. A team with three services and a few hundred users does not have enough traffic that manual intervention can’t keep up with failures. They probably don’t have failures that require the kind of automated response Kubernetes provides. The cost of a five-minute outage, while not zero, is not the kind of existential problem that justifies significant operational complexity to prevent.

What they have is a deployment pipeline that takes twenty minutes to understand, a local development environment that requires a local Kubernetes cluster to accurately simulate production, and a collection of YAML that nobody is fully confident they understand. They have adopted the operational complexity appropriate for Google’s workloads to solve a problem that a single VM and a deployment script could handle.

The complexity tax

Kubernetes has a complexity tax. You pay it whether or not you’re getting value from the technology.

The tax has several line items.

Local development friction. Running Kubernetes locally — whether you use minikube, kind, k3s, or Docker Desktop’s built-in cluster — is meaningfully more complex than running services directly. Services that worked with docker compose up now require manifests, namespaces, ingress configuration, and a mental model of how the local cluster maps to production. New developers spend days getting their local environment right instead of hours. Experienced developers maintain a parallel set of mental models — how the service actually works, and how Kubernetes is configured to run it.

Debugging requires new skills. When something goes wrong in a traditional deployment, you ssh into the server, look at the logs, check the process. When something goes wrong in Kubernetes, you need to know the difference between a pod failing to schedule, a pod crashing at startup, a pod running but failing health checks, a service not routing to the pod, an ingress not routing to the service, and a pod running but throwing application errors. Each of these has different symptoms and different debugging commands. Building this mental model takes months. Until the team has it, debugging production incidents is slower than it should be.

# The debugging path that should take two minutes
# takes twenty when you don't know which layer to look at

# Is the pod running?
kubectl get pods -n production

# Why isn't it running?
kubectl describe pod my-service-7d9f8b-xkp2q -n production

# What are the logs saying?
kubectl logs my-service-7d9f8b-xkp2q -n production --previous

# Is the service routing correctly?
kubectl get endpoints my-service -n production

# Is the ingress configured correctly?
kubectl describe ingress my-service -n production

# What events have happened in the namespace?
kubectl get events -n production --sort-by='.lastTimestamp'

This is not inherently bad. These are the right tools for the right problem. But if your problem is “a small service is down and I need to fix it in the next ten minutes,” working through this hierarchy under pressure, with a team that’s still learning it, is not the fastest path to resolution.
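To make the layering concrete, here is a minimal sketch of the triage logic those commands imply. The `Symptoms` fields and the `next_command` helper are names I made up for illustration; the mapping from symptom to command is a simplification, not an official runbook.

```python
from dataclasses import dataclass


@dataclass
class Symptoms:
    """What you can read off the pod list and the Service (hypothetical shape)."""
    pod_status: str              # STATUS column: "Pending", "CrashLoopBackOff", "Running", ...
    pod_ready: bool              # True when READY shows every container ready, e.g. 1/1
    service_has_endpoints: bool  # True when the Service has at least one endpoint


def next_command(s: Symptoms, pod: str, service: str, ns: str = "production") -> str:
    """Return the kubectl command most likely to explain the symptom."""
    if s.pod_status == "Pending":
        # Not scheduled yet: the Events section of describe usually says why.
        return f"kubectl describe pod {pod} -n {ns}"
    if s.pod_status in {"CrashLoopBackOff", "Error"}:
        # Crashing at startup: read the logs of the container that just died.
        return f"kubectl logs {pod} -n {ns} --previous"
    if s.pod_status == "Running" and not s.pod_ready:
        # Running but not ready: describe shows the failing probe events.
        return f"kubectl describe pod {pod} -n {ns}"
    if not s.service_has_endpoints:
        # Healthy pod, but the Service selects nothing: compare selector and labels.
        return f"kubectl describe service {service} -n {ns}"
    # Pod and Service look fine, so look one layer up.
    return f"kubectl describe ingress {service} -n {ns}"
```

Writing it down like this is the point: on a VM the whole tree collapses to “is the process up, and what do the logs say.”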

Configuration is a surface area for errors. Kubernetes configuration is powerful and complex. Resource limits, liveness probes, readiness probes, pod disruption budgets, affinity rules, network policies: each of these is a knob that can be set correctly or incorrectly. Set the liveness probe threshold too low and Kubernetes will restart your container while it’s busy with a legitimate slow request. Set resource limits too tight and your pod gets OOMKilled under load. Forget to set resource requests and limits and a noisy neighbor pod can starve your service of resources.

These are all real problems with real consequences, and they’re all problems you only have because you’re running Kubernetes. A service running on a VM doesn’t have liveness probes. It either runs or it doesn’t.
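The liveness probe case is easy to put numbers on. The sketch below is back-of-the-envelope arithmetic, not the kubelet’s exact scheduling (real probes add jitter), and the function names are mine. It assumes the probe fields `initialDelaySeconds`, `periodSeconds`, `failureThreshold`, and `timeoutSeconds` (default 1).

```python
def startup_budget_s(initial_delay_s: int, period_s: int,
                     failure_threshold: int, timeout_s: int = 1) -> int:
    """Approximate time a new container has to start answering its liveness probe.

    The first probe fires around initial_delay_s; the container is restarted
    once failure_threshold consecutive probes have failed.
    """
    return initial_delay_s + (failure_threshold - 1) * period_s + timeout_s


def shortest_fatal_stall_s(period_s: int, failure_threshold: int,
                           timeout_s: int = 1) -> int:
    """Approximate shortest stall that can get a running container restarted.

    Best case for the stall is that it begins just before a probe fires,
    so only (failure_threshold - 1) further periods have to pass.
    """
    return (failure_threshold - 1) * period_s + timeout_s


# Illustrative values: 10s initial delay, probe every 5s, 3 failures allowed.
print(startup_budget_s(10, 5, 3))    # about 21 seconds to boot
print(shortest_fatal_stall_s(5, 3))  # about 11 seconds of unresponsiveness
```

If your service can legitimately block for longer than that second number (a long garbage collection pause, a slow migration at boot), the probe settings are a bug waiting for load.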

The mental overhead of desired state. The Kubernetes mental model is declarative: you describe what you want, and Kubernetes figures out how to achieve it. This is powerful at scale. It can be disorienting when you’re debugging, because the actual state and the desired state can diverge in ways that are not immediately obvious, and the controller loop that reconciles them can behave in ways that surprise you until you understand it.
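A toy version of the idea, to show where the divergence comes from. This is a deliberately tiny model of a control loop with invented names (`reconcile`, `pod-N`); real controllers watch the API server and do far more.

```python
def reconcile(desired: int, running: list[str], next_id: int) -> tuple[list[str], int]:
    """One pass of a toy control loop: compare desired and actual state, then act."""
    running = list(running)
    while len(running) < desired:
        running.append(f"pod-{next_id}")  # start a replacement
        next_id += 1
    while len(running) > desired:
        running.pop()                     # stop a surplus pod
    return running, next_id


desired_replicas = 3
pods, counter = reconcile(desired_replicas, [], next_id=0)  # first convergence

pods.remove("pod-1")     # a node fails: actual state now differs from desired
diverged = list(pods)    # this is the window you are debugging in

pods, counter = reconcile(desired_replicas, pods, counter)  # the loop converges again
print(diverged, "->", pods)
```

Notice that the replacement is a new pod with a new name, not the old one restarted. That is the kind of detail that surprises people mid-incident: the thing you were inspecting a minute ago no longer exists.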

What you should probably be running instead

There are simpler options, and in 2026 they are more capable than they’ve ever been.

A single well-configured VM handles more than you think. A 4-core, 16GB RAM VM, available from many cloud providers for under a hundred dollars a month, can handle serious production traffic for most applications at most stages of their growth. Combined with a process manager like systemd or supervisord, automatic restarts on failure, a reverse proxy like nginx or Caddy handling TLS and routing, and automated backups, you have a deployment that is simple enough for any developer on the team to understand and debug at 3am.

The reliability argument against this — “but if the VM goes down, everything goes down” — is often overstated. How often does a VM actually go down? For most cloud providers on reasonable hardware, the answer is rarely. The more common failure modes are application errors and deployment mistakes, and those hit you the same way whether you’re running on a VM or Kubernetes.
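It helps to translate “rarely” into minutes. The availability percentages below are illustrative round numbers, not any provider’s SLA, and a 30-day month is assumed.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200


def downtime_minutes_per_month(availability_percent: float) -> float:
    """Minutes of downtime per 30-day month implied by an availability figure."""
    return MINUTES_PER_30_DAY_MONTH * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% available -> {downtime_minutes_per_month(pct):.1f} min/month down")
```

A single machine that manages 99.9% is down for roughly 43 minutes a month. Whether that is acceptable is a product question, and for a product with a few hundred users the answer is usually yes.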

Docker Compose is underrated at small scale. For teams that want container isolation without orchestration complexity, Docker Compose gives you service definitions, networking, volume management, and enough operational leverage for most early-stage products. The mental model is simple. The local development environment matches production. A deployment is a single line: docker compose pull && docker compose up -d.

# docker-compose.yml
# Simple. Readable. Deployable.
services:
  api:
    image: ghcr.io/myteam/api:${VERSION}
    restart: unless-stopped
    environment:
      DATABASE_URL: ${DATABASE_URL}
      REDIS_URL: redis://cache:6379
    depends_on:
      - cache
    ports:
      - "8000:8000"

  worker:
    image: ghcr.io/myteam/api:${VERSION}
    restart: unless-stopped
    command: python -m celery -A app worker  # "app" is a placeholder for your Celery module
    environment:
      DATABASE_URL: ${DATABASE_URL}
      REDIS_URL: redis://cache:6379
    depends_on:
      - cache

  cache:
    image: redis:7-alpine
    restart: unless-stopped
    volumes:
      - redis_data:/data

volumes:
  redis_data:

This is not a compromise. This is an appropriate tool for the problem size.
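The whole deployment script for a setup like this can be a few lines. Here is one possible sketch in Python; the `deploy` and `deploy_commands` names and the `dry_run` flag are my own, and it assumes it runs in the directory holding the compose file, with `DATABASE_URL` already set in the environment.

```python
import os
import subprocess


def deploy_commands() -> list[list[str]]:
    """The two deployment steps, as argv lists."""
    return [
        ["docker", "compose", "pull"],
        ["docker", "compose", "up", "-d"],
    ]


def deploy(version: str, dry_run: bool = False) -> list[str]:
    """Run each step with VERSION set, stopping at the first failure."""
    env = {**os.environ, "VERSION": version}  # consumed by ${VERSION} in the compose file
    executed = []
    for argv in deploy_commands():
        if not dry_run:
            # check=True raises CalledProcessError if the command exits non-zero.
            subprocess.run(argv, env=env, check=True)
        executed.append(" ".join(argv))
    return executed
```

Calling `deploy("1.4.2")` pulls the images for that tag and recreates the containers whose image changed. Rolling back is the same call with the previous version.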

Managed platforms eliminate the infrastructure problem entirely. Railway, Render, Fly.io, and similar platforms take your Docker container and handle the orchestration, scaling, networking, and TLS for you. You don’t think about Kubernetes because there is no cluster for you to operate, whatever the platform runs underneath. There’s a deployment pipeline and a dashboard. The operational overhead is near zero.

The cost argument against managed platforms (“it’s more expensive per compute unit than running your own infrastructure”) is almost always wrong once you account for the engineering time that infrastructure requires. Managing a Kubernetes cluster is not free. Engineering time is the most expensive resource a small team has, and every hour spent on infrastructure that a managed platform would handle is an hour not spent on the product.
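The comparison is worth doing with your own numbers. Every figure below is a placeholder I picked to show the shape of the calculation; none of them is real pricing or a measured workload.

```python
def monthly_cost_usd(compute_usd: float, ops_hours: float, hourly_rate_usd: float) -> float:
    """Total monthly cost: the compute bill plus the engineering time spent on operations."""
    return compute_usd + ops_hours * hourly_rate_usd


def break_even_hours(compute_premium_usd: float, hourly_rate_usd: float) -> float:
    """Ops hours per month at which a pricier platform pays for itself."""
    return compute_premium_usd / hourly_rate_usd


# Placeholder inputs: swap in your own bill and your own time tracking.
self_managed = monthly_cost_usd(compute_usd=300, ops_hours=20, hourly_rate_usd=100)
managed = monthly_cost_usd(compute_usd=600, ops_hours=2, hourly_rate_usd=100)

print(self_managed, managed)            # totals including engineering time
print(break_even_hours(600 - 300, 100)) # hours saved needed to justify the premium
```

With these inputs the managed platform costs twice as much in compute and pays for itself if it saves three engineering hours a month.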

When Kubernetes is the right answer

I’ve spent this article arguing against premature Kubernetes adoption. I want to be honest about when it’s the right choice, because the argument is not that Kubernetes is bad.

When you have actual scaling requirements. If your traffic is spiky enough and unpredictable enough that you need to scale services up and down automatically based on demand, and if the cost of scaling is significant enough to matter, Kubernetes autoscaling is genuinely valuable. This is a real production requirement. It’s also not a requirement most teams have for most of their product’s life.
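For reference, the core of the scaling decision is a one-line formula, which the Kubernetes documentation gives for the Horizontal Pod Autoscaler. The sketch below implements only that formula; the real autoscaler also applies a tolerance band, min and max replica bounds, and stabilization windows, all omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Replica count that would bring the per-pod metric back to its target."""
    return math.ceil(current_replicas * (current_metric / target_metric))


# Illustrative: 3 replicas with a 60% average CPU utilization target.
print(desired_replicas(3, 90, 60))  # load is above target, so scale out
print(desired_replicas(3, 30, 60))  # load is below target, so scale in
```

If your traffic never moves that ratio far from 1, you are paying for machinery you don’t use.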

When you have enough services that the overhead is justified. The complexity tax is roughly fixed: you pay it whether you have three services or thirty. At three services, the tax per service is enormous. At thirty services, the tax per service may be justified by the coordination benefits Kubernetes provides. There’s a number somewhere in between where it starts making sense. Most teams adopt Kubernetes before they reach that number.
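The per-service arithmetic, with a made-up figure of 60 engineering hours a month for the fixed cost of running the platform. The number is purely illustrative; measure your own.

```python
def tax_per_service(fixed_hours_per_month: float, services: int) -> float:
    """Spread a fixed operational cost evenly across the services that share it."""
    return fixed_hours_per_month / services


FIXED_HOURS = 60  # hypothetical: upgrades, on-call, YAML maintenance

for n in (3, 10, 30):
    print(f"{n} services -> {tax_per_service(FIXED_HOURS, n):.0f} hours per service per month")
```

At three services that is twenty hours a month per service. At thirty it is two, which is plausibly less than the per-service toil the platform removes.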

When you have a dedicated platform team. Running Kubernetes well is a specialty. It requires ongoing attention to cluster updates, security patches, resource allocation, and incident response. If you have a dedicated platform engineering team whose job is to manage the infrastructure, the operational cost is contained and the benefits can be realized. If you’re expecting product engineers to manage the cluster alongside building features, the cost is distributed across the entire team’s velocity.

When your deployment targets require it. Some enterprise customers require on-premises deployment. Some regulatory environments require specific isolation guarantees that Kubernetes provides. Some products — developer tools, data platforms, ML infrastructure — are designed to run in Kubernetes environments and need to be built and tested there. These are real requirements. They’re not the requirements of most SaaS products at most stages.

The AI angle

Here’s the dimension of this conversation that’s changing right now.

AI-assisted development has increased the rate at which teams can produce working code. A team of four can now build and ship things at a pace that would have required eight people two years ago. This is good. The side effect: that same team of four is now operating more services, more surface area, more moving parts than they were before. The product grows faster than the team.

In this environment, operational complexity has a higher cost than it used to. Every hour a four-person team spends debugging a Kubernetes networking issue is a larger fraction of their total capacity than it would be for an eight-person team. The leverage argument for simple infrastructure is stronger now than it was when you could throw people at operational problems.

The teams I see doing the best right now have made a deliberate choice to keep their infrastructure as simple as possible, specifically because they’re moving fast on the product side and they cannot afford to have the infrastructure be a drag. They’ve consciously deferred Kubernetes — “we’ll migrate when we have ten services and actual autoscaling requirements” — and they’re running managed platforms or simple VM deployments in the meantime.

AI is also changing what Kubernetes complexity looks like in practice. Writing Kubernetes YAML is something models are good at: the syntax is well-represented in training data and the structure is predictable. Teams are generating manifests with AI assistance and deploying them without fully understanding what they’ve deployed. This is the vibe coding problem applied to infrastructure: the configuration is locally plausible, the deployment succeeds, and the failure mode only appears under conditions the team didn’t anticipate.

# Generated by AI. Looks reasonable. Has a problem.
# (The name and labels are placeholders.)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: myapp:latest
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 3
        # Missing: resource requests and limits
        # Missing: readinessProbe (different from liveness)
        # Missing: terminationGracePeriodSeconds (a pod-level field)
        # "latest" tag means you don't know what's deployed

Infrastructure-as-code demands the same discipline as application code: you have to understand what you ship. Generated YAML that you don’t understand is technical debt with operational consequences.
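Some of that review can be mechanical. As a minimal sketch, here is a check over a container spec that has already been parsed into a dict (parsing the YAML is left out to keep this stdlib-only). The `review_container` name and the wording of the findings are mine; established linters cover far more than this.

```python
def review_container(container: dict) -> list[str]:
    """Flag the omissions called out in the manifest above."""
    findings = []

    image = container.get("image", "")
    last_part = image.rsplit("/", 1)[-1]  # ignore any registry host:port prefix
    tag = last_part.rsplit(":", 1)[1] if ":" in last_part else "latest"
    if tag == "latest":
        findings.append("image is unpinned: use a version tag or digest")

    resources = container.get("resources", {})
    if "requests" not in resources:
        findings.append("no resource requests")
    if "limits" not in resources:
        findings.append("no resource limits")

    if "readinessProbe" not in container:
        findings.append("no readinessProbe")
    return findings


generated = {
    "name": "api",
    "image": "myapp:latest",
    "livenessProbe": {"httpGet": {"path": "/health", "port": 8000}},
}
for finding in review_container(generated):
    print("-", finding)
```

A check like this catches the omissions. It does not tell you whether the values that are present make sense for your service, and that part still requires someone who understands the configuration.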

The question worth asking

Before you adopt Kubernetes, or before you continue running it if you’re already there, the question worth sitting with is this: what problem am I actually solving right now, and is this the right tool for that problem?

Not the problem you anticipate having in two years. Not the problem the company you want to work at is solving. The problem you have today, in your system, with your team, at your scale.

If the answer is “I need to orchestrate dozens of services across multiple availability zones with automatic scaling and rolling deployments and I have a team capable of operating this infrastructure” — Kubernetes is the right answer.

If the answer is “I need to deploy my three services reliably and move fast on the product” — you have simpler options, and using them is not a compromise. It’s an engineering decision that matches the tool to the problem.

The best infrastructure is the infrastructure you don’t have to think about. For most teams at most stages, that’s not Kubernetes.

It might be one day. When that day comes, you’ll know, because you’ll have specific problems that Kubernetes solves and simpler tools don’t. Until then, the boring option is usually the right one.

Resist the pull of impressive technology. Ship the product.