Everyone Is Writing Terraform. Almost Nobody Is Writing It Well.

Terraform has won. That much is settled. If your team is provisioning cloud infrastructure in 2026 and you’re not using some form of infrastructure as code, you’re in the minority. The tooling has matured, the providers cover everything, the community is enormous, and the pitch — your infrastructure is code, so it’s version-controlled, reviewable, reproducible, and safe — is genuinely compelling.

The problem is that most teams get the tool without getting the discipline. They write Terraform that provisions infrastructure but doesn’t deliver on the actual promise: that you can understand what you have, trust what you’re changing, and recover when something goes wrong.

I’ve looked at a lot of Terraform codebases. The good ones are rarer than they should be. The bad ones share the same failure patterns, in the same order of discovery, with the same consequences.

This is what those patterns look like, why they develop, and what the alternative is.

The state file problem nobody talks about enough

Everything in Terraform depends on the state file. It’s the source of truth for what Terraform believes exists in your infrastructure. When you run terraform plan, Terraform compares the state file against your configuration and your actual cloud resources to figure out what needs to change. When state is wrong, everything downstream is wrong.

Most teams start with local state. It’s the default. You run terraform init, a terraform.tfstate file appears locally, and everything works. Until it doesn’t.

Local state means the state file lives on someone’s machine. The first time a second person runs Terraform, they don’t have the state file, so Terraform thinks nothing exists, and tries to create everything from scratch. This is a terrifying moment that usually involves a lot of terraform import commands and a conversation about why this wasn’t set up correctly from the start.

Remote state solves this. But remote state without locking introduces a new problem: two people running terraform apply simultaneously will corrupt the state file, because both are writing to it concurrently without coordination. S3 without DynamoDB locking, or any remote backend without locking, is worse than local state in some ways because it creates the illusion of safety while being corruptible.

The correct setup from day one:

# backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true

    # The part most people forget:
    dynamodb_table = "mycompany-terraform-locks"
  }
}

# The DynamoDB table for locking (bootstrap this manually or with a
# separate Terraform configuration before anything else)
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "mycompany-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

State file in S3, encrypted at rest. Locking via DynamoDB. This is not advanced Terraform. This is the minimum viable setup for a team of more than one person. It should be the first thing you configure, before you write a single resource.
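On newer Terraform releases the S3 backend can take the lock itself. A minimal sketch, assuming Terraform 1.10 or later, where the use_lockfile argument is available; the bucket name and key are the ones used above. Check the backend documentation for your version before dropping the DynamoDB table.

```hcl
# backend.tf (alternative sketch, assumes Terraform 1.10+)
terraform {
  required_version = ">= 1.10"

  backend "s3" {
    bucket  = "mycompany-terraform-state"
    key     = "production/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true

    # S3-native locking: Terraform writes a lock object next to the
    # state file, so no DynamoDB table is involved.
    use_lockfile = true
  }
}
```

Either mechanism gives the same guarantee: a second apply waits or fails instead of writing over the first.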

Almost nobody does this. Most teams migrate to remote state after the first incident where someone runs Terraform from the wrong machine and tries to recreate the entire production database.

The monolith problem

A team starts with one Terraform directory. It has a VPC, some EC2 instances, an RDS database, some S3 buckets, a few IAM roles. Perfectly manageable.

Six months later, the same directory has a hundred and fifty resources. Running terraform plan takes eight minutes because Terraform is refreshing the state of every resource against the live cloud account. A junior developer changes a security group and runs terraform apply. The plan also contains a diff on the RDS instance, because someone changed a parameter in the meantime. Now you have an unexpected database change bundled into a security group deployment.

The blast radius of any Terraform apply is the entire state file. In a monolith, that’s everything.

The fix is state isolation. Different pieces of infrastructure that have different change rates and different risk profiles should live in different state files. The structure I’ve found most useful:

terraform/
├── bootstrap/        # The state backend itself. Applied once, rarely touched.
│   ├── main.tf
│   └── backend.tf
│
├── networking/       # VPC, subnets, routing. Changes rarely.
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── backend.tf
│
├── data/             # RDS, ElastiCache, S3. Changes carefully.
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── backend.tf
│
├── compute/          # EC2, ECS, Lambda. Changes regularly.
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── backend.tf
│
└── application/      # App-specific resources. Changes frequently.
    ├── main.tf
    ├── variables.tf
    ├── outputs.tf
    └── backend.tf

Each directory has its own state file. A change to the application layer doesn’t touch the networking state. A terraform plan in compute/ takes thirty seconds instead of eight minutes because it’s only refreshing compute resources. The blast radius of any apply is scoped to the layer being changed.
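What makes the state files separate is nothing more than a different key per layer. A sketch of two of the backend.tf files, assuming the same bucket and lock table as before; the compute key is a hypothetical name chosen to match the networking key used elsewhere in this article.

```hcl
# networking/backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "mycompany-terraform-locks"
  }
}

# compute/backend.tf (a separate file in a separate directory)
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/compute/terraform.tfstate" # hypothetical key
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "mycompany-terraform-locks"
  }
}
```

The lock table can be shared, because each lock entry is tied to the state path it protects.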

Layers communicate through data sources that read remote state:

# In compute/main.tf — reading outputs from the networking layer
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "api" {
  ami           = var.ami_id
  instance_type = "t3.medium"

  # Using the VPC subnet from the networking layer
  subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}

This setup requires more upfront thought about your architecture. It prevents the sprawl that makes monolith Terraform increasingly dangerous over time.
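The other half of that contract lives in the networking layer: a layer can only read what another layer chooses to output. A minimal sketch, assuming hypothetical aws_vpc.main and aws_subnet.private resources; the CIDR ranges and subnet count are placeholders.

```hcl
# networking/main.tf (hypothetical, trimmed to what the outputs need)
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "private" {
  count      = 2
  vpc_id     = aws_vpc.main.id
  cidr_block = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
}

# networking/outputs.tf
output "vpc_id" {
  description = "ID of the shared VPC."
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "IDs of the private subnets, in creation order."
  value       = aws_subnet.private[*].id
}
```

Treat these outputs as a public interface. Renaming or removing one breaks every layer that reads it, and that breakage only shows up the next time the downstream layer is planned.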

Variables that aren’t variables

Here is a pattern I see constantly:

# variables.tf
variable "environment" {
  description = "Environment name"
  type        = string
}

variable "db_instance_class" {
  description = "RDS instance class"
  type        = string
  default     = "db.t3.medium"
}

variable "db_allocated_storage" {
  description = "RDS allocated storage in GB"
  type        = number
  default     = 20
}

And then:

# terraform.tfvars
environment          = "production"
db_instance_class    = "db.t3.medium"
db_allocated_storage = 20

The variables are just configuration buried one level deeper. They’re not actually variable — they’re hardcoded values that live in a .tfvars file instead of directly in the resource block. The abstraction adds cognitive overhead without adding flexibility.

Real variables have validation. They constrain what’s allowed, which prevents misconfiguration:

variable "environment" {
  description = "Deployment environment"
  type        = string

  validation {
    condition     = contains(["development", "staging", "production"], var.environment)
    error_message = "environment must be development, staging, or production."
  }
}

variable "db_instance_class" {
  description = "RDS instance class. Must be from the approved list."
  type        = string

  validation {
    condition = contains([
      "db.t3.micro",
      "db.t3.medium",
      "db.r6g.large",
      "db.r6g.xlarge"
    ], var.db_instance_class)
    error_message = "db_instance_class must be from the approved instance types."
  }
}

variable "db_allocated_storage" {
  description = "RDS allocated storage in GB. Must be between 20 and 1000."
  type        = number

  validation {
    condition     = var.db_allocated_storage >= 20 && var.db_allocated_storage <= 1000
    error_message = "db_allocated_storage must be between 20 and 1000 GB."
  }
}

Now a bad value fails fast: terraform plan stops with the error message as soon as it evaluates the variable, before it proposes a single change. Someone cannot accidentally deploy a db.r5.24xlarge instance because they copy-pasted from a different config. The validation is documentation that executes.
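With validation in place, the .tfvars files can carry the values that really do differ between environments. A sketch, assuming two hypothetical files, staging.tfvars and production.tfvars; the sizes are illustrative, not recommendations.

```hcl
# staging.tfvars (hypothetical file)
environment          = "staging"
db_instance_class    = "db.t3.micro"
db_allocated_storage = 20

# production.tfvars (hypothetical, separate file)
environment          = "production"
db_instance_class    = "db.r6g.large"
db_allocated_storage = 200
```

Each environment is then planned with its own file, for example terraform plan -var-file=production.tfvars, and every value passes through the same validation rules.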

Modules that aren’t reusable

Terraform modules are the abstraction mechanism. A module should take inputs, provision related resources, and return outputs. It should be usable in multiple contexts without modification.

Most Terraform modules I’ve seen are not reusable. They’re extractions — the team took a chunk of configuration out of the main file and put it in a subdirectory called modules/. The inputs are so specific to the current use case that the module can’t be used anywhere else without editing it. The outputs expose the wrong things. The internal resource naming assumes it’ll only ever be instantiated once.

A well-written module:

# modules/rds-postgres/variables.tf
variable "name" {
  description = "Name prefix for all resources. Must be unique within the account."
  type        = string
}

variable "environment" {
  type = string
}

variable "vpc_id" {
  description = "VPC where the database will be created."
  type        = string
}

variable "subnet_ids" {
  description = "Subnets for the DB subnet group. Should be private subnets."
  type        = list(string)
}

variable "allowed_security_group_ids" {
  description = "Security groups allowed to connect to the database."
  type        = list(string)
}

variable "instance_class" {
  type    = string
  default = "db.t3.medium"
}

variable "allocated_storage_gb" {
  type    = number
  default = 20
}

variable "deletion_protection" {
  description = "Prevent accidental deletion. Should be true in production."
  type        = bool
  default     = true
}

# modules/rds-postgres/main.tf
resource "aws_db_subnet_group" "this" {
  name       = "${var.name}-${var.environment}"
  subnet_ids = var.subnet_ids

  tags = {
    Name        = "${var.name}-${var.environment}"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_security_group" "this" {
  name        = "${var.name}-${var.environment}-rds"
  description = "Controls access to ${var.name} RDS instance"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = var.allowed_security_group_ids
  }

  tags = {
    Name        = "${var.name}-${var.environment}-rds"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_db_instance" "this" {
  identifier = "${var.name}-${var.environment}"

  engine         = "postgres"
  engine_version = "16"
  instance_class = var.instance_class

  allocated_storage     = var.allocated_storage_gb
  max_allocated_storage = var.allocated_storage_gb * 4

  db_name  = replace(var.name, "-", "_")
  username = "dbadmin"
  password = random_password.this.result

  db_subnet_group_name   = aws_db_subnet_group.this.name
  vpc_security_group_ids = [aws_security_group.this.id]

  backup_retention_period = var.environment == "production" ? 7 : 1
  deletion_protection     = var.deletion_protection
  skip_final_snapshot     = var.environment != "production"

  storage_encrypted = true

  tags = {
    Name        = "${var.name}-${var.environment}"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

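resource "random_password" "this" {
  length  = 32
  special = false
}

# Store the generated password where consumers can fetch it.
# outputs.tf exposes this secret's ARN.
resource "aws_secretsmanager_secret" "db_password" {
  name = "${var.name}-${var.environment}-db-password"
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = random_password.this.result
}

# modules/rds-postgres/outputs.tf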
output "endpoint" {
  description = "Connection endpoint for the RDS instance."
  value       = aws_db_instance.this.endpoint
  sensitive   = true
}

output "port" {
  value = aws_db_instance.this.port
}

output "database_name" {
  value = aws_db_instance.this.db_name
}

output "security_group_id" {
  description = "Security group ID. Reference this to allow access from other resources."
  value       = aws_security_group.this.id
}

output "password_secret_arn" {
  description = "ARN of the Secrets Manager secret containing the database password."
  value       = aws_secretsmanager_secret.db_password.arn
  sensitive   = true
}

This module can be instantiated multiple times, for different services, in different environments, without modification. The naming is parameterised. The security is environment-aware. The inputs and outputs are general enough to be useful in contexts the original author didn’t anticipate.
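The test of that claim is calling the module twice. A sketch, assuming the module lives at ../modules/rds-postgres relative to the caller; the two service names and the input variables declared here are hypothetical.

```hcl
# data/main.tf (hypothetical caller)
variable "environment" {
  type = string
}

variable "vpc_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

variable "orders_api_security_group_id" {
  type = string
}

variable "analytics_worker_security_group_id" {
  type = string
}

module "orders_db" {
  source = "../modules/rds-postgres"

  name        = "orders"
  environment = var.environment
  vpc_id      = var.vpc_id
  subnet_ids  = var.private_subnet_ids

  allowed_security_group_ids = [var.orders_api_security_group_id]
}

module "analytics_db" {
  source = "../modules/rds-postgres"

  name        = "analytics"
  environment = var.environment
  vpc_id      = var.vpc_id
  subnet_ids  = var.private_subnet_ids

  allowed_security_group_ids = [var.analytics_worker_security_group_id]

  # Override the defaults only where this database needs more.
  instance_class       = "db.r6g.large"
  allocated_storage_gb = 200
}
```

Because every resource name inside the module is derived from var.name and var.environment, the two instances do not collide.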

The plan nobody reviews

Here is the Terraform workflow at most teams:

  1. Make a change
  2. Run terraform plan
  3. Glance at the output
  4. Run terraform apply
  5. Hope

The terraform plan output is the most important artifact in the entire workflow. It’s the diff between your current infrastructure and your desired state. It tells you exactly what is going to be created, modified, or destroyed. It is the last opportunity to catch a misconfiguration before it affects production.

It is almost never read carefully.

The output is verbose, the format is not immediately intuitive, and there is social pressure to ship. So developers skim it, confirm that the high-level counts look right — “4 to add, 1 to change, 0 to destroy, looks fine” — and apply.

The one-to-change is the RDS instance getting a parameter group update that requires a reboot. The reboot will cause three minutes of database downtime. Nobody noticed because it was one line in a sea of output.

The practice that prevents this: mandatory plan review for anything that touches stateful resources, as part of the CI pipeline. Not optional. Not “we review important changes.” Every change to a database, a storage bucket, a network configuration, or an IAM policy gets a plan output attached to the pull request, and the plan is reviewed before the PR is merged.

# .github/workflows/terraform.yml
name: Terraform Plan

on:
  pull_request:
    paths:
      - 'terraform/**'

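jobs:
  plan:
    runs-on: ubuntu-latest
    # Cloud credentials for the provider must also be configured; omitted here.
    permissions:
      contents: read
      pull-requests: write # needed to post the plan as a PR comment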
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        run: terraform init
        working-directory: terraform/production

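      - name: Terraform Plan
        id: plan
        run: |
          # Without pipefail the step reports tee's exit status and hides a failed plan.
          set -o pipefail
          # -no-color keeps ANSI escape codes out of the PR comment.
          # -detailed-exitcode is left out: it exits 2 whenever changes exist,
          # which would fail this step on every non-empty plan.
          terraform plan \
            -input=false \
            -no-color \
            -out=tfplan \
            2>&1 | tee plan-output.txt
        working-directory: terraform/production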

      - name: Post plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('terraform/production/plan-output.txt', 'utf8');
            const truncated = plan.length > 60000
              ? plan.substring(0, 60000) + '\n... (truncated)'
              : plan;

            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan\n\`\`\`\n${truncated}\n\`\`\``
            });

      - name: Block on destructive changes
        run: |
          # A replacement destroys the old resource too, so match both phrasings.
          if grep -Eq "will be destroyed|must be replaced" terraform/production/plan-output.txt; then
            echo "ERROR: Plan contains destructive changes. Requires explicit approval."
            exit 1
          fi

Every plan is posted to the PR for review. Plans that contain destructive changes also fail the pipeline automatically. No apply without a reviewed plan.
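The grep check is a text match on plan output, so it is worth backing it with a guard in the configuration itself. A minimal sketch, assuming a hypothetical bucket called audit_logs: Terraform's prevent_destroy lifecycle setting turns any plan that would destroy the resource into an error.

```hcl
resource "aws_s3_bucket" "audit_logs" {
  bucket = "mycompany-audit-logs" # hypothetical name

  lifecycle {
    # Any plan that would destroy this bucket, including a
    # destroy-and-recreate replacement, fails with an error.
    prevent_destroy = true
  }
}
```

The guard only holds while the resource block stays in the configuration. Deleting the block removes the protection along with it, which is one more reason the plan still has to be read.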

The AI angle

AI is changing Terraform in the same way it’s changing application development, and the failure modes are the same.

Generating Terraform with AI assistance is easy. The syntax is well-structured, the providers are well-documented, and models produce plausible-looking configuration quickly. A developer who doesn’t know Terraform well can describe what they want and get working configuration back.

The problem: working configuration is not the same as correct configuration. AI-generated Terraform tends to have specific failure modes:

It generates resources without tags. Tags seem optional until you need to do cost allocation, security auditing, or resource cleanup and you have no way to identify which resources belong to which team or environment.

It generates IAM policies that are broader than necessary. The path of least resistance when generating IAM is to attach broad AWS managed policies (AdministratorAccess, AmazonS3FullAccess) rather than crafting minimal permissions. The result works, and it is a security problem.

It generates resources without considering deletion protection. A generated RDS instance won’t have deletion_protection = true unless you specifically ask for it, because the model doesn’t know that you’d be devastated if someone ran terraform destroy on your production database.

It generates configuration without considering the plan review consequences. A change that looks small in code can produce a destructive plan. The model doesn’t simulate the plan.

# AI-generated. Looks fine. Has problems.
resource "aws_s3_bucket" "uploads" {
  bucket = "myapp-uploads"
}

# What it should look like:
data "aws_caller_identity" "current" {}

resource "aws_s3_bucket" "uploads" {
  bucket = "myapp-${var.environment}-uploads-${data.aws_caller_identity.current.account_id}"

  tags = {
    Name        = "myapp-${var.environment}-uploads"
    Environment = var.environment
    ManagedBy   = "terraform"
    Team        = "platform"
  }
}

resource "aws_s3_bucket_versioning" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "uploads" {
  bucket                  = aws_s3_bucket.uploads.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

The bucket name includes the environment and the account ID, so it will not collide with the same bucket in another environment or account. It’s tagged. It has versioning, encryption, and public access blocking. An AI generated the first version. An engineer who understands S3 security wrote the second.
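The same treatment applies to the IAM failure mode. A sketch of what replaces AmazonS3FullAccess for a service that only reads and writes objects in this bucket, assuming the aws_s3_bucket.uploads resource and var.environment from above; the policy name and the exact action list are illustrative and depend on what the application really does.

```hcl
data "aws_iam_policy_document" "uploads_rw" {
  # Object-level access, limited to this one bucket.
  statement {
    sid       = "ObjectReadWrite"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.uploads.arn}/*"]
  }

  # Listing is a bucket-level action, so it targets the bucket ARN itself.
  statement {
    sid       = "ListBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.uploads.arn]
  }
}

resource "aws_iam_policy" "uploads_rw" {
  name   = "myapp-${var.environment}-uploads-rw" # hypothetical name
  policy = data.aws_iam_policy_document.uploads_rw.json
}
```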

Use AI to generate the first draft. Review it like you review any other infrastructure change — which means understanding what every resource does, why every parameter is set the way it is, and what the plan will look like before you apply it.

What good Terraform looks like in practice

The teams whose Terraform codebases I’ve looked at and felt genuinely comfortable with share a few characteristics.

The state is remote and locked from the first commit. The infrastructure is layered by change rate, with separate state files per layer. Variables have validation. Modules are actually reusable — you can search the codebase for module " and find the same module instantiated multiple times in multiple contexts. Every resource has consistent tags that include at minimum the environment, the team, and a ManagedBy = "terraform" tag that makes it immediately obvious whether a resource should be touched by a human or left to Terraform.
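Consistent tagging is easier to keep when it does not depend on every resource block remembering it. One way to get there, sketched here under the assumption that the AWS provider's default_tags block is available in your provider version; the Team value is a placeholder.

```hcl
variable "environment" {
  type = string
}

provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource this provider creates.
  default_tags {
    tags = {
      Environment = var.environment
      Team        = "platform" # placeholder
      ManagedBy   = "terraform"
    }
  }
}
```

Resource-level tags still work alongside this. They are merged with the defaults, and the resource's own value wins when the same key appears in both.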

Plans are reviewed in CI. Destructive changes require explicit acknowledgment. The pipeline applies only what has been reviewed and approved, not whatever the current local state of someone’s checkout happens to be.

And crucially: the team can explain every resource in the codebase. Not necessarily from memory, but from reading. The configuration is written for the human reader as much as for Terraform. Resource names make sense. Comments exist where the reasoning isn’t obvious. The person who joins the team six months from now can read the Terraform and understand what was built and why.

That last part is the real promise of infrastructure as code. Not that the infrastructure is automated — everything can be scripted. That the infrastructure is understandable. That the decisions are recorded in a form that can be reviewed, questioned, and improved.

Most Terraform codebases don’t deliver that. They deliver automation without understanding, which is its own kind of fragility.

The tool is not the discipline. Getting Terraform right requires both.