Everyone Is Writing Terraform. Almost Nobody Is Writing It Well.
Terraform has won. That much is settled. If your team is provisioning cloud infrastructure in 2026 and you’re not using some form of infrastructure as code, you’re in the minority. The tooling has matured, the providers cover everything, the community is enormous, and the pitch — your infrastructure is code, so it’s version-controlled, reviewable, reproducible, and safe — is genuinely compelling.

The problem is that most teams get the tool without getting the discipline. They write Terraform that provisions infrastructure but doesn’t deliver on the actual promise: that you can understand what you have, trust what you’re changing, and recover when something goes wrong.
I’ve looked at a lot of Terraform codebases. The good ones are rarer than they should be. The bad ones share the same failure patterns, in the same order of discovery, with the same consequences.
This is what those patterns look like, why they develop, and what the alternative is.
The state file problem nobody talks about enough
Everything in Terraform depends on the state file. It’s the source of truth
for what Terraform believes exists in your infrastructure. When you run
terraform plan, Terraform compares the state file against your
configuration and your actual cloud resources to figure out what needs to
change. When state is wrong, everything downstream is wrong.
Most teams start with local state. It’s the default. You run terraform init
and your first terraform apply, a terraform.tfstate file appears in the
working directory, and everything works. Until it doesn’t.
Local state means the state file lives on someone’s machine. The first time
a second person runs Terraform, they don’t have the state file, so Terraform
thinks nothing exists, and tries to create everything from scratch. This is
a terrifying moment that usually involves a lot of terraform import commands
and a conversation about why this wasn’t set up correctly from the start.
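For what that recovery can look like on Terraform 1.5 or later, here is a minimal sketch using a declarative import block instead of the terraform import command. The resource address and the instance identifier are hypothetical, and a matching resource block for the existing database is assumed to already be written elsewhere in the configuration.

```hcl
# Sketch only: adopt an existing RDS instance into state without recreating it.
# Requires Terraform 1.5+. Both names below are hypothetical placeholders.
import {
  # Address of the resource block that describes the existing instance
  to = aws_db_instance.main

  # For aws_db_instance, the import ID is the DB instance identifier
  id = "myapp-production"
}
```

The next terraform plan then shows the import as a planned action, which means it can be reviewed like any other change before it touches state.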
Remote state solves this. But remote state without locking introduces a new
problem: two people running terraform apply at the same time can corrupt the
state file, or silently overwrite each other’s changes, because both are
writing to it without coordination. S3 without DynamoDB locking, or any remote
backend without locking, is worse than local state in some ways because it
creates the illusion of safety while being corruptible.
The correct setup from day one:
# backend.tf
terraform {
  backend "s3" {
    bucket  = "mycompany-terraform-state"
    key     = "production/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true

    # The part most people forget:
    dynamodb_table = "mycompany-terraform-locks"
    # On Terraform 1.10 and later, use_lockfile = true enables S3-native
    # locking instead, and newer releases deprecate dynamodb_table.
  }
}

# The DynamoDB table for locking (bootstrap this manually or with a
# separate Terraform configuration before anything else)
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "mycompany-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
State file in S3, encrypted at rest. Locking via DynamoDB. This is not advanced Terraform. This is the minimum viable setup for a team of more than one person. It should be the first thing you configure, before you write a single resource.
Almost nobody does this. Most teams migrate to remote state after the first incident where someone runs Terraform from the wrong machine and tries to recreate the entire production database.
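The state bucket itself also has to come from somewhere. A minimal sketch of a bootstrap configuration, assuming the bucket name from the backend block above; the resource labels are illustrative. Versioning is the part that matters most here, because it lets you roll back to an earlier state file if one is ever damaged.

```hcl
# bootstrap/main.tf (sketch; applied once with local state, rarely touched after)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"
}

# Keep every previous version of the state file so a bad write is recoverable
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# State files often contain secrets; the bucket should never be public
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```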
The monolith problem
A team starts with one Terraform directory. It has a VPC, some EC2 instances, an RDS database, some S3 buckets, a few IAM roles. Perfectly manageable.
Six months later, the same directory has a hundred and fifty resources. Running
terraform plan takes eight minutes because Terraform is refreshing the state
of every resource against the live cloud account. A junior developer makes a
change to a security group and runs terraform apply. The plan contains the
security group change, but it also contains a diff on the RDS instance,
because someone changed a parameter in the meantime. Now you have an
unexpected database change bundled into a security group deployment.
The blast radius of any Terraform apply is the entire state file. In a monolith, that’s everything.
The fix is state isolation. Different pieces of infrastructure that have different change rates and different risk profiles should live in different state files. The structure I’ve found most useful:
terraform/
├── bootstrap/        # The state backend itself. Applied once, rarely touched.
│   ├── main.tf
│   └── backend.tf
│
├── networking/       # VPC, subnets, routing. Changes rarely.
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── backend.tf
│
├── data/             # RDS, ElastiCache, S3. Changes carefully.
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── backend.tf
│
├── compute/          # EC2, ECS, Lambda. Changes regularly.
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── backend.tf
│
└── application/      # App-specific resources. Changes frequently.
    ├── main.tf
    ├── variables.tf
    ├── outputs.tf
    └── backend.tf
Each directory has its own state file. A change to the application layer
doesn’t touch the networking state. A terraform plan in compute/ takes
thirty seconds instead of eight minutes because it’s only refreshing compute
resources. The blast radius of any apply is scoped to the layer being changed.
Layers communicate through data sources that read remote state:
# In compute/main.tf: reading outputs from the networking layer
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "api" {
  ami           = var.ami_id
  instance_type = "t3.medium"

  # Using the VPC subnet from the networking layer
  subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}
This setup requires more upfront thought about your architecture. In return, it prevents the sprawl that makes monolith Terraform increasingly dangerous over time.
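The other half of that contract lives in the layer being read: a remote state data source can only see what the producing layer declares as outputs. A sketch of what the networking layer might export, assuming it defines a VPC named aws_vpc.main and count-based private subnets named aws_subnet.private (both names are hypothetical, as is the vpc_id output).

```hcl
# networking/outputs.tf (sketch; resource names are assumptions)
output "vpc_id" {
  description = "ID of the shared VPC."
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "IDs of the private subnets, consumed by the compute layer."
  value       = aws_subnet.private[*].id
}
```

Treat these outputs as a public interface. Renaming or removing one breaks every layer that reads it, so they deserve the same care as an API.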
Variables that aren’t variables
Here is a pattern I see constantly:
# variables.tf
variable "environment" {
  description = "Environment name"
  type        = string
}

variable "db_instance_class" {
  description = "RDS instance class"
  type        = string
  default     = "db.t3.medium"
}

variable "db_allocated_storage" {
  description = "RDS allocated storage in GB"
  type        = number
  default     = 20
}
And then:
# terraform.tfvars
environment = "production"
db_instance_class = "db.t3.medium"
db_allocated_storage = 20
The variables are just configuration buried one level deeper. They’re not
actually variable — they’re hardcoded values that live in a .tfvars file
instead of directly in the resource block. The abstraction adds cognitive
overhead without adding flexibility.
Real variables have validation. They constrain what’s allowed, which prevents misconfiguration:
variable "environment" {
  description = "Deployment environment"
  type        = string

  validation {
    condition     = contains(["development", "staging", "production"], var.environment)
    error_message = "environment must be development, staging, or production."
  }
}

variable "db_instance_class" {
  description = "RDS instance class. Must be from the approved list."
  type        = string

  validation {
    condition = contains([
      "db.t3.micro",
      "db.t3.medium",
      "db.r6g.large",
      "db.r6g.xlarge"
    ], var.db_instance_class)
    error_message = "db_instance_class must be from the approved instance types."
  }
}

variable "db_allocated_storage" {
  description = "RDS allocated storage in GB. Must be between 20 and 1000."
  type        = number

  validation {
    condition     = var.db_allocated_storage >= 20 && var.db_allocated_storage <= 1000
    error_message = "db_allocated_storage must be between 20 and 1000 GB."
  }
}
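Now terraform plan rejects the misconfiguration with a clear error before it
proposes a single resource change. (terraform validate on its own is not
enough here, because it checks the configuration without loading your variable
values.) Someone cannot accidentally deploy a db.r5.24xlarge instance because
they copy-pasted from a different config. The validation is documentation that
executes.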
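Validation pays off most when the values genuinely differ between environments. A sketch of per-environment variable files, assuming a hypothetical environments/ directory next to the configuration; the specific sizes are illustrative and chosen to pass the rules above.

```hcl
# environments/staging.tfvars (hypothetical path and values)
environment          = "staging"
db_instance_class    = "db.t3.micro"
db_allocated_storage = 20

# environments/production.tfvars would be a separate file, for example:
#   environment          = "production"
#   db_instance_class    = "db.r6g.large"
#   db_allocated_storage = 200
```

Each environment is then selected explicitly, for example with terraform plan -var-file=environments/staging.tfvars, and a typo such as "prod" fails the environment check instead of creating a new, misnamed set of resources.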
Modules that aren’t reusable
Terraform modules are the abstraction mechanism. A module should take inputs, provision related resources, and return outputs. It should be usable in multiple contexts without modification.
Most Terraform modules I’ve seen are not reusable. They’re extractions — the
team took a chunk of configuration out of the main file and put it in a
subdirectory called modules/. The inputs are so specific to the current use
case that the module can’t be used anywhere else without editing it. The
outputs expose the wrong things. The internal resource naming assumes it’ll
only ever be instantiated once.
A well-written module:
# modules/rds-postgres/variables.tf
variable "name" {
  description = "Name prefix for all resources. Must be unique within the account."
  type        = string
}

variable "environment" {
  type = string
}

variable "vpc_id" {
  description = "VPC where the database will be created."
  type        = string
}

variable "subnet_ids" {
  description = "Subnets for the DB subnet group. Should be private subnets."
  type        = list(string)
}

variable "allowed_security_group_ids" {
  description = "Security groups allowed to connect to the database."
  type        = list(string)
}

variable "instance_class" {
  type    = string
  default = "db.t3.medium"
}

variable "allocated_storage_gb" {
  type    = number
  default = 20
}

variable "deletion_protection" {
  description = "Prevent accidental deletion. Should be true in production."
  type        = bool
  default     = true
}
# modules/rds-postgres/main.tf
resource "aws_db_subnet_group" "this" {
  name       = "${var.name}-${var.environment}"
  subnet_ids = var.subnet_ids

  tags = {
    Name        = "${var.name}-${var.environment}"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_security_group" "this" {
  name        = "${var.name}-${var.environment}-rds"
  description = "Controls access to ${var.name} RDS instance"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = var.allowed_security_group_ids
  }

  tags = {
    Name        = "${var.name}-${var.environment}-rds"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
resource "aws_db_instance" "this" {
  identifier              = "${var.name}-${var.environment}"
  engine                  = "postgres"
  engine_version          = "16"
  instance_class          = var.instance_class
  allocated_storage       = var.allocated_storage_gb
  max_allocated_storage   = var.allocated_storage_gb * 4
  db_name                 = replace(var.name, "-", "_")
  username                = "dbadmin"
  password                = random_password.this.result
  db_subnet_group_name    = aws_db_subnet_group.this.name
  vpc_security_group_ids  = [aws_security_group.this.id]
  backup_retention_period = var.environment == "production" ? 7 : 1
  deletion_protection     = var.deletion_protection
  storage_encrypted       = true

  # Production keeps a final snapshot on deletion, and a snapshot needs a name.
  # The "-final" suffix is an assumed convention.
  skip_final_snapshot       = var.environment != "production"
  final_snapshot_identifier = "${var.name}-${var.environment}-final"

  tags = {
    Name        = "${var.name}-${var.environment}"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
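resource "random_password" "this" {
  length  = 32
  special = false
}

# Store the generated password in Secrets Manager so that the
# password_secret_arn output in outputs.tf has a resource to reference.
# The secret name is an assumed convention.
resource "aws_secretsmanager_secret" "db_password" {
  name = "${var.name}-${var.environment}-db-password"
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = random_password.this.result
}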
# modules/rds-postgres/outputs.tf
output "endpoint" {
  description = "Connection endpoint for the RDS instance."
  value       = aws_db_instance.this.endpoint
  sensitive   = true
}

output "port" {
  value = aws_db_instance.this.port
}

output "database_name" {
  value = aws_db_instance.this.db_name
}

output "security_group_id" {
  description = "Security group ID. Reference this to allow access from other resources."
  value       = aws_security_group.this.id
}

output "password_secret_arn" {
  description = "ARN of the Secrets Manager secret containing the database password."
  value       = aws_secretsmanager_secret.db_password.arn
  sensitive   = true
}
This module can be instantiated multiple times, for different services, in different environments, without modification. The naming is parameterised. The security is environment-aware. The inputs and outputs are general enough to be useful in contexts the original author didn’t anticipate.
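To make the reuse concrete, here is a sketch of two instantiations of that module from a data layer. The service names, the relative source path, the two application security groups, and the vpc_id output on the networking layer are all assumptions for illustration.

```hcl
# data/main.tf (sketch; names and the vpc_id output are hypothetical)
module "orders_db" {
  source = "../modules/rds-postgres"

  name                       = "orders"
  environment                = var.environment
  vpc_id                     = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids                 = data.terraform_remote_state.networking.outputs.private_subnet_ids
  allowed_security_group_ids = [aws_security_group.orders_api.id]
}

module "billing_db" {
  source = "../modules/rds-postgres"

  name                       = "billing"
  environment                = var.environment
  vpc_id                     = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids                 = data.terraform_remote_state.networking.outputs.private_subnet_ids
  allowed_security_group_ids = [aws_security_group.billing_api.id]

  # Override the defaults only where this service genuinely differs
  instance_class       = "db.r6g.large"
  allocated_storage_gb = 100
}
```

Because every resource name inside the module is derived from name and environment, the two instances cannot collide, which is exactly the property that extracted-but-not-reusable modules lack.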
The plan nobody reviews
Here is the Terraform workflow at most teams:
- Make a change
- Run terraform plan
- Glance at the output
- Run terraform apply
- Hope
The terraform plan output is the most important artifact in the entire
workflow. It’s the diff between your current infrastructure and your desired
state. It tells you exactly what is going to be created, modified, or
destroyed. It is the last opportunity to catch a misconfiguration before it
affects production.
It is almost never read carefully.
The output is verbose, the format is not immediately intuitive, and there is social pressure to ship. So developers skim it, confirm that the high-level counts look right — “4 to add, 1 to change, 0 to destroy, looks fine” — and apply.
The one-to-change is the RDS instance getting a parameter group update that requires a reboot. The reboot will cause three minutes of database downtime. Nobody noticed because it was one line in a sea of output.
The practice that prevents this: mandatory plan review for anything that touches stateful resources, as part of the CI pipeline. Not optional. Not “we review important changes.” Every change to a database, a storage bucket, a network configuration, or an IAM policy gets a plan output attached to the pull request, and the plan is reviewed before the PR is merged.
# .github/workflows/terraform.yml
name: Terraform Plan

on:
  pull_request:
    paths:
      - 'terraform/**'

# Needed so the job can comment on the pull request
permissions:
  contents: read
  pull-requests: write

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        run: terraform init
        working-directory: terraform/production

      - name: Terraform Plan
        id: plan
        shell: bash
        run: |
          # -detailed-exitcode returns 0 (no changes), 1 (error), 2 (changes).
          # Only 1 should fail the step, so read terraform's own exit code
          # rather than the exit code of tee.
          set +e
          terraform plan \
            -no-color \
            -out=tfplan \
            -detailed-exitcode \
            2>&1 | tee plan-output.txt
          code=${PIPESTATUS[0]}
          if [ "$code" -eq 1 ]; then exit 1; fi
        working-directory: terraform/production

      - name: Post plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('terraform/production/plan-output.txt', 'utf8');
            const truncated = plan.length > 60000
              ? plan.substring(0, 60000) + '\n... (truncated)'
              : plan;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan\n\`\`\`\n${truncated}\n\`\`\``
            });

      - name: Block on destructive changes
        run: |
          # A replacement destroys the old resource too, so catch both phrasings
          if grep -qE "will be destroyed|must be replaced" terraform/production/plan-output.txt; then
            echo "ERROR: Plan contains destructive changes. Requires explicit approval."
            exit 1
          fi
Destructive changes fail the pipeline automatically. Non-destructive changes post the full plan to the PR for review. No apply without a reviewed plan.
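A pipeline check is one layer of defence; the configuration can carry a second one. As a sketch, the prevent_destroy lifecycle setting makes Terraform refuse to produce any plan that would destroy the resource. The resource name here is hypothetical and its arguments are omitted.

```hcl
# Sketch: guard a stateful resource inside the configuration itself
resource "aws_db_instance" "orders" {
  # ... instance arguments omitted for brevity

  lifecycle {
    # Any plan that would destroy or replace this instance fails with an error
    prevent_destroy = true
  }
}
```

One caveat: the guard only works while the resource block exists. Deleting the block from the configuration removes the guard with it, so it complements plan review and does not replace it.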
The AI angle
AI is changing Terraform in the same way it’s changing application development, and the failure modes are the same.
Generating Terraform with AI assistance is easy. The syntax is well-structured, the providers are well-documented, and models produce plausible-looking configuration quickly. A developer who doesn’t know Terraform well can describe what they want and get working configuration back.
The problem: working configuration is not the same as correct configuration. AI-generated Terraform tends to have specific failure modes:
It generates resources without tags. Tags seem optional until you need to do cost allocation, security auditing, or resource cleanup and you have no way to identify which resources belong to which team or environment.
It generates IAM policies that are broader than necessary. The path of least
resistance when generating IAM is to use managed policies (AdministratorAccess,
AmazonS3FullAccess) rather than crafting minimal permissions. This works and
is a security problem.
It generates resources without considering deletion protection. A generated
RDS instance won’t have deletion_protection = true unless you specifically
ask for it, because the model doesn’t know that you’d be devastated if someone
ran terraform destroy on your production database.
It generates configuration without considering the plan review consequences. A change that looks small in code can produce a destructive plan. The model doesn’t simulate the plan.
# AI-generated. Looks fine. Has problems.
resource "aws_s3_bucket" "uploads" {
  bucket = "myapp-uploads"
}

# What it should look like:
# (this lookup supplies the account ID used in the bucket name below)
data "aws_caller_identity" "current" {}

resource "aws_s3_bucket" "uploads" {
  bucket = "myapp-${var.environment}-uploads-${data.aws_caller_identity.current.account_id}"

  tags = {
    Name        = "myapp-${var.environment}-uploads"
    Environment = var.environment
    ManagedBy   = "terraform"
    Team        = "platform"
  }
}
resource "aws_s3_bucket_versioning" "uploads" {
  bucket = aws_s3_bucket.uploads.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "uploads" {
  bucket = aws_s3_bucket.uploads.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "uploads" {
  bucket                  = aws_s3_bucket.uploads.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
The bucket name is unique across accounts. It’s tagged. It has versioning, encryption, and public access blocking. An AI generated the first version. An engineer who understands S3 security wrote the second.
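The IAM failure mode has the same shape. Where a generated draft would attach AmazonS3FullAccess, a hand-reviewed version might look like the sketch below. It assumes the application only needs to read and write objects in the uploads bucket; the policy name and the exact action list are illustrative and should be derived from what the application really does.

```hcl
# Sketch: a scoped policy in place of a broad managed policy
data "aws_iam_policy_document" "uploads_rw" {
  statement {
    sid    = "ObjectReadWrite"
    effect = "Allow"

    # Only the object-level actions the application is assumed to need
    actions = ["s3:GetObject", "s3:PutObject"]

    # Only objects inside this one bucket
    resources = ["${aws_s3_bucket.uploads.arn}/*"]
  }
}

resource "aws_iam_policy" "uploads_rw" {
  name   = "myapp-${var.environment}-uploads-rw"
  policy = data.aws_iam_policy_document.uploads_rw.json
}
```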
Use AI to generate the first draft. Review it like you review any other infrastructure change — which means understanding what every resource does, why every parameter is set the way it is, and what the plan will look like before you apply it.
What good Terraform looks like in practice
The teams whose Terraform codebases I’ve looked at and felt genuinely comfortable with share a few characteristics.
The state is remote and locked from the first commit. The infrastructure is
layered by change rate, with separate state files per layer. Variables have
validation. Modules are actually reusable — you can search the codebase for
module " and find the same module instantiated multiple times in multiple
contexts. Every resource has consistent tags that include at minimum the
environment, the team, and a ManagedBy = "terraform" tag that makes it
immediately obvious whether a resource should be touched by a human or left
to Terraform.
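One way to get consistent tags without repeating the same map on every resource is the AWS provider's default_tags block, sketched here. The region and the Team value are assumptions; tags set on an individual resource are merged on top of these defaults.

```hcl
# Sketch: apply baseline tags to every taggable resource from this provider
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Environment = var.environment
      Team        = "platform"
      ManagedBy   = "terraform"
    }
  }
}
```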
Plans are reviewed in CI. Destructive changes require explicit acknowledgment. The pipeline applies only what has been reviewed and approved, not whatever the current local state of someone’s checkout happens to be.
And crucially: the team can explain every resource in the codebase. Not necessarily from memory, but from reading. The configuration is written for the human reader as much as for Terraform. Resource names make sense. Comments exist where the reasoning isn’t obvious. The person who joins the team six months from now can read the Terraform and understand what was built and why.
That last part is the real promise of infrastructure as code. Not that the infrastructure is automated — everything can be scripted. That the infrastructure is understandable. That the decisions are recorded in a form that can be reviewed, questioned, and improved.
Most Terraform codebases don’t deliver that. They deliver automation without understanding, which is its own kind of fragility.
The tool is not the discipline. Getting Terraform right requires both.