Terraform State Management at Scale

Aug 25, 2021

Terraform state is deceptively simple until you have multiple teams, dozens of repositories, and hundreds of resources. Then it becomes your biggest operational challenge.

The Problem

Local state files don’t scale. The moment two people run terraform apply simultaneously, you have a race condition. Storing state in Git seems clever until someone commits credentials embedded in the state file.

Remote state backends solve the concurrency problem but introduce new ones. A single state file for all infrastructure means slow plans, risky applies, and everyone waiting on everyone else. One team’s change can unexpectedly affect another’s resources.

State file corruption is rare but catastrophic when it happens. We experienced it once; recovery took days and left us questioning our whole approach.

Our Solution

State isolation by component became our organising principle. Each logical unit of infrastructure got its own state file. Networking separate from compute. Shared services separate from application infrastructure. This meant smaller blast radii and faster operations.

S3 backend with DynamoDB locking provided the remote state foundation. Every state file lived in a dedicated path with consistent naming:

s3://company-terraform-state/
├── networking/
│   └── terraform.tfstate
├── eks-cluster/
│   └── terraform.tfstate
└── applications/
    ├── service-a/
    │   └── terraform.tfstate
    └── service-b/
        └── terraform.tfstate
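
As a rough sketch, the backend block for the networking component might look like the following. The bucket name and key come from the layout above; the region and the lock table name are assumptions made for illustration, not values from our actual setup.

```hcl
terraform {
  backend "s3" {
    # Bucket and key follow the path layout shown above.
    bucket = "company-terraform-state"
    key    = "networking/terraform.tfstate"

    # Assumed region; use whichever region hosts the bucket.
    region = "eu-west-1"

    # Hypothetical table name. The S3 backend expects the table to have
    # a string partition key named "LockID".
    dynamodb_table = "terraform-state-locks"

    # Encrypt the state object at rest, since state can contain secrets.
    encrypt = true
  }
}
```

Each component repeats this block with only the key changed, which is what keeps the naming consistent across repositories.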

Data sources replaced hardcoded references between state files. Rather than passing values manually, components queried what they needed.
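
One way to do this is the built-in terraform_remote_state data source, sketched below for an application component reading from the networking state. It assumes the networking configuration declares an output named vpc_id; that output, the region, and the security group are hypothetical and only there to show the shape of the lookup.

```hcl
# Read the outputs published by the networking component's state.
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "company-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "eu-west-1" # assumed region
  }
}

# Hypothetical consumer: a security group placed in the shared VPC.
resource "aws_security_group" "service_a" {
  name   = "service-a"
  vpc_id = data.terraform_remote_state.networking.outputs.vpc_id
}
```

Only root-level outputs of the other configuration are visible this way, so the producing component controls exactly what it exposes. Provider data sources that look resources up by tag or name are an alternative when reading another team's state is not desirable.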

Automated state backups ran daily. Versioning on the S3 bucket provided point-in-time recovery options.
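
Enabling versioning on the state bucket can be sketched as below. This assumes AWS provider v4 or later, where versioning is a standalone resource; older provider versions configured it inline on the bucket resource instead.

```hcl
# Keep every previous version of each state object so an earlier
# state file can be restored after a bad write.
resource "aws_s3_bucket_versioning" "state" {
  bucket = "company-terraform-state"

  versioning_configuration {
    status = "Enabled"
  }
}
```

The bucket itself is best managed from a small, separate bootstrap configuration, because the backend cannot store state in a bucket that does not exist yet.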

Terraform workspaces handled environment separation where appropriate, though we preferred separate state files for production versus non-production.
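
Where workspaces are used, a configuration can branch on the built-in terraform.workspace value. The sketch below is illustrative only: the workspace names and instance sizes are assumptions, not our real settings.

```hcl
locals {
  # Name of the currently selected workspace, e.g. "dev" or "staging".
  environment = terraform.workspace

  # Hypothetical per-environment sizing.
  instance_type_by_env = {
    dev     = "t3.small"
    staging = "t3.medium"
  }

  # Fall back to the smallest size for any workspace not listed above.
  instance_type = lookup(local.instance_type_by_env, local.environment, "t3.small")
}
```

With the S3 backend, every workspace shares the same bucket and backend configuration. That shared access is one reason to keep production in a fully separate state file.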

The Benefits

Teams work independently. Changes to the networking layer don’t block application deployments. Each terraform plan completes quickly because it’s examining a focused scope.

Blast radius is contained. A mistake in one state file can’t corrupt another. Recovery involves one component rather than the entire infrastructure.

Onboarding is simpler. New team members can understand a small, focused Terraform configuration. Nobody needs to comprehend the entire infrastructure graph to make changes.

State management isn’t glamorous, but getting it right prevents countless hours of incident response and frustrated debugging.