The Challenge
Early-stage startups face a paradox: they need to move fast, but every shortcut in infrastructure becomes expensive technical debt within six months. Manual deployments lead to inconsistency between environments, "works on my machine" bugs, and late-night firefighting when production breaks differently from staging. The teams we work with typically have a few engineers, no dedicated DevOps person, and a product that's growing faster than their deployment process.
The cost of fixing infrastructure debt compounds quickly. An engineer spending two hours a week on manual deployment tasks loses roughly 100 hours a year, and that number only grows as the team scales. More critically, without environment parity, each deployment is a gamble: the thing that worked in staging may not work in production because the configurations diverged months ago.
The goal isn't to over-engineer. A two-person startup doesn't need Kubernetes, service meshes, or a multi-region active-active setup. What they need is something simple enough to maintain and solid enough to scale to Series A without a full infrastructure rebuild. That's the bar we build to.
Signs You Need This
- Deployments happen via SSH or manual console clicks rather than an automated pipeline
- Your staging environment differs from production in ways nobody fully understands
- New engineers take more than a day to get a working local environment that mirrors production
- Infrastructure changes aren't tracked in version control — nobody knows what was changed when
- You've had a production incident caused by a configuration difference between environments
- Your CTO or a senior investor has flagged infrastructure as a due diligence concern
How We Approach It
Discovery & Architecture Mapping
We start by understanding your application: language, framework, how it scales horizontally, what it connects to (databases, queues, third-party APIs), and what your team is comfortable operating. We review your current AWS account, map existing resources, and identify what needs to be imported into Terraform vs. rebuilt. We don't impose a Kubernetes cluster on a two-person team shipping a Rails app — we recommend ECS Fargate if container orchestration is needed and Lambda if it isn't.
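For resources worth keeping rather than rebuilding, the import step can be written declaratively. This is a minimal sketch, assuming Terraform 1.5 or later (the release that introduced import blocks) and a hypothetical existing bucket called example-uploads-bucket; the bucket and resource names are illustrative only.

```hcl
# Hypothetical example: adopt an existing, console-created S3 bucket.
# Requires Terraform >= 1.5 for the import block.
import {
  to = aws_s3_bucket.uploads
  id = "example-uploads-bucket" # assumed name of the existing bucket
}

# The resource block must describe the bucket as it exists today;
# terraform plan will show any differences before anything changes.
resource "aws_s3_bucket" "uploads" {
  bucket = "example-uploads-bucket"
}
```

Running terraform plan with this in place previews the import, so a mismatch between code and reality shows up before any apply.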
IaC Foundation with Terraform
Every resource is defined as code — nothing is clicked into existence in the console. We use a module-per-concern structure with remote state in S3 + DynamoDB locking. The layout we provision on day one:
infra/
├── environments/
│   ├── dev/
│   │   ├── main.tf            # calls shared modules
│   │   ├── backend.tf         # S3 state bucket + DynamoDB lock table
│   │   └── terraform.tfvars   # instance sizes, replica counts, domain
│   ├── staging/               # identical structure, different values
│   └── prod/
└── modules/
    ├── networking/            # VPC, subnets, NACLs, security groups, NAT
    ├── compute/               # ECS cluster, task defs, ALB, auto-scaling
    └── data/                  # RDS (Multi-AZ), ElastiCache, S3 buckets
Environments share the same modules and differ only in .tfvars. This makes drift structurally harder — a prod change runs through the same Terraform plan as staging, just with different variable inputs. If it breaks staging, it never reaches prod.
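To make the layout concrete, here is a sketch of what one environment directory might contain. The bucket name, lock table name, and module input names (environment, task_cpu, desired_count) are assumptions for illustration, not the actual module interface.

```hcl
# environments/staging/backend.tf
# Each environment gets its own state bucket and lock table (names are hypothetical).
terraform {
  backend "s3" {
    bucket         = "example-tfstate-staging"
    key            = "staging/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "example-tf-locks-staging"
    encrypt        = true
  }
}

# environments/staging/main.tf
# Values come from terraform.tfvars; only these inputs differ between environments.
variable "task_cpu" {
  type = number
}

variable "desired_count" {
  type = number
}

module "compute" {
  source        = "../../modules/compute" # same module source in dev, staging, and prod
  environment   = "staging"
  task_cpu      = var.task_cpu
  desired_count = var.desired_count
}
```

Because prod's main.tf would call the same module source, a change to modules/compute shows up in the staging plan before it can appear in the prod plan.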
Environment Strategy (Dev / Staging / Prod)
We create three environments that are structurally identical — same Terraform modules, different variable files. Staging mirrors production in terms of database engine version, ECS task sizing, and security group rules. Prod gets slightly more conservative resource limits and stricter change-management policies. We enforce environment parity from day one so there are no surprises at release time, and we document the intentional differences so future engineers know what to expect.
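As an illustration of documented, intentional differences, the two variable files might look like the following. Every variable name and value here is hypothetical; the point is only that parity fields match and the differences are few and commented.

```hcl
# environments/staging/terraform.tfvars (hypothetical values)
db_engine_version        = "16.4" # parity: same engine version as prod
task_cpu                 = 512    # parity: same ECS task sizing as prod
task_memory              = 1024
desired_count            = 2
db_deletion_protection   = false  # intentional difference: staging can be torn down
db_backup_retention_days = 7

# environments/prod/terraform.tfvars (hypothetical values)
db_engine_version        = "16.4"
task_cpu                 = 512
task_memory              = 1024
desired_count            = 3      # intentional difference: more headroom
db_deletion_protection   = true   # intentional difference: stricter change policy
db_backup_retention_days = 30
```

A plain diff of the two files then doubles as the list of intentional differences.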
CI/CD Pipeline with GitHub Actions
Test → Build → ECR push → deploy to staging → manual gate → deploy to prod. Each stage gates the next. AWS authentication uses OIDC federation — no long-lived secrets stored in GitHub. Here's the actual pipeline structure we ship:
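name: CI / Deploy
on:
  push: { branches: [main] }
  pull_request:
env:
  ECR_REPO: 123456789012.dkr.ecr.ap-south-1.amazonaws.com/app  # placeholder account ID
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
  build-push:
    needs: test
    if: github.ref == 'refs/heads/main'  # pull requests only run the test job
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC: no stored AWS access keys
      contents: read
    steps:
      - uses: actions/checkout@v4  # docker build needs the source tree
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
          aws-region: ap-south-1
      - run: |
          # Log in to the registry host (the part of ECR_REPO before the first slash)
          aws ecr get-login-password | docker login --username AWS \
            --password-stdin "${ECR_REPO%%/*}"
          docker build -t "$ECR_REPO:${{ github.sha }}" .
          docker push "$ECR_REPO:${{ github.sha }}"
  deploy-staging:
    needs: build-push
    if: github.ref == 'refs/heads/main'
    environment: staging
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # each job needs its own OIDC credentials
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
          aws-region: ap-south-1
      # Note: --force-new-deployment restarts tasks on the current task definition.
      # Registering a revision that points at the new image tag is omitted here.
      - run: |
          aws ecs update-service --cluster app-staging \
            --service api --force-new-deployment
  deploy-prod:
    needs: deploy-staging
    environment: production   # requires named reviewer to approve
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
          aws-region: ap-south-1
      - run: |
          aws ecs update-service --cluster app-prod \
            --service api --force-new-deployment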
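The environment: production gate in GitHub requires a named reviewer to approve before the prod job runs. The role behind role-to-assume is scoped with an IAM condition on the OIDC token's sub claim, so it only trusts this repository's main branch (jobs that declare an environment present an environment-based sub instead, which the condition must also list). A workflow run on a feature branch cannot assume the role, so it cannot trigger a production deployment.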
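The branch scoping described above lives in the role's trust policy. The sketch below assumes the GitHub OIDC provider is already registered in the account and uses a hypothetical repository, example-org/example-app; the role name is also illustrative. Jobs that set environment: present a sub claim of the form repo:ORG/REPO:environment:NAME rather than a branch ref, so those values are listed too.

```hcl
# Look up the GitHub OIDC provider (assumed to already exist in the account).
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only these subjects may assume the role (repository name is hypothetical).
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values = [
        "repo:example-org/example-app:ref:refs/heads/main",
        "repo:example-org/example-app:environment:staging",
        "repo:example-org/example-app:environment:production",
      ]
    }
  }
}

resource "aws_iam_role" "github_deploy" {
  name               = "github-deploy" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
```

For the environment entries to stay branch-scoped, the staging and production environments in GitHub would also need their deployment branches restricted to main; that setting lives in GitHub, not in this policy.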
Observability from Day One
We build CloudWatch dashboards for CPU, memory, and request latency per service. Each service gets its own log group with structured JSON logging, so logs are queryable by request ID, user ID, or error type. Alarms for error rate spikes, deployment failures, and database connection exhaustion are routed to Slack via SNS. You shouldn't need to SSH into a server to understand why something is slow; the dashboard should tell you. We set retention policies on log groups from the start to avoid unbounded CloudWatch costs.
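A minimal sketch of the log retention and alarm pieces in Terraform follows. The resource names, log group path, and alarm threshold are assumptions chosen for illustration, and the SNS-to-Slack delivery (for example a subscription handled by a chat integration) is left out.

```hcl
variable "alb_arn_suffix" {
  type        = string
  description = "ARN suffix of the service's ALB (hypothetical input)"
}

# Per-service log group with an explicit retention period.
resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/app-prod/api" # hypothetical naming scheme
  retention_in_days = 30
}

# Topic that alarms publish to; Slack delivery is configured separately.
resource "aws_sns_topic" "alerts" {
  name = "app-prod-alerts"
}

# Alarm on a sustained spike of 5xx responses from the targets behind the ALB.
resource "aws_cloudwatch_metric_alarm" "api_5xx" {
  alarm_name          = "app-prod-api-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 10 # illustrative; tune to real traffic
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching" # no traffic should not page anyone

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
```

Setting treat_missing_data to notBreaching is one reasonable choice for low-traffic services, where this metric is simply absent during quiet periods.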
Architecture at a Glance
What the target-state AWS infrastructure looks like — the resources Terraform provisions and how traffic flows through them:
Internet
└── Route 53 → CloudFront (WAF + CDN)
    └── ALB (HTTPS only, HTTP 301 redirect)
        └── ECS Fargate (min 2 tasks, spread across AZs)
            ├── app:$GITHUB_SHA (ECR image)
            ├── → RDS Aurora PostgreSQL (Multi-AZ, automated backups)
            └── → ElastiCache Redis (session store / cache)

CI/CD
└── GitHub Actions (OIDC) → IAM Role (branch-scoped)
    ├── ECR → push image
    └── ECS → update-service (rolling deploy, 0 downtime)

Secrets & State
├── Terraform state → S3 (versioned) + DynamoDB (state lock)
├── App secrets → AWS Secrets Manager (injected at runtime, never in env vars)
└── Logs → CloudWatch Logs (30-day retention, structured JSON)
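The "min 2 tasks" floor in the diagram can be enforced with Application Auto Scaling. This is a sketch under assumptions: the cluster and service names follow the pipeline above (app-prod, api), while the maximum capacity and CPU target are illustrative values.

```hcl
# Keep at least two tasks running; allow scale-out up to an assumed ceiling.
resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  resource_id        = "service/app-prod/api"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 6 # illustrative ceiling
}

# Target-tracking policy on average service CPU.
resource "aws_appautoscaling_policy" "api_cpu" {
  name               = "api-cpu-target"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 60 # illustrative target, in percent
  }
}
```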
Tools We Use
Every tool is chosen for operational simplicity — your team will need to operate this after we hand it over.
Common Mistakes We Prevent
- Storing long-lived AWS credentials in GitHub Secrets instead of using OIDC federation — one leaked secret is a full account compromise
- Creating production-only resources manually in the console and forgetting to add them to Terraform, causing drift that breaks the next terraform apply
- Skipping the staging environment and deploying straight from dev to prod, removing the only safety net for catching environment-specific failures
- Using a single DynamoDB table or S3 bucket for all environments' Terraform state, which creates a blast radius where a broken prod deployment can corrupt dev state
Key principle: We build for the team that has to maintain this at 2am. Simple, documented, and automated is always better than clever and opaque. For every infrastructure decision, we ask: will a mid-level engineer who didn't build this be able to debug it under pressure?
What You Get
- Terraform codebase with full AWS infrastructure defined as code, organized by module
- Three-environment setup (dev / staging / prod) with documented parity and intentional differences
- GitHub Actions CI/CD pipeline with automated testing, image builds, and environment-gated deployments
- CloudWatch dashboards and Slack alerting configured with meaningful thresholds
- S3 + DynamoDB Terraform state backend with per-environment isolation
- Architecture diagram and runbooks for common operations (rollback, scaling, database failover)
- 60-minute handover session with your engineering team and recorded walkthrough
Timeline & What to Expect
After handover, your team owns the infrastructure. We provide 30 days of async support for questions that come up as you get familiar with the setup. Most teams are self-sufficient by week two — the Terraform codebase is designed to be readable, and the runbooks cover the most common operational tasks in plain English.
Frequently Asked Questions
Do we need Kubernetes?
Almost certainly not at the early stage. Kubernetes adds significant operational overhead — cluster management, node group upgrades, networking complexity, and a steep learning curve. ECS Fargate gives you container orchestration without any of that management burden. We recommend Kubernetes when you have multiple teams deploying independent services and need the deployment consistency and ecosystem that comes with it — typically at 10+ engineers and 5+ services.
Can this work with our existing AWS account?
Yes. We start with an account audit to understand what's already there, import existing resources into Terraform where possible, and design the new infrastructure around what you already have. If the existing account is in poor shape (e.g., everything in the default VPC, no IAM structure), we'll surface that early and recommend whether to clean up in place or migrate to a new account structure.
What if we're not on AWS?
We work primarily with AWS because it's where most early-stage startups land and where our tooling expertise is deepest. If you're on GCP or Azure, the principles are identical — Terraform still manages the infrastructure, GitHub Actions still runs the pipeline — but the specific modules and services differ. Reach out and we'll be direct about whether we're the right fit for your cloud provider.
When This Is the Right Fit
This engagement is right for you if you're at the stage where manual deployments are starting to slow you down, you're onboarding more engineers and need repeatable processes, or you're heading toward a Series A and need infrastructure that looks credible to technical due diligence. Teams with 2–15 engineers who are AWS-hosted get the most value. If you already have Terraform but no CI/CD, we can plug in just the pipeline piece without rebuilding everything.
This is not the right fit if you have a dedicated platform engineering team with existing infrastructure standards — in that case you need execution bandwidth, not an architecture engagement. It's also not right if you're pre-product and not yet sure what your infrastructure requirements will be; in that case, start with a minimal serverless setup and engage us when you have clearer requirements.