AWS Infrastructure Setup + CI/CD for Early-Stage Startups

The Challenge

Early-stage startups face a paradox: they need to move fast, but every shortcut in infrastructure becomes expensive technical debt within six months. Manual deployments lead to inconsistency between environments, "works on my machine" bugs, and late-night firefighting when production breaks differently from staging. The teams we work with typically have a few engineers, no dedicated DevOps person, and a product that's growing faster than their deployment process.

The cost of fixing infrastructure debt compounds quickly. An engineer spending two hours a week on manual deployment tasks loses roughly 100 hours a year (2 × 52 = 104), and that number only grows as the team scales. More critically, without environment parity, each deployment is a gamble: the thing that worked in staging may not work in production because the configurations diverged months ago.

The goal isn't to over-engineer. A two-person startup doesn't need Kubernetes, service meshes, or a multi-region active-active setup. What they need is something simple enough to maintain and solid enough to scale to Series A without a full infrastructure rebuild. That's the bar we build to.

Signs You Need This

Deployments happen via SSH or manual console clicks rather than an automated pipeline
Your staging environment differs from production in ways nobody fully understands
New engineers take more than a day to get a working local environment that mirrors production
Infrastructure changes aren't tracked in version control — nobody knows what was changed when
You've had a production incident caused by a configuration difference between environments
Your CTO or a senior investor has flagged infrastructure as a due diligence concern

How We Approach It

Discovery & Architecture Mapping

We start by understanding your application: language, framework, how it scales horizontally, what it connects to (databases, queues, third-party APIs), and what your team is comfortable operating. We review your current AWS account, map existing resources, and identify what needs to be imported into Terraform vs. rebuilt. We don't impose a Kubernetes cluster on a two-person team shipping a Rails app — we recommend ECS Fargate if container orchestration is needed and Lambda if it isn't.

IaC Foundation with Terraform

Every resource is defined as code — nothing is clicked into existence in the console. We use a module-per-concern structure with remote state in S3 + DynamoDB locking. The layout we provision on day one:

infra/ (Terraform)

infra/
├── environments/
│   ├── dev/
│   │   ├── main.tf           # calls shared modules
│   │   ├── backend.tf        # S3 state bucket + DynamoDB lock table
│   │   └── terraform.tfvars  # instance sizes, replica counts, domain
│   ├── staging/              # identical structure, different values
│   └── prod/
└── modules/
    ├── networking/            # VPC, subnets, NACLs, security groups, NAT
    ├── compute/               # ECS cluster, task defs, ALB, auto-scaling
    └── data/                  # RDS (Multi-AZ), ElastiCache, S3 buckets

Environments share the same modules and differ only in .tfvars. This makes drift structurally harder — a prod change runs through the same Terraform plan as staging, just with different variable inputs. If it breaks staging, it never reaches prod.
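As a rough illustration of the layout above, one environment directory could be wired up along these lines. This is a minimal sketch, not the delivered code: the bucket, lock table, and state key names, and the module inputs and outputs (vpc_cidr, vpc_id, private_subnet_ids, desired_count), are all assumptions made for the example.

```hcl
# environments/staging/backend.tf
# Bucket and table names are placeholders. Each environment gets its own pair.
terraform {
  backend "s3" {
    bucket         = "example-tfstate-staging"
    key            = "infra/terraform.tfstate"
    region         = "ap-south-1"
    dynamodb_table = "example-tflock-staging" # state locking
    encrypt        = true
  }
}

# environments/staging/main.tf
# Values for these variables come from terraform.tfvars in the same directory.
variable "vpc_cidr" { type = string }
variable "task_count" { type = number }

module "networking" {
  source   = "../../modules/networking"
  vpc_cidr = var.vpc_cidr
}

module "compute" {
  source = "../../modules/compute"

  # Hypothetical outputs of the networking module
  vpc_id             = module.networking.vpc_id
  private_subnet_ids = module.networking.private_subnet_ids
  desired_count      = var.task_count
}
```

The prod directory would contain the same two files with only the backend names and variable values changed, which is what keeps the environments structurally identical.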

Environment Strategy (Dev / Staging / Prod)

We create three environments that are structurally identical — same Terraform modules, different variable files. Staging mirrors production in terms of database engine version, ECS task sizing, and security group rules. Prod gets slightly more conservative resource limits and stricter change-management policies. We enforce environment parity from day one so there are no surprises at release time, and we document the intentional differences so future engineers know what to expect.
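One way to keep the intentional differences visible is to annotate them directly in the variable files. The sketch below is illustrative only: the variable names and every value are assumptions, not settings taken from a real engagement.

```hcl
# environments/staging/terraform.tfvars
db_engine_version     = "15.4" # parity: same engine version as prod
task_cpu              = 512    # parity: same ECS task sizing as prod
task_memory           = 1024
backup_retention_days = 1      # intentional difference: staging data is disposable
deletion_protection   = false  # intentional difference: staging can be rebuilt freely

# environments/prod/terraform.tfvars
db_engine_version     = "15.4"
task_cpu              = 512
task_memory           = 1024
backup_retention_days = 14     # intentional difference: longer recovery window
deletion_protection   = true   # intentional difference: guards against accidental destroy
```

Comparing the two files side by side then doubles as a short record of what differs between environments and why.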

CI/CD Pipeline with GitHub Actions

Test → Build → ECR push → deploy to staging → manual gate → deploy to prod. Each stage gates the next. AWS authentication uses OIDC federation — no long-lived secrets stored in GitHub. Here's the actual pipeline structure we ship:

.github/workflows/deploy.yml (GitHub Actions)

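name: CI / Deploy

on:
  push:
    branches: [main]
  pull_request:

env:
  # Placeholder registry URI (AWS account IDs are 12 digits)
  ECR_REPO: 123456789012.dkr.ecr.ap-south-1.amazonaws.com/app

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test

  build-push:
    needs: test
    # Pull requests stop after the test job; only pushes to main build and deploy
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      id-token: write    # OIDC, no stored AWS access keys
      contents: read
    steps:
      - uses: actions/checkout@v4    # docker build needs the source tree
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
          aws-region: ap-south-1
      - run: |
          aws ecr get-login-password | docker login --username AWS \
            --password-stdin $ECR_REPO
          docker build -t $ECR_REPO:${{ github.sha }} -t $ECR_REPO:latest .
          docker push $ECR_REPO:${{ github.sha }}
          # Assumes the ECS task definitions reference :latest, so that
          # --force-new-deployment below picks up the new image. Pinning a
          # deploy to the SHA tag needs a new task definition revision instead.
          docker push $ECR_REPO:latest

  deploy-staging:
    needs: build-push
    if: github.ref == 'refs/heads/main'
    environment: staging
    runs-on: ubuntu-latest
    permissions:
      id-token: write    # each job requests its own OIDC token
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
          aws-region: ap-south-1
      - run: |
          aws ecs update-service --cluster app-staging \
            --service api --force-new-deployment
          # Block until the rollout settles so a failed deploy stops the pipeline
          aws ecs wait services-stable --cluster app-staging --services api

  deploy-prod:
    needs: deploy-staging
    environment: production   # requires named reviewer to approve
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}
          aws-region: ap-south-1
      - run: |
          aws ecs update-service --cluster app-prod \
            --service api --force-new-deployment
          aws ecs wait services-stable --cluster app-prod --services api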

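The environment: production gate in GitHub requires a named reviewer to approve before the prod job runs. The role behind role-to-assume is scoped by an IAM trust-policy condition on the sub claim of the GitHub OIDC token. The build job's token identifies the main branch. Jobs that declare an environment carry a sub claim naming that environment instead of the branch, so the trust policy has to list those environments, and the environments themselves should be restricted to deployments from main. A workflow run on a feature branch presents a token that matches none of these, so it cannot assume the role or trigger a production deployment.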
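A trust policy of this kind could be expressed in Terraform roughly as follows. This is a sketch under stated assumptions: the GitHub OIDC provider is taken to exist in the account already, example-org/example-app stands in for the real repository, and the role name is invented for illustration. GitHub sets the token's sub claim to the branch ref for ordinary jobs and to the environment name for jobs that declare an environment, which is why both forms are listed.

```hcl
# Assumption: the GitHub OIDC identity provider was created beforehand.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "github_deploy_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only tokens with one of these exact subjects may assume the role.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values = [
        "repo:example-org/example-app:ref:refs/heads/main",    # build job on main
        "repo:example-org/example-app:environment:staging",    # staging deploy job
        "repo:example-org/example-app:environment:production", # prod deploy job
      ]
    }
  }
}

resource "aws_iam_role" "github_deploy" {
  name               = "github-actions-deploy" # hypothetical role name
  assume_role_policy = data.aws_iam_policy_document.github_deploy_trust.json
}
```

Splitting this into one role per environment, each with narrower permissions, is a reasonable refinement once the single-role version is working.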

Observability from Day One

CloudWatch dashboards for CPU, memory, and request latency per service. Log groups per service with structured JSON logging so logs are queryable by request ID, user ID, or error type. Alarms for error rate spikes, deployment failures, and database connection exhaustion routed to Slack via SNS. You shouldn't need to SSH into a server to understand why something is slow — the dashboard should tell you. We set retention policies on log groups from the start to avoid unbounded CloudWatch costs.
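In Terraform, the retention policy and one of the alarms described above might look like the following. Treat it as a sketch: the resource names, the 5xx threshold, and the alb_arn_suffix variable are assumptions, and the SNS-to-Slack hookup is left out.

```hcl
variable "alb_arn_suffix" {
  type        = string
  description = "ARN suffix of the load balancer (hypothetical input)"
}

# Log group with explicit retention so storage costs stay bounded
resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/app-prod/api"
  retention_in_days = 30
}

# Topic that alarms publish to; forwarding to Slack is configured separately
resource "aws_sns_topic" "alerts" {
  name = "app-prod-alerts"
}

# Fires when targets return more than 10 5xx responses per minute
# for five consecutive minutes (illustrative threshold)
resource "aws_cloudwatch_metric_alarm" "api_5xx" {
  alarm_name          = "app-prod-api-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  dimensions          = { LoadBalancer = var.alb_arn_suffix }
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 10
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching" # no traffic is not an error
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
```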

Architecture at a Glance

What the target-state AWS infrastructure looks like — the resources Terraform provisions and how traffic flows through them:

Target state: AWS architecture

Internet
└── Route 53 → CloudFront (WAF + CDN)
                └── ALB (HTTPS only, HTTP 301 redirect)
                      └── ECS Fargate (min 2 tasks, spread across AZs)
                            ├── app:$GITHUB_SHA  (ECR image)
                            ├── → RDS Aurora PostgreSQL (Multi-AZ, automated backups)
                            └── → ElastiCache Redis (session store / cache)

CI/CD
└── GitHub Actions (OIDC) → IAM Role (branch-scoped)
      ├── ECR → push image
      └── ECS → update-service (rolling deploy, 0 downtime)

Secrets & State
├── Terraform state  → S3 (versioned) + DynamoDB (state lock)
├── App secrets      → AWS Secrets Manager (injected at runtime, never stored in task definitions, images, or the repo)
└── Logs             → CloudWatch Logs (30-day retention, structured JSON)
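To make the compute and secrets parts of the diagram more concrete, here is a hedged sketch of an ECS task definition and service. The variable names, container port, task sizing, and the DATABASE_URL secret name are assumptions. With this mechanism ECS resolves the secret when the task starts and hands it to the container process as an environment variable, so the plaintext value is not written into the task definition, the image, or the repository.

```hcl
variable "image_tag" { type = string }
variable "cluster_arn" { type = string }
variable "execution_role_arn" { type = string } # needs secretsmanager:GetSecretValue
variable "db_secret_arn" { type = string }
variable "private_subnet_ids" { type = list(string) }
variable "service_security_group_id" { type = string }

resource "aws_ecs_task_definition" "api" {
  family                   = "api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = var.execution_role_arn

  container_definitions = jsonencode([{
    name         = "api"
    image        = "123456789012.dkr.ecr.ap-south-1.amazonaws.com/app:${var.image_tag}"
    essential    = true
    portMappings = [{ containerPort = 3000 }]
    # Only the secret's ARN is stored here; the value is fetched at task start
    secrets = [{
      name      = "DATABASE_URL"
      valueFrom = var.db_secret_arn
    }]
  }])
}

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.api.arn
  launch_type     = "FARGATE"
  desired_count   = 2 # minimum of two tasks, placed across the listed subnets

  # Rolling deploy: start new tasks before stopping healthy old ones
  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [var.service_security_group_id]
  }
}
```

The load balancer attachment and auto-scaling policy are omitted to keep the sketch short.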

Tools We Use

Every tool is chosen for operational simplicity — your team will need to operate this after we hand it over.

Terraform, GitHub Actions, AWS ECR, AWS ECS / Fargate, AWS RDS, CloudWatch, DynamoDB (state locking), S3

Common Mistakes We Prevent

Storing long-lived AWS credentials in GitHub Secrets instead of using OIDC federation: a leaked access key keeps working until someone notices and rotates it, with every permission attached to it
Creating production-only resources manually in the console and forgetting to add them to Terraform, causing drift that breaks the next terraform apply
Skipping the staging environment and deploying straight from dev to prod, removing the only safety net for catching environment-specific failures
Using a single DynamoDB table or S3 bucket for all environments' Terraform state, which creates a blast radius where a broken prod deployment can corrupt dev state

Key principle: We build for the team that has to maintain this at 2am. Simple, documented, and automated is always better than clever and opaque. For every infrastructure decision we make, we ask: will a mid-level engineer who didn't build this be able to debug it under pressure?

What You Get

Terraform codebase with full AWS infrastructure defined as code, organized by module
Three-environment setup (dev / staging / prod) with documented parity and intentional differences
GitHub Actions CI/CD pipeline with automated testing, image builds, and environment-gated deployments
CloudWatch dashboards and Slack alerting configured with meaningful thresholds
S3 + DynamoDB Terraform state backend with per-environment isolation
Architecture diagram and runbooks for common operations (rollback, scaling, database failover)
60-minute handover session with your engineering team and recorded walkthrough

Timeline & What to Expect

Week 1: Discovery session, AWS account audit, architecture design, IaC foundation (VPC, networking, IAM roles, state backend)

Week 2: Environment provisioning (dev/staging/prod), ECS cluster + task definitions, RDS setup, ECR repositories

Week 3: GitHub Actions pipeline build, automated tests integration, staging deployment validation, observability setup

Week 4: Production cutover, runbook documentation, team handover session, 30-day support window begins

After handover, your team owns the infrastructure. We provide 30 days of async support for questions that come up as you get familiar with the setup. Most teams are self-sufficient by the second week of that window: the Terraform codebase is designed to be readable, and the runbooks cover the most common operational tasks in plain English.

Frequently Asked Questions

Do we need Kubernetes?

Almost certainly not at the early stage. Kubernetes adds significant operational overhead — cluster management, node group upgrades, networking complexity, and a steep learning curve. ECS Fargate gives you container orchestration without any of that management burden. We recommend Kubernetes when you have multiple teams deploying independent services and need the deployment consistency and ecosystem that comes with it — typically at 10+ engineers and 5+ services.

Can this work with our existing AWS account?

Yes. We start with an account audit to understand what's already there, import existing resources into Terraform where possible, and design the new infrastructure around what you already have. If the existing account is in poor shape (e.g., everything in the default VPC, no IAM structure), we'll surface that early and recommend whether to clean up in place or migrate to a new account structure.
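For resources that already exist, Terraform 1.5 and later supports declarative import blocks. The sketch below assumes a hand-created S3 bucket; the bucket name and resource label are hypothetical.

```hcl
# Tell Terraform that an existing bucket belongs to this resource address
import {
  to = aws_s3_bucket.uploads
  id = "example-uploads-bucket" # name of the bucket as it exists in the account
}

# The configuration the imported bucket will be managed by from now on
resource "aws_s3_bucket" "uploads" {
  bucket = "example-uploads-bucket"
}
```

Running terraform plan then shows the import alongside any attribute differences, and terraform plan -generate-config-out=generated.tf can draft the resource block when it has not been written yet.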

What if we're not on AWS?

We work primarily with AWS because it's where most early-stage startups land and where our tooling expertise is deepest. If you're on GCP or Azure, the principles are identical — Terraform still manages the infrastructure, GitHub Actions still runs the pipeline — but the specific modules and services differ. Reach out and we'll be direct about whether we're the right fit for your cloud provider.

When This Is the Right Fit

This engagement is right for you if you're at the stage where manual deployments are starting to slow you down, you're onboarding more engineers and need repeatable processes, or you're heading toward a Series A and need infrastructure that looks credible to technical due diligence. Teams with 2–15 engineers who are AWS-hosted get the most value. If you already have Terraform but no CI/CD, we can plug in just the pipeline piece without rebuilding everything.

This is not the right fit if you have a dedicated platform engineering team with existing infrastructure standards — in that case you need execution bandwidth, not an architecture engagement. It's also not right if you're pre-product and not yet sure what your infrastructure requirements will be; in that case, start with a minimal serverless setup and engage us when you have clearer requirements.