Kubernetes Migration + Container Orchestration for Microservices

The Challenge

Kubernetes migrations fail most often because teams try to lift-and-shift without understanding the shift in operational model. VMs and containers handle networking, storage, configuration, and lifecycle management differently. The complexity compounds when you have stateful services, shared databases, or services that weren't designed with container constraints in mind.

We've seen teams spend six months on a migration that stalled because they didn't plan for persistent volume management, or because their CI/CD pipeline was hardcoded to SSH deployment scripts that couldn't adapt to Helm releases. Kubernetes also introduces a new failure mode: the platform itself can have misconfigurations that silently undermine the security of every workload running on it — open network policies, overly permissive RBAC, or service accounts with cluster-admin rights granted "temporarily."

Our approach avoids those traps by assessing thoroughly before touching anything in production. We treat the migration as a series of small, validated steps — not a big-bang cutover that requires everything to work perfectly on day one.

Signs You Need This

Deployment times exceed 10–15 minutes because there's no parallelism in your current process
Your services are already containerized but running on VM-based deployments or ECS and you're hitting operational scaling limits
You need per-service auto-scaling but your current platform doesn't support it granularly
Multiple teams are deploying independently and there's no consistent deployment standard across services
Your infrastructure costs are growing faster than your service count because of inefficient VM-per-service allocation
You're adopting GitOps practices and need a platform that supports declarative continuous deployment

How We Approach It

Service Inventory & Dependency Mapping

We map every service — what it does, how it communicates (sync REST/gRPC vs. async message queues), what state it holds, what its resource profile looks like under load, and how it currently deploys. Stateful services (databases, caches, message brokers) are identified early because they need a fundamentally different migration path than stateless API services. We also identify cross-service dependencies that could create a "migration order problem" — services that can't migrate until their downstream dependencies are already on the cluster.

Containerization Audit & Optimization

We review existing Dockerfiles (or write them from scratch), optimize base images for size and security using distroless or Alpine bases, enforce non-root user constraints, and remove secrets from build layers. Images are scanned with Trivy for known CVEs in OS packages and application dependencies before going near the registry. A lean, secure image is the foundation of a reliable deployment; oversized images with root-level processes are a security and operational liability.

A production-hardened Node.js Dockerfile using a multi-stage build and a non-root user:

Dockerfile: Multi-Stage, Non-Root, Minimal

# Stage 1: Build dependencies
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev   # install without devDependencies (--only=production is deprecated)

# Stage 2: Runtime — use distroless for smallest attack surface
FROM gcr.io/distroless/nodejs20-debian12 AS runtime
WORKDIR /app

# Copy only the production artifacts from builder
COPY --from=builder /app/node_modules ./node_modules
COPY --chown=nonroot:nonroot src/ ./src/

# Run as non-root user (distroless includes nonroot at UID 65532)
USER nonroot

EXPOSE 3000
# Distroless Node.js images set ENTRYPOINT to the node binary, so CMD
# supplies only the script path (exec form; PID 1 is node, and there is no shell)
CMD ["src/index.js"]

# Trivy scan in CI before pushing to ECR:
# trivy image --exit-code 1 --severity HIGH,CRITICAL myapp:latest
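
The Trivy gate mentioned in the comment above could be wired into GitHub Actions roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline itself: the workflow file name, the `myapp` image name, and the choice to run Trivy from the public `aquasec/trivy` container image are all illustrative.

```yaml
# .github/workflows/build-and-scan.yaml (hypothetical file name)
name: build-and-scan

on:
  push:
    branches: [main]

jobs:
  build-and-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Check out source
        uses: actions/checkout@v4

      - name: Build image
        # "myapp" is a placeholder image name
        run: docker build -t myapp:${{ github.sha }} .

      - name: Scan image with Trivy
        # Mounting the Docker socket lets the Trivy container read the
        # locally built image. A non-zero exit code fails the job, so any
        # later push step never runs when HIGH or CRITICAL findings exist.
        run: |
          docker run --rm \
            -v /var/run/docker.sock:/var/run/docker.sock \
            aquasec/trivy:latest image \
            --exit-code 1 --severity HIGH,CRITICAL \
            myapp:${{ github.sha }}

      # Registry login and push (for example to ECR) would follow here.
```

Tagging the image with the commit SHA rather than `latest` keeps the scanned artifact and the deployed artifact identical.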

Cluster Setup on EKS / GKE with Terraform

We provision the cluster with managed node groups, configure the Cluster Autoscaler and HPA, set up network policies for pod-to-pod traffic control (default-deny with explicit allows per service), and deploy an NGINX Ingress controller with TLS termination. On EKS we configure IRSA (IAM Roles for Service Accounts) so pods get AWS permissions via temporary credentials rather than node instance profiles; on GKE, Workload Identity plays the same role. All cluster configuration lives in Terraform, with no manual kubectl apply commands that disappear from the audit trail.
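
To make the default-deny model and the IRSA binding concrete, here is a hedged sketch of the Kubernetes objects involved. The pod label, the `ingress-nginx` namespace name, and the IAM role ARN are assumptions for illustration; the `production` namespace, `api-service` name, and port 3000 follow the examples used elsewhere on this page. Network policies are only enforced when the cluster's CNI supports them.

```yaml
# 1. Deny all ingress and egress for every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}          # empty selector matches all pods
  policyTypes:
    - Ingress
    - Egress
---
# 2. Explicit allow: the ingress controller may reach api-service on 3000
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-api-service
  namespace: production
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: api-service   # assumed pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx  # assumed namespace
      ports:
        - protocol: TCP
          port: 3000
---
# 3. Explicit allow: DNS lookups, which default-deny egress would block
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# 4. IRSA: pods using this ServiceAccount assume the annotated IAM role
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-service
  namespace: production
  annotations:
    # hypothetical role ARN; the role's trust policy must reference
    # the cluster's OIDC provider
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/api-service-prod
```

Each additional service would get its own allow policy in the same style; any other egress a service needs (databases, AWS APIs) has to be allowed explicitly as well.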

Helm Charts for Every Service + ArgoCD GitOps

Each service gets a Helm chart — Deployment, Service, HPA, ConfigMap, and Ingress resources. Secrets are injected via AWS Secrets Manager or Vault using the External Secrets Operator, not hardcoded in values files. We establish a chart structure convention with a base chart and environment-specific overrides so your team can maintain and extend charts without deep Kubernetes expertise. ArgoCD provides GitOps-style continuous deployment: the cluster's desired state is always the contents of the Git repository, and ArgoCD reconciles any drift automatically.
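
As an illustration of the secrets flow described above, an External Secrets Operator resource might look like the sketch below. The store name `aws-secrets-manager`, the remote key `prod/api-service`, and the `database_url` property are hypothetical, and the `v1beta1` API version is an assumption that depends on the operator release installed (newer releases also serve `v1`).

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-service-secrets
  namespace: production
spec:
  refreshInterval: 1h            # how often the operator re-reads the source
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager    # hypothetical store configured separately
  target:
    name: api-service-secrets    # Kubernetes Secret the operator creates
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL    # key inside the Kubernetes Secret
      remoteRef:
        key: prod/api-service    # hypothetical secret name in AWS Secrets Manager
        property: database_url   # JSON field within that secret
```

The Deployment template can then reference `api-service-secrets` through `envFrom` or a volume, so no secret value ever appears in a values file or in Git.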

Zero-Downtime Migration with Traffic Shifting

We use a parallel-run strategy: new services deploy on Kubernetes while old instances stay up. Traffic is shifted gradually using DNS weighted routing or ALB target group weights — starting at 5% to the new deployment, validating error rates and latency, then incrementally shifting to 100%. Each service is validated independently using smoke tests and real traffic monitoring before the old instance is decommissioned. No big-bang cutovers, and rollback is a DNS change or a weight adjustment — not an emergency hotfix deployment.

Helm + ArgoCD GitOps in Practice

Every service gets a parameterized Helm chart. ArgoCD watches the Git repo and reconciles the cluster to match — no manual kubectl apply in production:

charts/api-service/values.yaml + argocd-app.yaml

# Helm values — environment-specific overrides in values-prod.yaml
replicaCount: 3    # chart templates typically ignore this when autoscaling.enabled is true

image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api-service
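  # Set by CI to the Git commit SHA (${{ github.sha }}). With ArgoCD, commit the
  # tag to Git (e.g. values-prod.yaml) rather than passing --set at deploy time.
  tag: ""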
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 3000

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

# Secrets are not stored in values files. The External Secrets Operator syncs
# them from AWS Secrets Manager into Kubernetes Secrets on a refresh interval.

---
# ArgoCD Application — GitOps reconciliation for production
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-service-prod
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/myorg/infra
    targetRevision: main
    path: charts/api-service
    helm:
      valueFiles:
        - values.yaml
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # remove resources deleted from Git
      selfHeal: true    # revert manual kubectl changes
    syncOptions:
      - CreateNamespace=true
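
The `values-prod.yaml` file referenced in the Application above only needs the keys that differ from the base values. A possible sketch follows; the specific numbers are illustrative assumptions, not sizing recommendations.

```yaml
# charts/api-service/values-prod.yaml (illustrative overrides only)
# Helm merges maps key by key, so anything not listed here
# (service.port, autoscaling.enabled, image.repository, ...) keeps
# the value from values.yaml.
autoscaling:
  minReplicas: 3
  maxReplicas: 20

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
```

Because ArgoCD lists `values.yaml` before `values-prod.yaml`, the production file wins wherever both define the same key.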

Tools We Use

Our Kubernetes migration toolkit is built around tools your team can operate long-term without specialist knowledge.

EKS / GKE, Helm, Terraform, kubectl, Trivy, GitHub Actions, NGINX Ingress, Prometheus, Grafana, ArgoCD

Common Mistakes We Prevent

Migrating all services simultaneously instead of incrementally — one bad Helm chart misconfiguration shouldn't bring down ten services
Granting cluster-admin RBAC to application service accounts because it was faster than defining a precise permissions policy
Leaving Kubernetes' default allow-all pod networking in place (no NetworkPolicy applied), which forfeits one of the platform's primary security benefits: network segmentation between services
Storing secrets in ConfigMaps or baking them into container images instead of using a proper secrets management integration
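
For contrast with the cluster-admin shortcut listed above, a narrowly scoped permission set could look like this sketch. It assumes, purely for illustration, that the service only needs to read a single ConfigMap named `api-service-config`; the Role and ServiceAccount names are likewise hypothetical.

```yaml
# Namespaced Role: read access to exactly one ConfigMap
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-service-config-reader
  namespace: production
rules:
  - apiGroups: [""]                          # "" is the core API group
    resources: ["configmaps"]
    resourceNames: ["api-service-config"]    # hypothetical ConfigMap
    verbs: ["get"]
---
# Bind the Role to the service's own ServiceAccount only
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-service-config-reader
  namespace: production
subjects:
  - kind: ServiceAccount
    name: api-service
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: api-service-config-reader
```

A Role and RoleBinding stay confined to one namespace, whereas a ClusterRoleBinding to cluster-admin grants every verb on every resource in the cluster.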

Migration order that works: Always migrate one service at a time, starting with the least critical, most stateless service in your stack. Confidence builds with each successful migration. By the time you get to core services, your team knows the platform and can execute the migration without stress.

What You Get

EKS/GKE cluster provisioned via Terraform with autoscaling, network policies, and IRSA configured
Helm charts for all migrated services with documented values structure and environment overrides
ArgoCD GitOps setup with application definitions and sync policies for each service
GitHub Actions CI/CD pipeline integrated with Helm deployments and ArgoCD image updates
Prometheus + Grafana observability stack with per-service dashboards and alerting rules
Migration runbook, rollback procedures, and Kubernetes operations guide for your team
Team training session covering day-to-day operations, debugging, and common cluster issues

Timeline & What to Expect

Week 1–2: Service inventory, dependency mapping, Dockerfile audit/optimization, Trivy scanning, cluster architecture design

Week 3–4: EKS/GKE cluster provisioning via Terraform, network policies, IRSA, ingress controller, Helm chart development

Week 5–6: Service-by-service migration with traffic shifting, parallel validation, observability stack deployment

Week 7: Production cutover, ArgoCD GitOps setup, documentation, team handover and training session

After the migration, your team operates the cluster independently. The Helm chart structure and ArgoCD setup are designed to make routine deployments a Git push — no Kubernetes expertise required for day-to-day operations. We remain available for async support during the 30-day stabilization window.

Frequently Asked Questions

How long does a typical migration take?

For a team with 5–15 microservices already containerized, the full migration takes 6–8 weeks. If services aren't yet containerized, add 1–2 weeks for Dockerization. The timeline is driven primarily by the number of services and the complexity of stateful dependencies — a monolith split takes longer than migrating independent microservices.

What about stateful services like databases?

We generally recommend keeping databases outside Kubernetes — running RDS, ElastiCache, or managed database services rather than Postgres in a StatefulSet. Managed services give you automated backups, failover, and maintenance without the operational complexity of persistent volumes, storage classes, and stateful pod scheduling. We'll migrate stateless services to Kubernetes and keep stateful services on managed AWS offerings unless there's a compelling reason otherwise.

Will there be downtime during the migration?

No — we use the parallel-run traffic shifting approach specifically to avoid downtime. Each service runs in both the old and new environment simultaneously, with traffic gradually shifted and monitored. If we see error rate increases or latency regressions, we shift traffic back immediately. Downtime is only a risk if you do a big-bang cutover, which we explicitly avoid.

When This Is the Right Fit

Kubernetes is the right investment if you're hitting VM scaling limits, if your deployment pipeline takes more than 10 minutes per service, if you have independent teams deploying independent services who need deployment isolation, or if you're already containerized and need proper orchestration beyond what ECS Fargate provides. The break-even point in operational efficiency is typically at 5+ microservices with meaningful traffic.

This is not the right fit if you have a monolith that hasn't been decomposed — migrating a monolith to Kubernetes adds operational complexity without the benefits of service independence. It's also not right if your team has no Kubernetes experience and no appetite to develop it; in that case, ECS Fargate gives you 80% of the benefit with a fraction of the operational overhead.