The Challenge
Kubernetes migrations fail most often because teams try to lift-and-shift without understanding the shift in operational model. VMs and containers handle networking, storage, configuration, and lifecycle management differently. The complexity compounds when you have stateful services, shared databases, or services that weren't designed with container constraints in mind.
We've seen teams spend six months on a migration that stalled because they didn't plan for persistent volume management, or because their CI/CD pipeline was hardcoded to SSH deployment scripts that couldn't adapt to Helm releases. Kubernetes also introduces a new failure mode: the platform itself can have misconfigurations that silently undermine the security of every workload running on it — open network policies, overly permissive RBAC, or service accounts with cluster-admin rights granted "temporarily."
Our approach avoids those traps by assessing thoroughly before touching anything in production. We treat the migration as a series of small, validated steps — not a big-bang cutover that requires everything to work perfectly on day one.
Signs You Need This
- Deployment times exceed 10–15 minutes because there's no parallelism in your current process
- Your services are already containerized but running on VM-based deployments or ECS and you're hitting operational scaling limits
- You need per-service auto-scaling but your current platform doesn't support it granularly
- Multiple teams are deploying independently and there's no consistent deployment standard across services
- Your infrastructure costs are growing faster than your service count because of inefficient VM-per-service allocation
- You're adopting GitOps practices and need a platform that supports declarative continuous deployment
How We Approach It
Service Inventory & Dependency Mapping
We map every service — what it does, how it communicates (sync REST/gRPC vs. async message queues), what state it holds, what its resource profile looks like under load, and how it currently deploys. Stateful services (databases, caches, message brokers) are identified early because they need a fundamentally different migration path than stateless API services. We also identify cross-service dependencies that could create a "migration order problem" — services that can't migrate until their downstream dependencies are already on the cluster.
Containerization Audit & Optimization
We review existing Dockerfiles (or write them from scratch), optimize base images for size and security using distroless or Alpine bases, enforce non-root user constraints, and remove secrets from build layers. Images are scanned with Trivy for known CVEs in OS packages and application dependencies before they go near the registry. A lean, secure image is the foundation of a reliable deployment; oversized images with root-level processes are a security and operational liability.
A production-hardened Node.js Dockerfile using a multi-stage build and a non-root user:
# Stage 1: Build dependencies
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
# Install without devDependencies (--omit=dev replaces the deprecated --only=production)
RUN npm ci --omit=dev
# Stage 2: Runtime — use distroless for smallest attack surface
FROM gcr.io/distroless/nodejs20-debian12 AS runtime
WORKDIR /app
# Copy only the production artifacts from builder
COPY --from=builder /app/node_modules ./node_modules
COPY --chown=nonroot:nonroot src/ ./src/
# Run as non-root user (distroless includes nonroot at UID 65532)
USER nonroot
EXPOSE 3000
# The distroless nodejs image already sets ENTRYPOINT to the node binary,
# so pass only the script path in exec form (PID 1 = node, not sh)
CMD ["src/index.js"]
# Trivy scan in CI before pushing to ECR:
# trivy image --exit-code 1 --severity HIGH,CRITICAL myapp:latest
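The Trivy gate mentioned in the comment above could be wired into CI roughly as follows. This is a minimal sketch, assuming GitHub Actions and the `aquasecurity/trivy-action` action; the workflow file name, the `myapp` image name, and the `ignore-unfixed` setting are illustrative assumptions, not part of the original pipeline.

```yaml
# .github/workflows/image-scan.yml (hypothetical file name)
name: image-scan
on:
  pull_request:
  push:
    branches: [main]

jobs:
  build-and-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Build locally so the scan runs before anything reaches the registry
      - name: Build image
        run: docker build -t myapp:${{ github.sha }} .

      - name: Scan image with Trivy
        # Pin to a release tag or commit SHA in practice
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          severity: HIGH,CRITICAL
          exit-code: "1"        # fail the job when matching findings exist
          ignore-unfixed: true  # assumption: skip CVEs that have no fix yet
```

A push-to-registry step would follow the scan, so an image with HIGH or CRITICAL findings never gets published.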
Cluster Setup on EKS / GKE with Terraform
We provision the cluster with managed node groups, configure the Cluster Autoscaler and HPA, set up network policies for pod-to-pod traffic control (default-deny with explicit allows per service), and deploy an NGINX Ingress controller with TLS termination. On EKS we configure IRSA (IAM Roles for Service Accounts) so each pod gets short-lived AWS credentials scoped to its own service account instead of inheriting the node's instance profile; on GKE, Workload Identity plays the same role. All cluster configuration lives in Terraform, with no manual kubectl apply commands that disappear from the audit trail.
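To make "default-deny with explicit allows per service" concrete, here is a sketch of the three kinds of policy involved. The `production` namespace, the `app: api-service` label, port 3000, and the `ingress-nginx` namespace name are assumptions for illustration. Enforcement also depends on the cluster running a network plugin that implements NetworkPolicy.

```yaml
# 1. Deny all ingress and egress for every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}          # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# 2. Allow DNS lookups, which the egress deny above would otherwise block
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# 3. Explicit allow: only the ingress controller may reach api-service
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-api-service
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-service     # hypothetical pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 3000
```

Each additional service-to-service call then needs its own allow rule, which is what makes the traffic graph auditable.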
Helm Charts for Every Service + ArgoCD GitOps
Each service gets a Helm chart — Deployment, Service, HPA, ConfigMap, and Ingress resources. Secrets are injected via AWS Secrets Manager or Vault using the External Secrets Operator, not hardcoded in values files. We establish a chart structure convention with a base chart and environment-specific overrides so your team can maintain and extend charts without deep Kubernetes expertise. ArgoCD provides GitOps-style continuous deployment: the cluster's desired state is always the contents of the Git repository, and ArgoCD reconciles any drift automatically.
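As a sketch of the secrets flow described above, an `ExternalSecret` tells the External Secrets Operator which remote entries to copy into a Kubernetes Secret. The store name, secret path, and key names below are hypothetical, and the `apiVersion` depends on the operator release installed (newer releases also serve `external-secrets.io/v1`).

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-service-secrets
  namespace: production
spec:
  refreshInterval: 1h                # how often the operator re-syncs
  secretStoreRef:
    name: aws-secrets-manager        # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: api-service-secrets        # Kubernetes Secret the operator creates
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL        # key inside the Kubernetes Secret
      remoteRef:
        key: prod/api-service        # hypothetical Secrets Manager entry
        property: DATABASE_URL       # JSON field within that entry
```

The Deployment template can then reference `api-service-secrets` through `envFrom` or a volume, so no secret value ever appears in a values file or in Git.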
Zero-Downtime Migration with Traffic Shifting
We use a parallel-run strategy: new services deploy on Kubernetes while old instances stay up. Traffic is shifted gradually using DNS weighted routing or ALB target group weights, starting at 5% to the new deployment, validating error rates and latency, then incrementally shifting to 100%. Each service is validated independently using smoke tests and real traffic monitoring before the old instance is decommissioned. There are no big-bang cutovers, and rollback is a DNS change or a weight adjustment rather than an emergency hotfix deployment. DNS-based shifting depends on short record TTLs for a rollback to take effect quickly.
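For the DNS variant, the initial 5% split amounts to two weighted records that share one name. The sketch below expresses that as a CloudFormation fragment purely to keep it in YAML; the same pair of records can be declared in Terraform. All hostnames are placeholders.

```yaml
# Hypothetical fragment: Route 53 sends weight / (sum of weights) of lookups
# to each record, so 95 and 5 give the new cluster about 5% of traffic.
Resources:
  LegacyWeightedRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: api.example.com.
      Type: CNAME
      TTL: "60"                      # short TTL so weight changes apply quickly
      SetIdentifier: legacy-vm
      Weight: 95
      ResourceRecords:
        - legacy-lb.example.com      # placeholder for the existing load balancer
  KubernetesWeightedRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: api.example.com.
      Type: CNAME
      TTL: "60"
      SetIdentifier: kubernetes
      Weight: 5
      ResourceRecords:
        - k8s-ingress.example.com    # placeholder for the cluster's ingress endpoint
```

Shifting forward means raising the second weight and lowering the first; rolling back means setting the Kubernetes weight to 0.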
Helm + ArgoCD GitOps in Practice
Every service gets a parameterized Helm chart. ArgoCD watches the Git repo and reconciles the cluster to match — no manual kubectl apply in production:
# Helm values — environment-specific overrides in values-prod.yaml
replicaCount: 3
image:
  repository: 123456789.dkr.ecr.us-east-1.amazonaws.com/api-service
  tag: "" # injected by CI: --set image.tag=${{ github.sha }}
  pullPolicy: IfNotPresent
service:
  type: ClusterIP
  port: 3000
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
# Secrets are not stored in values files. The External Secrets Operator syncs them
# from AWS Secrets Manager into Kubernetes Secrets on a refresh interval.
---
# ArgoCD Application — GitOps reconciliation for production
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-service-prod
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/myorg/infra
    targetRevision: main
    path: charts/api-service
    helm:
      valueFiles:
        - values.yaml
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true # remove resources deleted from Git
      selfHeal: true # revert manual kubectl changes
    syncOptions:
      - CreateNamespace=true
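To show where the `autoscaling` values end up, this is roughly what the chart's HPA template might render to. It is a sketch that assumes the chart targets a Deployment named `api-service` and uses the `autoscaling/v2` API; neither detail is stated in the values file.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service            # assumed Deployment name
  minReplicas: 2                 # autoscaling.minReplicas
  maxReplicas: 10                # autoscaling.maxReplicas
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # autoscaling.targetCPUUtilizationPercentage
```

When autoscaling is enabled, charts commonly leave `spec.replicas` out of the Deployment so that `replicaCount` and ArgoCD's `selfHeal` do not fight the replica count the HPA sets.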
Tools We Use
Our Kubernetes migration toolkit is built around tools your team can operate long-term without specialist knowledge.
Common Mistakes We Prevent
- Migrating all services simultaneously instead of incrementally — one bad Helm chart misconfiguration shouldn't bring down ten services
- Granting cluster-admin RBAC to application service accounts because it was faster than defining a precise permissions policy
- Leaving the cluster's default allow-all pod networking in place (no NetworkPolicy applied), which forfeits one of Kubernetes' primary security benefits: network segmentation between services
- Storing secrets in ConfigMaps or baking them into container images instead of using a proper secrets management integration
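As an alternative to cluster-admin, a precise permissions policy for an application service account can be quite small. The sketch below assumes a service that only needs to read one ConfigMap; the resource names are hypothetical.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-service
  namespace: production
---
# Namespaced Role: read access to a single named ConfigMap, nothing else
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-service-config-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["api-service-config"]   # hypothetical ConfigMap name
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-service-config-reader
  namespace: production
subjects:
  - kind: ServiceAccount
    name: api-service
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: api-service-config-reader
```

Many application pods never call the Kubernetes API at all, in which case the service account needs no Role or RoleBinding.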
Migration order that works: Always migrate one service at a time, starting with the least critical, most stateless service in your stack. Confidence builds with each successful migration. By the time you get to core services, your team knows the platform and can execute the migration without stress.
What You Get
- EKS/GKE cluster provisioned via Terraform with autoscaling, network policies, and IRSA configured
- Helm charts for all migrated services with documented values structure and environment overrides
- ArgoCD GitOps setup with application definitions and sync policies for each service
- GitHub Actions CI/CD pipeline integrated with Helm deployments and ArgoCD image updates
- Prometheus + Grafana observability stack with per-service dashboards and alerting rules
- Migration runbook, rollback procedures, and Kubernetes operations guide for your team
- Team training session covering day-to-day operations, debugging, and common cluster issues
Timeline & What to Expect
After the migration, your team operates the cluster independently. The Helm chart structure and ArgoCD setup are designed to make routine deployments a Git push — no Kubernetes expertise required for day-to-day operations. We remain available for async support during the 30-day stabilization window.
Frequently Asked Questions
How long does a typical migration take?
For a team with 5–15 microservices already containerized, the full migration takes 6–8 weeks. If services aren't yet containerized, add 1–2 weeks for Dockerization. The timeline is driven primarily by the number of services and the complexity of stateful dependencies — a monolith split takes longer than migrating independent microservices.
What about stateful services like databases?
We generally recommend keeping databases outside Kubernetes — running RDS, ElastiCache, or managed database services rather than Postgres in a StatefulSet. Managed services give you automated backups, failover, and maintenance without the operational complexity of persistent volumes, storage classes, and stateful pod scheduling. We'll migrate stateless services to Kubernetes and keep stateful services on managed AWS offerings unless there's a compelling reason otherwise.
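One way workloads in the cluster can address a managed database under a stable in-cluster name is an `ExternalName` Service, sketched here with a placeholder RDS endpoint and a hypothetical service name.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-db                # hypothetical in-cluster alias
  namespace: production
spec:
  type: ExternalName
  # Placeholder endpoint; cluster DNS answers lookups for orders-db with a CNAME to it
  externalName: orders-db.abc123example.us-east-1.rds.amazonaws.com
```

Pods would then connect to `orders-db.production.svc.cluster.local`. Clients that verify the TLS hostname may need to use the real endpoint instead, since the database certificate is issued for the RDS name rather than the alias.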
Will there be downtime during the migration?
No — we use the parallel-run traffic shifting approach specifically to avoid downtime. Each service runs in both the old and new environment simultaneously, with traffic gradually shifted and monitored. If we see error rate increases or latency regressions, we shift traffic back immediately. Downtime is only a risk if you do a big-bang cutover, which we explicitly avoid.
When This Is the Right Fit
Kubernetes is the right investment if you're hitting VM scaling limits, if your deployment pipeline takes more than 10 minutes per service, if you have independent teams deploying independent services who need deployment isolation, or if you're already containerized and need proper orchestration beyond what ECS Fargate provides. The break-even point in operational efficiency is typically at 5+ microservices with meaningful traffic.
This is not the right fit if you have a monolith that hasn't been decomposed: migrating a monolith to Kubernetes adds operational complexity without the benefits of service independence. It's also not right if your team has no Kubernetes experience and no appetite to develop it; in that case, ECS Fargate gives you most of the benefit with a fraction of the operational overhead.