🏗️ Design Full Production Stack — System Design Interview Guide

Hard · Fundamentals

Design a complete production-ready infrastructure stack for a large-scale web application, covering all layers: CDN, API gateway, microservices, data stores, observability, and CI/CD.

Functional requirements

Serve millions of users across multiple regions
Microservices communicating via REST and gRPC
Multiple data stores: relational, cache, search, object storage
CI/CD pipeline: code push to production in < 30 minutes
Observability: metrics, logs, traces for every service
Secrets management and configuration management

Non-functional requirements & scale

Zero-downtime deployments (rolling or blue-green)
99.99% availability across all services
Auto-scaling: respond to traffic changes within 2 minutes
Security: all traffic encrypted, no secrets in code
Mean time to recovery (MTTR) < 5 minutes for P0 incidents
Cost monitoring and budget alerts
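
As a rough check on what the availability target implies, it can be converted into a downtime budget. The sketch below is illustrative arithmetic only: the function name, the 365-day year, and the 30-day window are assumptions, not part of the original requirements.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Minutes of allowed downtime for a given availability over a period."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100.0)


yearly = downtime_budget_minutes(99.99)        # about 52.6 minutes per year
monthly = downtime_budget_minutes(99.99, 30)   # about 4.3 minutes per 30 days
print(f"99.99% over a year: {yearly:.1f} min")
print(f"99.99% over 30 days: {monthly:.1f} min")
```

Under these assumptions, one P0 incident that uses the full 5-minute MTTR target already exceeds the 30-day budget of roughly 4.3 minutes, so the availability and MTTR targets are worth discussing together.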

Capacity estimation

This is the "big picture" system design. It covers how all components fit together in a real production environment. The key focus is on reliability, operability, security, and cost efficiency.

Core entities

Service — Docker container + Kubernetes Deployment. Stateless, health-checked, auto-scaled.
DataStore — Managed service (RDS, ElastiCache, Elasticsearch). Automated backups, multi-AZ.
Pipeline — GitHub → CI (build/test) → CD (deploy to staging → prod). Canary releases.

API design

DNS app.example.com → CloudFront → ALB → ECS/EKS — Request routing through full stack.
Internal Service Mesh (Istio/Envoy) — mTLS between all services, circuit breaking, observability.
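
To make the circuit-breaking behavior concrete, here is a minimal in-process sketch of the pattern. It is a conceptual illustration only: in a mesh such as Istio the behavior is configured in the proxy rather than written in application code, and the class name, thresholds, and injectable clock below are all hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Trip at the threshold, or immediately if a half-open trial failed.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def always_fails():
    raise ValueError("downstream error")


fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=10.0,
                         clock=lambda: fake_now[0])
for _ in range(2):
    try:
        breaker.call(always_fails)
    except ValueError:
        pass
print(breaker.state)   # open: further calls fail fast
fake_now[0] = 11.0
print(breaker.state)   # half-open: one trial call is allowed
```

The point of failing fast is that a struggling downstream service stops receiving traffic while it recovers, instead of accumulating retries and timeouts from every caller.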

High-level design

Internet → CloudFront (CDN + WAF) → ALB → API Gateway → Microservices on EKS. Data: RDS + ElastiCache + S3 + Elasticsearch. Observability: Datadog/CloudWatch. CI/CD: GitHub Actions → ECR → ArgoCD.

Deep dives

🚀 Zero-Downtime Deployment

Rolling deployment: replace pods one by one. A readinessProbe ensures each new pod is ready before it receives traffic.
Blue-green: run the current production environment (blue) alongside an identical environment running the new version (green). Switch traffic at the load balancer level; rolling back means switching back.
Canary: route 5% of traffic to the new version, monitor the error rate, then gradually increase the share or roll back.
Feature flags: deploy code switched "off", then enable it for a percentage of users without redeploying.
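
The canary loop (start at 5%, watch the error rate, widen or roll back) can be sketched as a small decision function. This is a simplified illustration, not how any particular rollout tool works: the step schedule after 5%, the 1-percentage-point error threshold, and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CanaryDecision:
    action: str           # "promote", "rollback", or "done"
    traffic_percent: int  # share of traffic the new version should receive next


def next_canary_step(current_percent: int,
                     canary_error_rate: float,
                     baseline_error_rate: float,
                     steps: Sequence[int] = (5, 25, 50, 100),
                     max_error_delta: float = 0.01) -> CanaryDecision:
    """Decide the next traffic share for the canary from observed error rates."""
    # Roll back if the canary is noticeably worse than the stable version.
    if canary_error_rate - baseline_error_rate > max_error_delta:
        return CanaryDecision("rollback", 0)
    # Otherwise move to the next larger step in the schedule.
    for step in steps:
        if step > current_percent:
            return CanaryDecision("promote", step)
    return CanaryDecision("done", current_percent)


print(next_canary_step(5, canary_error_rate=0.002, baseline_error_rate=0.001))
print(next_canary_step(25, canary_error_rate=0.05, baseline_error_rate=0.001))
```

Comparing the canary against the baseline, rather than against a fixed number, keeps the check meaningful when the whole system is having a bad day for unrelated reasons.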

👁️ Observability Stack

Three pillars:
(1) Metrics: Datadog/Prometheus for CPU, latency, error rate, and saturation. Alert on SLOs.
(2) Logs: CloudWatch Logs / Elasticsearch. Use structured JSON logs with a correlation ID per request.
(3) Traces: Jaeger/X-Ray for the end-to-end request path across microservices. The service mesh sidecars can generate spans and trace headers, but each service still has to forward the incoming trace headers on its outbound calls.
Dashboard: SLO burn rate, P99 latency heatmap, error budget.
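
As one way to picture structured JSON logs with a per-request correlation ID, the sketch below uses only the Python standard library. The logger name, field names, and the `handle_request` helper are hypothetical; a real service would typically read the ID from an incoming request header.

```python
import contextvars
import io
import json
import logging
import uuid
from typing import Optional, TextIO

# Holds the ID of the request currently being handled in this context.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(name: str, stream: TextIO) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's ID if present, otherwise mint a new one."""
    request_id = incoming_id or str(uuid.uuid4())
    token = correlation_id.set(request_id)
    try:
        logger.info("request received")
        logger.info("request completed")
    finally:
        correlation_id.reset(token)  # do not leak the ID into the next request
    return request_id


buffer = io.StringIO()
demo_logger = build_logger("orders", buffer)
handle_request(demo_logger, incoming_id="req-123")
print(buffer.getvalue(), end="")
```

Because every line carries the same `correlation_id`, a log search for that value returns the full story of one request, even when its lines are interleaved with other traffic.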

🔐 Security Layers

WAF (on CloudFront): block SQL injection, XSS, and bad bots.
DDoS: AWS Shield.
API Gateway: JWT auth, rate limiting, request validation.
mTLS: service-to-service via Istio (mutual certificate verification).
Secrets: AWS Secrets Manager or HashiCorp Vault; never in env variables or code.
IAM: least-privilege roles per service.
VPC: private subnets for databases, with only the ALB in a public subnet.
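
Rate limiting at the gateway is often explained with a token bucket, which allows short bursts while capping the sustained rate. The sketch below is one common approach, shown in-process for clarity; the source does not say which algorithm the gateway uses, and the class name, parameters, and injectable clock are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity      # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_s)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


fake_now = [0.0]
bucket = TokenBucket(rate_per_s=1.0, capacity=2.0, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]   # two allowed, third rejected
fake_now[0] = 1.0                            # one second later, one token is back
after_wait = bucket.allow()
print(burst, after_wait)
```

In a multi-instance gateway the bucket state would have to live in shared storage (for example a cache) keyed by client or API key, which this single-process sketch leaves out.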

💰 Cost Optimization

Reserved Instances: 1- or 3-year commitment for the stable baseline (40-60% savings).
Spot Instances: stateless workers (up to 90% savings).
S3 Intelligent-Tiering: automatically moves objects between frequent, infrequent, and archive access tiers.
Auto-scaling: scale non-prod environments to zero at night.
Data transfer: keep traffic within the same AZ where possible (inter-AZ transfer costs about $0.01/GB in each direction).
CloudFront: reduces origin traffic (cached responses are cheaper to serve).
Review costs monthly with AWS Cost Explorer.
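
To see how a reserved baseline plus Spot burst capacity changes the bill, a back-of-the-envelope calculation helps. Everything numeric in this sketch is hypothetical (the hourly price, instance counts, 730 hours per month, and the chosen discounts); only the discount ranges come from the text above.

```python
def monthly_compute_cost(on_demand_hourly: float,
                         baseline_instances: int,
                         burst_instance_hours: float,
                         reserved_discount: float = 0.40,
                         spot_discount: float = 0.70,
                         hours_per_month: float = 730.0) -> dict:
    """Compare an all-on-demand bill with reserved baseline + Spot burst."""
    baseline_hours = baseline_instances * hours_per_month
    reserved = baseline_hours * on_demand_hourly * (1 - reserved_discount)
    spot = burst_instance_hours * on_demand_hourly * (1 - spot_discount)
    all_on_demand = (baseline_hours + burst_instance_hours) * on_demand_hourly
    optimized = reserved + spot
    return {
        "all_on_demand": all_on_demand,
        "optimized": optimized,
        "savings_fraction": 1 - optimized / all_on_demand,
    }


# Hypothetical workload: 10 always-on instances plus 1000 burst instance-hours.
estimate = monthly_compute_cost(on_demand_hourly=0.10, baseline_instances=10,
                                burst_instance_hours=1000,
                                reserved_discount=0.50, spot_discount=0.90)
print({k: round(v, 2) for k, v in estimate.items()})
```

In an interview, the takeaway is the shape of the calculation: commit only to the load that is always present, and push interruptible work to the cheapest capacity.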

Scaling considerations

Kubernetes HPA (Horizontal Pod Autoscaler) scales pods based on CPU/custom metrics
Multi-AZ deployment for all stateful resources (RDS, ElastiCache)
CloudFront can reduce origin load by 80% or more for cacheable content, depending on the cache hit ratio
Service mesh (Istio): circuit breaking, retry, timeout policies per service
Chaos Engineering: deliberately inject failures to test resilience (Netflix Chaos Monkey)
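
For the HPA item above, the core scaling rule documented for Kubernetes is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that rule; it ignores stabilization windows, pod readiness, and multiple metrics, and the function name and replica bounds are illustrative assumptions.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified form of the Horizontal Pod Autoscaler scaling rule."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, current_metric=90, target_metric=60))   # scale out to 6
print(desired_replicas(4, current_metric=30, target_metric=60))   # scale in to 2
```

Because the result is proportional to how far the metric is from its target, a large spike produces a large jump in replicas in a single step rather than one pod at a time.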

What interviewers expect by level

Junior: Describe 3-tier web architecture (CDN, servers, DB). Know basic CI/CD flow.
Mid: Kubernetes deployment, managed services (RDS vs self-managed), basic observability, zero-downtime deployment.
Senior: Service mesh, canary deployments, SLO/error-budget management, multi-region active-active, security layers.
Staff: FinOps cost optimization, organization-wide platform engineering, multi-cloud strategy, compliance (SOC2/ISO27001).
