🏗️ Design Full Production Stack — System Design Interview Guide

Hard · Fundamentals

Design a complete production-ready infrastructure stack for a large-scale web application, covering all layers: CDN, API gateway, microservices, data stores, observability, and CI/CD.


Functional requirements

Non-functional requirements & scale

Capacity estimation

This is the "big picture" system design: it covers how all the components fit together in a real production environment. The key focus areas are reliability, operability, security, and cost efficiency.

Core entities

API design

High-level design

Request path: Internet → CloudFront (CDN + WAF) → ALB → API Gateway → microservices on EKS. Data: RDS + ElastiCache + S3 + Elasticsearch. Observability: Datadog/CloudWatch. CI/CD: GitHub Actions builds images and pushes them to ECR; ArgoCD deploys them to the cluster.

Deep dives

🚀 Zero-Downtime Deployment

Rolling deployment: replace pods one at a time; a readinessProbe ensures each new pod is ready before it receives traffic. Blue-green: run the current production environment (blue) alongside an identical environment running the new version (green), then switch traffic at the load balancer; rollback means switching back. Canary: route 5% of traffic to the new version, monitor its error rate, then gradually increase the share or roll back. Feature flags: deploy code switched "off", then enable it for a percentage of users without redeploying.
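The canary and feature-flag ideas above can be sketched in a few lines of Python. This is a minimal illustration, not a real rollout controller: the ramp steps after 5%, the error-rate tolerance, the salt string, and every function name are assumptions made for the example.

```python
import hashlib

# Hypothetical ramp: start at 5% (as in the text), then widen in steps.
RAMP_STEPS = [5, 25, 50, 100]


def bucket(user_id: str, salt: str = "new-version") -> int:
    """Map a user to a stable bucket in [0, 100) so routing is sticky."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """True if this user falls inside the current canary share.

    The same check works for a percentage-based feature flag.
    """
    return bucket(user_id) < canary_percent


def next_canary_percent(current: int,
                        canary_error_rate: float,
                        baseline_error_rate: float,
                        tolerance: float = 0.01) -> int:
    """Decide the next traffic share; 0 means roll back."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0  # canary is noticeably worse than baseline: roll back
    for step in RAMP_STEPS:
        if step > current:
            return step  # healthy: widen to the next step
    return 100  # already fully rolled out
```

Because the bucket is derived from a hash of the user ID, a user who sees the new version at 5% keeps seeing it as the share grows, which avoids users flipping between versions mid-rollout.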

👁️ Observability Stack

Three pillars: (1) Metrics: Datadog/Prometheus for CPU, latency, error rate, and saturation; alert on SLOs. (2) Logs: CloudWatch Logs or Elasticsearch, with structured JSON logs and a correlation ID per request. (3) Traces: Jaeger/X-Ray for the end-to-end request path across microservices. A service mesh sidecar can generate trace headers and spans at each hop, but each service still has to forward the incoming trace headers on its outbound calls so the spans join into a single trace. Dashboard: SLO burn rate, P99 latency heatmap, error budget.
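As a rough sketch of two items from the list above, the Python below emits structured JSON logs tagged with a per-request correlation ID and computes an SLO burn rate. The field names, the `start_request` helper, and the use of `contextvars` are illustrative assumptions, not a prescribed setup; burn rate is taken here as the observed error ratio divided by the error budget (1 minus the SLO target).

```python
import json
import logging
import uuid
from contextvars import ContextVar
from typing import Optional

# Holds the correlation ID for the request currently being handled.
correlation_id = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def start_request(incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's correlation ID if present, else mint a new one."""
    cid = incoming_id or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent (1.0 = exactly on budget)."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("orders")  # hypothetical service name
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    start_request()
    log.info("order accepted")
```

With a 99.9% SLO, a sustained 1% error ratio gives a burn rate of about 10, meaning the budget is being consumed roughly ten times faster than the SLO allows.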

🔐 Security Layers

WAF (AWS WAF on CloudFront): block SQL injection, XSS, and bad bots. DDoS: AWS Shield. API Gateway: JWT auth, rate limiting, request validation. mTLS: service-to-service via Istio (mutual certificate verification). Secrets: AWS Secrets Manager or HashiCorp Vault; never in environment variables or code. IAM: least-privilege roles per service. VPC: private subnets for databases; only the ALB sits in a public subnet.
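The rate limiting mentioned for the API Gateway is commonly implemented as a token bucket. The sketch below is one minimal in-process version, assuming a single bucket per client and an injectable clock for testing; a managed gateway does this for you, and the class and parameter names here are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self,
                 rate_per_sec: float,
                 capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller would keep one bucket per API key or client IP and respond with HTTP 429 when `allow()` returns False. In a multi-instance deployment the counters would need to live in a shared store rather than in process memory.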

💰 Cost Optimization

Reserved Instances: 1- or 3-year commitment for the stable baseline (40-60% savings). Spot Instances: stateless, interruption-tolerant workers (up to 90% savings). S3 Intelligent-Tiering: automatically moves objects between frequent, infrequent, and archive access tiers. Auto-scaling: scale non-prod environments to zero at night. Data transfer: keep chatty traffic within the same AZ where availability requirements allow (inter-AZ transfer is billed at $0.01/GB in each direction). CloudFront: reduce origin traffic, since cached responses are cheaper to serve. Review costs monthly with AWS Cost Explorer.
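To make the savings figures concrete, here is a small back-of-the-envelope calculator in Python. The hourly price, instance counts, and 730-hour month are hypothetical inputs; the discount defaults reuse the figures quoted above (the low end of 40-60% for reserved, 90% for spot), and the inter-AZ charge is assumed to apply to both the sending and receiving side at $0.01/GB each. Actual AWS prices vary by region and change over time.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months


def monthly_compute_cost(on_demand_hourly: float,
                         baseline_instances: int,
                         burst_instances: int,
                         reserved_discount: float = 0.40,
                         spot_discount: float = 0.90) -> float:
    """Baseline on reserved pricing, burst workers on spot pricing."""
    reserved = (baseline_instances * on_demand_hourly
                * (1 - reserved_discount) * HOURS_PER_MONTH)
    spot = (burst_instances * on_demand_hourly
            * (1 - spot_discount) * HOURS_PER_MONTH)
    return reserved + spot


def on_demand_cost(on_demand_hourly: float, instances: int) -> float:
    """Same fleet priced entirely on demand, for comparison."""
    return instances * on_demand_hourly * HOURS_PER_MONTH


def inter_az_transfer_cost(gb: float,
                           price_per_gb_each_direction: float = 0.01) -> float:
    """Cross-AZ traffic, assumed billed on both the sender and receiver."""
    return gb * price_per_gb_each_direction * 2


if __name__ == "__main__":
    # Hypothetical fleet: 10 baseline + 5 burst instances at $0.10/hour.
    mixed = monthly_compute_cost(0.10, 10, 5)
    flat = on_demand_cost(0.10, 15)
    print(f"mixed: ${mixed:.2f}  on-demand: ${flat:.2f}")
    print(f"10 TB cross-AZ: ${inter_az_transfer_cost(10_000):.2f}")
```

The comparison makes the trade-off visible: the commitment only pays off for capacity that really runs all month, which is why the reserved portion is sized to the stable baseline rather than to peak load.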

Scaling considerations

What interviewers expect by level

