🏗️ Design Full Production Stack — System Design Interview Guide
Hard · Fundamentals
Design a complete production-ready infrastructure stack for a large-scale web application, covering all layers: CDN, API gateway, microservices, data stores, observability, and CI/CD.
Functional requirements
- Serve millions of users across multiple regions
- Microservices communicating via REST and gRPC
- Multiple data stores: relational, cache, search, object storage
- CI/CD pipeline: code push to production in < 30 minutes
- Observability: metrics, logs, traces for every service
- Secrets management and configuration management
Non-functional requirements & scale
- Zero-downtime deployments (rolling or blue-green)
- 99.99% availability across all services
- Auto-scaling: respond to traffic changes within 2 minutes
- Security: all traffic encrypted, no secrets in code
- Mean time to recovery (MTTR) < 5 minutes for P0 incidents
- Cost monitoring and budget alerts
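The availability and MTTR targets above can be turned into a concrete downtime budget with simple arithmetic. The following is a minimal sketch, not part of the original guide; the function names and the assumption that a request path fails whenever any one component fails are illustrative.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Allowed downtime in minutes for an availability target over a period."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_days * 24 * 60


def serial_availability(*component_pcts: float) -> float:
    """Availability (percent) of a path that needs every component to be up.

    Assumes independent failures, which is a simplification.
    """
    result = 1.0
    for pct in component_pcts:
        result *= pct / 100.0
    return result * 100.0


if __name__ == "__main__":
    print(f"99.99% per year:    {downtime_budget_minutes(99.99):.2f} min")
    print(f"99.99% per 30 days: {downtime_budget_minutes(99.99, 30):.2f} min")
    print(f"3 x 99.99% in series: {serial_availability(99.99, 99.99, 99.99):.4f}%")
```

At 99.99% the budget is about 52.6 minutes per year, or roughly 4.3 minutes per 30 days, so a single P0 incident that takes the full 5 minutes to recover would already exceed a monthly budget. This is worth raising when discussing how the two targets interact.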
Capacity estimation
This is the "big picture" system design: it covers how all components fit together in a real production environment. The key focus is reliability, operability, security, and cost efficiency.
Core entities
- Service — Docker container + Kubernetes Deployment. Stateless, health-checked, auto-scaled.
- DataStore — Managed service (RDS, ElastiCache, Elasticsearch). Automated backups, multi-AZ.
- Pipeline — GitHub → CI (build/test) → CD (deploy to staging → prod). Canary releases.
API design
- DNS app.example.com → CloudFront → ALB → ECS/EKS — Request routing through the full stack.
- Internal Service Mesh (Istio/Envoy) — mTLS between all services, circuit breaking, observability.
High-level design
Internet → CloudFront (CDN + WAF) → ALB → API Gateway → Microservices on EKS. Data: RDS + ElastiCache + S3 + Elasticsearch. Observability: Datadog/CloudWatch. CI/CD: GitHub Actions → ECR → ArgoCD.
Deep dives
🚀 Zero-Downtime Deployment
Rolling deployment: replace pods one at a time; a readinessProbe ensures each new pod is ready before it receives traffic. Blue-green: run the current production environment (blue) alongside an identical environment with the new version (green), switch traffic at the load balancer, and roll back by switching back. Canary: route 5% of traffic to the new version, monitor the error rate, then gradually increase the share or roll back. Feature flags: deploy code "off" and enable it for a percentage of users without redeploying.
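The canary step can be illustrated as a small decision function. This is a hedged sketch rather than how any particular rollout tool works: the `Window` type, the step schedule after the initial 5%, and the thresholds (`max_ratio`, `min_requests`, `abs_floor`) are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical promotion schedule (percent of traffic sent to the canary).
STEPS = [5, 25, 50, 100]


@dataclass
class Window:
    """Request and error counts observed over one monitoring window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def next_canary_weight(current_weight: int, baseline: Window, canary: Window,
                       max_ratio: float = 2.0, min_requests: int = 500,
                       abs_floor: float = 0.001) -> int:
    """Return the next canary traffic weight; 0 means roll back."""
    # Too little canary traffic to judge: hold at the current weight.
    if canary.requests < min_requests:
        return current_weight
    # Tolerate up to max_ratio times the baseline error rate, with a floor
    # so that a near-zero baseline does not make every error a rollback.
    allowed = max(baseline.error_rate * max_ratio, abs_floor)
    if canary.error_rate > allowed:
        return 0
    # Healthy: promote to the next step in the schedule.
    for step in STEPS:
        if step > current_weight:
            return step
    return 100


if __name__ == "__main__":
    baseline = Window(requests=10_000, errors=10)
    print(next_canary_weight(5, baseline, Window(600, 1)))   # healthy canary
    print(next_canary_weight(5, baseline, Window(600, 6)))   # unhealthy canary
```

Comparing the canary against the live baseline, rather than against a fixed number, keeps the check meaningful when the whole system is having a bad day.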
👁️ Observability Stack
Three pillars: (1) Metrics: Datadog/Prometheus for CPU, latency, error rate, and saturation. Alert on SLOs. (2) Logs: CloudWatch Logs / Elasticsearch. Structured JSON logs with a correlation ID per request. (3) Traces: Jaeger/X-Ray for the end-to-end request path across microservices. The service mesh sidecars can generate trace headers, but each service still has to propagate them on its outbound calls. Dashboard: SLO burn rate, P99 latency heatmap, error budget.
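The "SLO burn rate" on the dashboard is the observed error rate divided by the error budget (1 minus the SLO). The sketch below shows that calculation plus a two-window paging check. The function names are hypothetical, and the 14.4 threshold is a commonly cited starting point for a 30-day SLO window, used here as an assumption rather than a recommendation.

```python
from typing import Tuple

# (errors, requests) observed in a window
Counts = Tuple[int, int]


def burn_rate(errors: int, requests: int, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    slo is a fraction, e.g. 0.9999 for 99.99%. A burn rate of 1.0 uses up
    the error budget exactly over the SLO period.
    """
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo
    return (errors / requests) / error_budget


def should_page(long_window: Counts, short_window: Counts, slo: float,
                threshold: float = 14.4) -> bool:
    """Page only if both a long and a short window burn above the threshold.

    The short window confirms the problem is still happening, which avoids
    paging for an incident that has already recovered.
    """
    return (burn_rate(*long_window, slo) >= threshold
            and burn_rate(*short_window, slo) >= threshold)


if __name__ == "__main__":
    slo = 0.9999
    print(round(burn_rate(10, 100_000, slo), 3))
    print(should_page((200, 100_000), (20, 10_000), slo))
    print(should_page((200, 100_000), (0, 10_000), slo))
```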
🔐 Security Layers
WAF (CloudFront): block SQL injection, XSS, bad bots. DDoS: AWS Shield. API Gateway: JWT auth, rate limiting, request validation. mTLS: service-to-service via Istio (mutual certificate verification). Secrets: AWS Secrets Manager or HashiCorp Vault — never in env variables or code. IAM: least-privilege roles per service. VPC: private subnets for databases, only ALB in public subnet.
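Rate limiting at the API gateway is often explained with a token bucket. The following is a minimal single-process sketch of that technique, not the implementation used by any specific gateway; the class name, parameters, and injectable `clock` are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens/second."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock            # injectable so the bucket can be tested
        self.tokens = float(capacity)  # start full
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request fits in the budget."""
        now = self.clock()
        # Refill for the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=5, capacity=10)
    allowed = sum(bucket.allow() for _ in range(20))
    print(f"{allowed} of 20 back-to-back requests allowed")
```

In a real deployment the bucket state would usually live in a shared store (for example the cache tier) keyed per client or API key, so that every gateway instance enforces the same limit.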
💰 Cost Optimization
Reserved Instances: 1- or 3-year commitment for the stable baseline (40-60% savings). Spot Instances: stateless workers (up to 90% savings). S3 Intelligent-Tiering: automatically moves objects between frequent, infrequent, and archive access tiers. Auto-scaling: scale non-prod environments to zero at night. Data transfer: keep traffic within the same AZ where possible (inter-AZ traffic costs roughly $0.01/GB in each direction on AWS). CloudFront: reduce origin traffic (cached responses are cheaper to serve). Review costs monthly with AWS Cost Explorer.
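A rough back-of-the-envelope model helps show how these levers combine. The sketch below is illustrative only: the hourly price, instance counts, and discount percentages are hypothetical inputs (the discounts fall inside the ranges quoted above), and real pricing varies by region, instance type, and over time.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly billing estimates


def monthly_compute_cost(on_demand_hourly: float, baseline_instances: int,
                         burst_instance_hours: float,
                         reserved_discount: float = 0.0,
                         spot_discount: float = 0.0) -> float:
    """Baseline instances run all month; burst capacity is billed per instance-hour.

    A discount of 0.40 means paying 60% of the on-demand price.
    """
    baseline = (baseline_instances * HOURS_PER_MONTH * on_demand_hourly
                * (1.0 - reserved_discount))
    burst = burst_instance_hours * on_demand_hourly * (1.0 - spot_discount)
    return baseline + burst


def cross_az_transfer_cost(gb: float, per_gb: float = 0.01) -> float:
    """Cost of moving `gb` across AZs in one direction (assumed flat rate)."""
    return gb * per_gb


if __name__ == "__main__":
    all_on_demand = monthly_compute_cost(0.10, 10, 1000)
    optimized = monthly_compute_cost(0.10, 10, 1000,
                                     reserved_discount=0.40, spot_discount=0.70)
    print(f"on-demand only:  ${all_on_demand:,.2f}")
    print(f"reserved + spot: ${optimized:,.2f}")
    print(f"50 TB cross-AZ:  ${cross_az_transfer_cost(50_000):,.2f}")
```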
Scaling considerations
- Kubernetes HPA (Horizontal Pod Autoscaler) scales pods based on CPU/custom metrics
- Multi-AZ deployment for all stateful resources (RDS, ElastiCache)
- CloudFront can reduce origin load by 80% or more for cacheable content, depending on the cache hit ratio
- Service mesh (Istio): circuit breaking, retry, timeout policies per service
- Chaos Engineering: deliberately inject failures to test resilience (Netflix Chaos Monkey)
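The HPA's core calculation is small enough to sketch. Kubernetes documents the rule as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipping changes when the ratio is within a tolerance (0.1 by default). The function below is a simplified illustration of that rule, not the controller's actual code; it ignores stabilization windows, pods that are not ready, and multiple metrics, and its parameter names are assumptions.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Simplified HPA scaling rule, clamped to [min_replicas, max_replicas]."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone
    # to avoid flapping on small fluctuations.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target
    print(desired_replicas(4, current_metric=90, target_metric=60, max_replicas=20))
    # 10 pods averaging 30% CPU against a 60% target
    print(desired_replicas(10, current_metric=30, target_metric=60, min_replicas=2))
```

Because the rule is proportional, the time to absorb a spike depends on metric scrape intervals and pod startup time as much as on the formula itself, which is what the "within 2 minutes" target has to account for.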
What interviewers expect by level
- Junior: Describe 3-tier web architecture (CDN, servers, DB). Know basic CI/CD flow.
- Mid: Kubernetes deployment, managed services (RDS vs self-managed), basic observability, zero-downtime deployment.
- Senior: Service mesh, canary deployments, SLO/error-budget management, multi-region active-active, security layers.
- Staff: FinOps cost optimization, organization-wide platform engineering, multi-cloud strategy, compliance (SOC2/ISO27001).