📡 Design Monitoring / Metrics (Datadog) — System Design Interview Guide
Hard · Observability & Infrastructure
Design a metrics monitoring and alerting system like Datadog or Prometheus that collects time-series metrics from thousands of services, stores them efficiently, and triggers alerts on threshold violations.
Functional requirements
- Collect metrics from hosts/services (CPU, memory, latency, errors)
- Store time-series data with millisecond-precision timestamps
- Query metrics with aggregations (avg, p99, sum) over time ranges
- Create alert rules: notify if metric > threshold for N minutes
- Dashboard builder: visualize metrics as line/bar charts
- Support custom metrics from applications
Non-functional requirements & scale
- Ingest 10M metrics data points per second
- Query last 24h of any metric in < 3 seconds
- Retention: 15s resolution for 15 days; 5min resolution for 2 years
- Alert latency: violation → notification < 60 seconds
- Storage efficiency: compress metrics 10× vs raw
- 99.9% query availability
Capacity estimation
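10M data points/sec is a write-heavy workload: at 16 bytes per raw point that is 160 MB/s, or about 13.8 TB/day before compression. A time-series DB built for this write pattern (Prometheus, InfluxDB, VictoriaMetrics) is the natural fit. Long-term retention relies on downsampling to a coarser resolution. Alert evaluation: run each rule's query every 60s and compare the result against the threshold.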
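The arithmetic behind these numbers can be sketched in a few lines. This is a back-of-the-envelope calculation, not a sizing tool: the 16-byte point size is the figure used in the compression deep dive, and the function and constant names are illustrative.

```python
POINTS_PER_SEC = 10_000_000   # stated ingest requirement
BYTES_PER_POINT = 16          # assumption: 8-byte timestamp + 8-byte float value
COMPRESSION_RATIO = 10        # stated storage-efficiency target
RAW_RETENTION_DAYS = 15       # window kept at 15s resolution
SECONDS_PER_DAY = 86_400


def capacity_estimate():
    """Return rough ingest and storage figures in decimal MB / TB."""
    raw_bytes_per_sec = POINTS_PER_SEC * BYTES_PER_POINT
    raw_bytes_per_day = raw_bytes_per_sec * SECONDS_PER_DAY
    compressed_bytes_per_day = raw_bytes_per_day / COMPRESSION_RATIO
    hot_bytes = compressed_bytes_per_day * RAW_RETENTION_DAYS
    return {
        "raw_mb_per_sec": raw_bytes_per_sec / 1e6,
        "raw_tb_per_day": raw_bytes_per_day / 1e12,
        "compressed_tb_per_day": compressed_bytes_per_day / 1e12,
        "hot_tb_15_days": hot_bytes / 1e12,
    }


if __name__ == "__main__":
    for name, value in capacity_estimate().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the full-resolution window needs roughly 21 TB after compression, before replication and index overhead.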
Core entities
- Metric — name (e.g. cpu.usage), tags {host, service, env}, datapoints[(timestamp, value)]
- AlertRule — ruleId, metricQuery, condition, threshold, duration, channels[], status
- AlertEvent — alertId, ruleId, triggeredAt, resolvedAt?, value, message
- Dashboard — dashboardId, title, panels[{query, vizType, title}]
API design
- POST /api/v1/series — Ingest metrics. Body: [{ metric, tags, points[[ts, value]] }]. Batch, compressed.
- GET /api/v1/query?q=avg:cpu.usage{host:web-01}&from=-1h — Query time-series data with aggregation.
- POST /api/v1/alerts — Create alert rule. Body: { metric, condition, threshold, channels[] }.
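As an illustration of the ingest body, here is a minimal sketch of how an agent might build one batched, compressed payload. The guide only says "compressed", so gzip and JSON are assumptions, and both helper names are hypothetical.

```python
import gzip
import json


def build_series_payload(metric, tags, points):
    """Encode one series as the ingest body: [{metric, tags, points[[ts, value]]}].

    Assumption: JSON body compressed with gzip.
    """
    body = [{
        "metric": metric,
        "tags": tags,
        "points": [[ts, value] for ts, value in points],
    }]
    raw = json.dumps(body, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def decode_series_payload(blob):
    """Inverse of build_series_payload, as the Ingest Service might run it."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    blob = build_series_payload(
        "cpu.usage",
        {"host": "web-01", "env": "prod"},
        [(1700000000, 41.5), (1700000015, 42.0)],
    )
    print(len(blob), decode_series_payload(blob)[0]["metric"])
```

A real agent would batch many series per request so that the compression and HTTP overhead are amortized.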
High-level design
Agents on hosts push metrics to Ingest Service → Kafka → Storage Workers write to time-series DB. Query API reads from TSDB with downsampling. Alert Evaluator runs queries every 60s → triggers notifications via Kafka.
Deep dives
🗜️ Time-Series Compression
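Raw: 10M points/sec × 16 bytes = 160 MB/s, roughly 13.8 TB/day, which is too much to store uncompressed. Gorilla compression (Facebook) treats the two fields separately. Timestamps: delta-of-delta encoding (scrape intervals are regular, so the change between consecutive deltas is usually zero and costs a single bit). Values: XOR with the previous value (consecutive readings are similar, so most bits cancel and only the short non-zero window is stored). Result: about 1.37 bytes per 16-byte point on average in Facebook's reported workload, roughly a 12× reduction, which meets the 10× target. TSDBs such as Prometheus and InfluxDB ship Gorilla-style encodings natively. Long-term: downsample to 5-min averages after 15 days.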
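The two transforms can be sketched as follows. This shows only the delta-of-delta and XOR steps. It is not the Gorilla bit-packing format, which additionally writes these residuals with variable-length bit codes. Function names are illustrative.

```python
import struct


def delta_of_deltas(timestamps):
    """Return [first_ts, first_delta, dod, dod, ...] for a list of integer timestamps."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)  # 0 whenever the interval is unchanged
        prev_delta = delta
    return out


def float_bits(x):
    """Reinterpret a float as its 64-bit IEEE 754 integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stream(values):
    """Return [first_bits, xor, xor, ...]; identical neighbours XOR to 0."""
    if not values:
        return []
    bits = [float_bits(v) for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]


if __name__ == "__main__":
    print(delta_of_deltas([1000, 1015, 1030, 1045, 1061]))
    print(xor_stream([42.5, 42.5, 42.5])[1:])
```

With a steady 15s scrape interval almost every timestamp residual is 0, and a value that does not change XORs to 0. Those are the cases a bit-level encoder can store in one bit each.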
⏰ Alert Evaluation
Alert Evaluator: distributed cron. Each evaluator owns a set of alert rules. Every 60s: run metric query for each rule → compare against threshold. For duration-based alerts ("CPU > 90% for 5 min"): track consecutive violations in Redis. Alert fires when violation count × interval >= duration. Re-evaluate after resolution to send "recovered" notification.
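A minimal sketch of the duration check described above. The per-rule state lives in a plain Python object here. In the design it would sit in Redis, keyed by rule ID. The class and method names are hypothetical.

```python
class DurationAlert:
    """Tracks consecutive violations for one rule, e.g. "CPU > 90% for 5 min"."""

    def __init__(self, threshold, duration_s, interval_s=60):
        self.threshold = threshold
        self.duration_s = duration_s
        self.interval_s = interval_s   # evaluation period, 60s in this design
        self.violations = 0            # consecutive violating evaluations
        self.firing = False

    def evaluate(self, value):
        """Feed one evaluation result; return "fired", "recovered", or None."""
        if value > self.threshold:
            self.violations += 1
            reached = self.violations * self.interval_s >= self.duration_s
            if reached and not self.firing:
                self.firing = True
                return "fired"
            return None
        # Any non-violating sample resets the streak.
        self.violations = 0
        if self.firing:
            self.firing = False
            return "recovered"
        return None


if __name__ == "__main__":
    rule = DurationAlert(threshold=90, duration_s=300)
    for cpu in [95, 95, 95, 95, 95, 95, 50, 50]:
        print(cpu, rule.evaluate(cpu))
```

Returning an event only on a state transition keeps the notifier from sending a message on every evaluation while the alert stays in the firing state.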
📈 Query Performance
Queries like "avg CPU across 1000 hosts over last 24h": naive = read 1000 time series × 5,760 points each (24h at 15s resolution) = 5.76M points; at 1s resolution it would be 86,400 points per series, or 86.4M. Optimization: pre-aggregate by tag (store host-group averages). Rollup tables: for queries spanning > 1h, use 5-min aggregates instead of 15s raw, which cuts 24h to 288 points per series. Column-oriented storage: reading all values for one metric is sequential I/O (fast).
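The rollup step can be sketched as a bucketed average. This is an in-memory illustration with hypothetical function names. A real TSDB would run it as a background job and usually keep min, max, sum, and count per bucket, because percentiles such as p99 cannot be recovered from stored averages.

```python
from collections import defaultdict


def points_per_series(range_s, resolution_s):
    """How many points one series contributes to a query over range_s seconds."""
    return range_s // resolution_s


def rollup(points, bucket_s=300):
    """Average (timestamp_s, value) pairs into fixed buckets of bucket_s seconds.

    Returns a list of (bucket_start, average) sorted by bucket_start.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in points:
        bucket = ts - (ts % bucket_s)   # align to the start of the bucket
        sums[bucket] += value
        counts[bucket] += 1
    return [(b, sums[b] / counts[b]) for b in sorted(sums)]


if __name__ == "__main__":
    day = 24 * 3600
    print(points_per_series(day, 15), points_per_series(day, 300))
    print(rollup([(0, 10.0), (15, 20.0), (299, 30.0), (300, 40.0)]))
```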
🔖 Tags & Cardinality
Each metric has tags: {host, service, region, env}. Cardinality = number of unique tag-value combinations. 1000 hosts × 100 services × 3 regions × 2 envs = up to 600,000 unique time series per metric name. High cardinality = more storage + slower queries. Problem: dynamic tags like request_id or user_id have no upper bound and push this into millions of series. Rule: tags should have bounded cardinality. Warn users adding high-cardinality tags.
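A short sketch of the cardinality arithmetic and of a guard that flags suspicious tags. The 10,000-value limit is an arbitrary illustrative threshold, not a figure from any particular product, and the function names are hypothetical.

```python
from math import prod


def series_upper_bound(tag_cardinalities):
    """Worst-case number of series for one metric: product of distinct values per tag."""
    return prod(tag_cardinalities.values())


def unbounded_tags(tag_cardinalities, limit=10_000):
    """Return tag keys whose distinct-value count exceeds limit (assumed threshold)."""
    return sorted(k for k, n in tag_cardinalities.items() if n > limit)


if __name__ == "__main__":
    tags = {"host": 1000, "service": 100, "region": 3, "env": 2}
    print(series_upper_bound(tags))
    tags["user_id"] = 5_000_000
    print(unbounded_tags(tags))
```

The product is an upper bound. In practice a host runs only a few services, so the observed series count is usually far lower, which is why systems tend to measure actual cardinality at ingest rather than estimate it.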
Scaling considerations
- Kafka absorbs write bursts before storage (acts as buffer)
- VictoriaMetrics: a single node is commonly cited as ingesting on the order of 1M data points/sec; use the cluster version beyond that
- Partition TSDB by time range — hot data on SSD, cold on HDD/S3
- Query caching: cache aggregated results in Redis (TTL = resolution)
- Federation: per-datacenter Prometheus, global aggregator for cross-DC
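Time-range partitioning combines naturally with the retention policy. A minimal sketch of how a query planner might pick a tier and resolution from the age of the requested data follows. Mapping the 15-day window to SSD and the downsampled data to cheaper storage is an assumption for illustration, as are the tier labels and the 730-day reading of "2 years".

```python
RAW_WINDOW_DAYS = 15        # 15s resolution kept for 15 days
ROLLUP_WINDOW_DAYS = 730    # assumption: "2 years" taken as 730 days


def pick_tier(age_days):
    """Return (tier, resolution_s) for data of the given age, or None if expired."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= RAW_WINDOW_DAYS:
        return ("ssd", 15)       # hot partition, full resolution
    if age_days <= ROLLUP_WINDOW_DAYS:
        return ("cold", 300)     # HDD or object storage, 5-min rollups
    return None                  # past retention


if __name__ == "__main__":
    for age in (1, 100, 1000):
        print(age, pick_tier(age))
```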
What interviewers expect by level
- Junior: Describe metrics collection from agents, storage, and basic alerting. Know what a time-series DB is.
- Mid: Kafka buffer for ingestion, Gorilla compression, alert evaluation cron, rollup tables for query performance.
- Senior: TSDB internals, cardinality problem, distributed alert evaluation, multi-tenant isolation.
- Staff: Global scale (Datadog: 20T metrics/day), cost optimization with intelligent tiering, SLO/SLA monitoring, AIOps anomaly detection.