📡 Design Monitoring / Metrics (Datadog) — System Design Interview Guide

Hard · Observability & Infrastructure

Design a metrics monitoring and alerting system like Datadog or Prometheus that collects time-series metrics from thousands of services, stores them efficiently, and triggers alerts on threshold violations.

Open the interactive Monitoring / Metrics (Datadog) design on PrepGrind → Drag load balancers, caches, databases, and queues onto a canvas, run a live traffic simulation to watch latency and bottlenecks under load, and follow the full interview walkthrough below — free, in your browser.

Functional requirements

Non-functional requirements & scale

Capacity estimation

10M data points/sec = massive write throughput. Time-series DB optimized for this (Prometheus, InfluxDB, VictoriaMetrics). Resolution downsampling for long-term retention. Alert evaluation: run query every 60s, compare against threshold.

Core entities

API design

High-level design

Agents on hosts push metrics to Ingest Service → Kafka → Storage Workers write to time-series DB. Query API reads from TSDB with downsampling. Alert Evaluator runs queries every 60s → triggers notifications via Kafka.

Deep dives

🗜️ Time-Series Compression

Raw: 10M points/sec × 16 bytes = 160MB/s. Too much to store. Gorilla compression (Facebook): XOR delta-of-deltas for timestamps (timestamps are regular → small deltas). XOR for values (consecutive readings similar → many bits cancel). Result: 1.37 bits/value avg → 10× compression. TSDB like InfluxDB/Prometheus implements this natively. Long-term: downsample to 5-min average after 15 days.

⏰ Alert Evaluation

Alert Evaluator: distributed cron. Each evaluator owns a set of alert rules. Every 60s: run metric query for each rule → compare against threshold. For duration-based alerts ("CPU > 90% for 5 min"): track consecutive violations in Redis. Alert fires when violation count × interval >= duration. Re-evaluate after resolution to send "recovered" notification.

📈 Query Performance

Queries like "avg CPU across 1000 hosts over last 24h": naive = read 1000 time series × 86,400 points = 86M points. Optimization: pre-aggregate by tag (store host-group averages). Rollup tables: for queries spanning > 1h, use 5-min aggregates instead of 15s raw. Column-oriented storage: reading all values for one metric is sequential I/O (fast).

🔖 Tags & Cardinality

Each metric has tags: {host, service, region, env}. Cardinality = unique combinations. 1000 hosts × 100 services × 3 regions × 2 envs = 600,000 unique time series. High cardinality = more storage + slower queries. Problem: dynamic tags like request_id, user_id → millions of series. Rule: tags should have bounded cardinality. Warn users adding high-cardinality tags.

Scaling considerations

What interviewers expect by level

Practice more system design case studies

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…