📡 Design Monitoring / Metrics (Datadog) — System Design Interview Guide

Hard · Observability & Infrastructure

Design a metrics monitoring and alerting system like Datadog or Prometheus that collects time-series metrics from thousands of services, stores them efficiently, and triggers alerts on threshold violations.

Functional requirements

Collect metrics from hosts/services (CPU, memory, latency, errors)
Store time-series data with millisecond-precision timestamps
Query metrics with aggregations (avg, p99, sum) over time ranges
Create alert rules: notify if metric > threshold for N minutes
Dashboard builder: visualize metrics as line/bar charts
Support custom metrics from applications

Non-functional requirements & scale

Ingest 10M metrics data points per second
Query last 24h of any metric in < 3 seconds
Retention: 15s resolution for 15 days; 5min resolution for 2 years
Alert latency: violation → notification < 60 seconds
Storage efficiency: compress metrics 10× vs raw
99.9% query availability

Capacity estimation

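At 10M data points/sec and roughly 16 bytes per raw point (8-byte timestamp plus 8-byte value), ingest is about 160 MB/s before compression, so write throughput dominates the design. Use a time-series database built for this write pattern (e.g. Prometheus, InfluxDB, VictoriaMetrics), and downsample older data for long-term retention. Alert evaluation is comparatively light: run each rule's query every 60s and compare the result against its threshold.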
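The arithmetic behind these numbers can be worked through in a few lines. This is a back-of-the-envelope sketch: the 16-byte point size is an assumption (8-byte timestamp plus 8-byte float), and the function name is only illustrative.

```python
POINTS_PER_SEC = 10_000_000   # from the non-functional requirements
BYTES_PER_POINT = 16          # assumption: 8-byte timestamp + 8-byte float value
COMPRESSION_RATIO = 10        # storage-efficiency target from the requirements
SECONDS_PER_DAY = 86_400
HOT_RETENTION_DAYS = 15       # full-resolution retention window


def capacity_estimate():
    """Return rough ingest and storage figures (decimal units)."""
    raw_bytes_per_sec = POINTS_PER_SEC * BYTES_PER_POINT
    raw_tb_per_day = raw_bytes_per_sec * SECONDS_PER_DAY / 1e12
    compressed_tb_per_day = raw_tb_per_day / COMPRESSION_RATIO
    return {
        "raw_mb_per_sec": raw_bytes_per_sec / 1e6,
        "raw_tb_per_day": raw_tb_per_day,
        "compressed_tb_per_day": compressed_tb_per_day,
        "hot_tier_tb": compressed_tb_per_day * HOT_RETENTION_DAYS,
    }


if __name__ == "__main__":
    for name, value in capacity_estimate().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the raw stream is about 13.8 TB/day, and the 15-day full-resolution tier is roughly 21 TB after 10× compression, before replication.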

Core entities

Metric — name (e.g. cpu.usage), tags {host, service, env}, datapoints[(timestamp, value)]
AlertRule — ruleId, metricQuery, condition, threshold, duration, channels[], status
AlertEvent — alertId, ruleId, triggeredAt, resolvedAt?, value, message
Dashboard — dashboardId, title, panels[{query, vizType, title}]
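One way to picture these entities is as plain data classes. The sketch below is illustrative only: field types, the `duration_s` unit, the default `status` value, and the `series_key` helper are assumptions, not part of the design above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Metric:
    name: str                                   # e.g. "cpu.usage"
    tags: dict[str, str]                        # e.g. {"host": "web-01", "env": "prod"}
    datapoints: list[tuple[int, float]] = field(default_factory=list)

    def series_key(self) -> str:
        """Hypothetical helper: name plus sorted tags identifies one time series."""
        tag_part = ",".join(f"{k}:{v}" for k, v in sorted(self.tags.items()))
        return f"{self.name}{{{tag_part}}}"


@dataclass
class AlertRule:
    rule_id: str
    metric_query: str
    condition: str                              # e.g. ">"
    threshold: float
    duration_s: int                             # assumption: duration stored in seconds
    channels: list[str] = field(default_factory=list)
    status: str = "ok"                          # assumption: "ok" or "firing"


@dataclass
class AlertEvent:
    alert_id: str
    rule_id: str
    triggered_at: int
    value: float
    message: str
    resolved_at: Optional[int] = None           # unset while the alert is still firing
```

Sorting the tags before building the key means the same tag set always maps to the same series, regardless of the order in which an agent reports them.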

API design

POST /api/v1/series — Ingest metrics. Body: [{ metric, tags, points[[ts, value]] }]. Batch, compressed.
GET /api/v1/query?q=avg:cpu.usage{host:web-01}&from=-1h — Query time-series data with aggregation.
POST /api/v1/alerts — Create alert rule. Body: { metricQuery, condition, threshold, duration, channels[] }.
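To make the request shapes concrete, here is a small sketch that builds a compressed ingest body and parses the query string used in the example above. The choice of gzip, the regular expression, and both function names are assumptions made for illustration; the source only says the batch is compressed.

```python
import gzip
import json
import re

# Assumed grammar for the example query: <aggregation>:<metric>{tag:value,...}
QUERY_RE = re.compile(r"^(?P<agg>\w+):(?P<metric>[\w.]+)\{(?P<tags>[^}]*)\}$")


def build_series_payload(metric, tags, points):
    """Serialize one series as the JSON body for POST /api/v1/series, gzip-compressed."""
    body = [{"metric": metric, "tags": tags, "points": [[ts, v] for ts, v in points]}]
    return gzip.compress(json.dumps(body).encode("utf-8"))


def parse_query(q):
    """Split a query like 'avg:cpu.usage{host:web-01}' into (aggregation, metric, tags)."""
    match = QUERY_RE.match(q)
    if match is None:
        raise ValueError(f"unsupported query: {q!r}")
    tags = dict(pair.split(":", 1) for pair in match["tags"].split(",") if pair)
    return match["agg"], match["metric"], tags
```

A client would send the bytes from `build_series_payload` with a `Content-Encoding: gzip` header; the server side would run something like `parse_query` before planning the read against the TSDB.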

High-level design

Agents on hosts push metrics to Ingest Service → Kafka → Storage Workers write to time-series DB. Query API reads from TSDB with downsampling. Alert Evaluator runs queries every 60s → triggers notifications via Kafka.
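A detail worth sketching is how the Ingest Service might key messages so that every point of a given series lands on the same Kafka partition, and therefore on the same Storage Worker. The code below is an in-process stand-in under stated assumptions: CRC32 is used only because it is stable and in the standard library (a real Kafka client applies its own partitioner), and both function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(series_key: str, num_partitions: int) -> int:
    """Map a series key to a partition with a stable hash (stand-in for the client's partitioner)."""
    return zlib.crc32(series_key.encode("utf-8")) % num_partitions


def route_batch(points, num_partitions):
    """Group (series_key, timestamp, value) tuples by target partition."""
    partitions = defaultdict(list)
    for series_key, ts, value in points:
        partitions[partition_for(series_key, num_partitions)].append((series_key, ts, value))
    return dict(partitions)
```

Keeping a series on one partition preserves per-series ordering, which helps the storage layer append points in timestamp order.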

Deep dives

🗜️ Time-Series Compression

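Raw: 10M points/sec × 16 bytes = 160 MB/s, or about 13.8 TB/day, which is too expensive to store uncompressed. Gorilla compression (Facebook) uses two tricks. Timestamps: delta-of-delta encoding (points arrive at a regular interval, so the difference between consecutive deltas is usually 0 and costs a single bit). Values: XOR each value with the previous one (consecutive readings are similar, so most bits cancel and only the short run of differing bits is stored). Result reported in the Gorilla paper: about 1.37 bytes per point on average, down from 16 bytes, roughly 12× compression, which meets the 10× target. TSDBs such as Prometheus and InfluxDB use Gorilla-style encodings natively. Long-term: downsample to 5-min averages after 15 days.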
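The two transforms at the heart of this scheme are easy to see in isolation. The sketch below shows only the arithmetic, delta-of-delta on timestamps and XOR on the 64-bit float representation; the variable-length bit packing that actually produces the space savings is deliberately omitted, and the function names are illustrative.

```python
import struct


def delta_of_deltas(timestamps):
    """Return [first_ts, first_delta, dod, dod, ...] for a list of integer timestamps."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)   # 0 whenever the scrape interval is steady
        prev_delta = delta
    return out


def float_bits(x: float) -> int:
    """Reinterpret a float64 as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_values(values):
    """Return [first_bits, xor, xor, ...]; identical neighbours XOR to 0."""
    bits = [float_bits(v) for v in values]
    return bits[:1] + [a ^ b for a, b in zip(bits, bits[1:])]
```

For timestamps `[1000, 1015, 1030, 1045, 1061]` the transform yields `[1000, 15, 0, 0, 1]`, and a run of identical values XORs to zeros; an encoder can then spend a single bit on each zero.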

⏰ Alert Evaluation

Alert Evaluator: a distributed cron. Rules are sharded so that each evaluator owns a set of alert rules. Every 60s, it runs the metric query for each rule and compares the result against the threshold. For duration-based alerts ("CPU > 90% for 5 min"), track consecutive violations in Redis; the alert fires when violation count × interval >= duration. Keep evaluating a firing rule: once the metric drops back below the threshold, mark the alert resolved and send a "recovered" notification.
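The duration logic is a small state machine per rule. In the sketch below an in-memory object stands in for the per-rule state that would live in Redis; `RuleState`, `evaluate`, and the returned event names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class RuleState:
    consecutive_violations: int = 0
    firing: bool = False


def evaluate(state, value, threshold, duration_s, interval_s=60):
    """Process one evaluation tick; return 'triggered', 'recovered', or None."""
    if value > threshold:
        state.consecutive_violations += 1
        long_enough = state.consecutive_violations * interval_s >= duration_s
        if long_enough and not state.firing:
            state.firing = True
            return "triggered"          # notify once, not on every tick
        return None
    # Back under the threshold: reset the streak and close any open alert.
    state.consecutive_violations = 0
    if state.firing:
        state.firing = False
        return "recovered"
    return None
```

With a 60s interval and a 5-minute duration, the fifth consecutive violating tick returns `"triggered"`; later violating ticks stay silent, and the first healthy tick returns `"recovered"`.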

📈 Query Performance

Queries like "avg CPU across 1000 hosts over last 24h": naive = read 1000 time series × 5,760 points each (24h at 15s resolution) = 5.76M points; at 1s resolution it would be 1000 × 86,400 = 86.4M points. Optimization: pre-aggregate by tag (store host-group averages). Rollup tables: for queries spanning > 1h, use 5-min aggregates instead of 15s raw. Column-oriented storage: reading all values for one metric is sequential I/O (fast).
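A rollup job of this kind can be sketched as bucketing raw points and averaging each bucket. The function name and the choice to align buckets to multiples of the bucket width are assumptions; a production rollup would typically also keep sum and count (and min/max) so that averages can be re-aggregated correctly across series.

```python
from collections import defaultdict


def rollup_avg(points, bucket_s=300):
    """Downsample (timestamp_s, value) points to one average per bucket.

    Buckets are aligned to multiples of bucket_s; output is sorted by bucket start.
    """
    acc = defaultdict(lambda: [0.0, 0])        # bucket start -> [sum, count]
    for ts, value in points:
        bucket = ts - ts % bucket_s
        acc[bucket][0] += value
        acc[bucket][1] += 1
    return [(bucket, total / count) for bucket, (total, count) in sorted(acc.items())]
```

At 15s resolution one day is 5,760 raw points per series; 5-minute buckets reduce that to 288, a 20× cut in points read per series.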

🔖 Tags & Cardinality

Each metric has tags: {host, service, region, env}. Cardinality = number of unique tag combinations. 1000 hosts × 100 services × 3 regions × 2 envs = up to 600,000 unique time series per metric name. High cardinality = more storage + slower queries. Problem: unbounded tags like request_id or user_id create millions of series. Rule: tags should have bounded cardinality. Warn users (or reject the write) when they add high-cardinality tags.
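Both the worst-case arithmetic and the warning can be sketched briefly. The `CardinalityGuard` class, its per-tag limit, and the in-memory sets are assumptions for illustration; at real scale the distinct-value tracking would more likely use an approximate counter than exact sets.

```python
from collections import defaultdict
from math import prod


def worst_case_series(tag_value_counts):
    """Upper bound on series per metric: product of distinct values per tag key."""
    return prod(tag_value_counts.values())


class CardinalityGuard:
    """Track distinct values per tag key and flag keys that exceed a limit."""

    def __init__(self, max_values_per_tag=1000):
        self.max_values_per_tag = max_values_per_tag
        self.seen = defaultdict(set)

    def observe(self, tags):
        """Record one point's tags; return the tag keys now over the limit."""
        offenders = []
        for key, value in tags.items():
            self.seen[key].add(value)
            if len(self.seen[key]) > self.max_values_per_tag:
                offenders.append(key)
        return offenders
```

For example, `worst_case_series({"host": 1000, "service": 100, "region": 3, "env": 2})` gives 600,000, and a tag such as `request_id` trips the guard as soon as its distinct values pass the configured limit.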

Scaling considerations

Kafka absorbs write bursts before storage (acts as buffer)
VictoriaMetrics: a single node is reported to handle on the order of 1M data points/sec; use the cluster version beyond that
Partition TSDB by time range — hot data on SSD, cold on HDD/S3
Query caching: cache aggregated results in Redis (TTL = resolution)
Federation: per-datacenter Prometheus, global aggregator for cross-DC
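The query-caching item can be illustrated with a tiny TTL cache in which each entry lives for one resolution step, since a result cannot change until the next data point arrives. A plain dict stands in for Redis here, and the class name, method name, and injectable clock are assumptions made so the sketch is self-contained and testable.

```python
import time


class QueryCache:
    """Cache aggregated query results with TTL equal to the query's resolution."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}      # (query, resolution_s) -> (expires_at, result)

    def get_or_compute(self, query, resolution_s, compute):
        """Return a fresh cached result, or call compute() and cache it."""
        now = self._clock()
        key = (query, resolution_s)
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        result = compute()
        self._entries[key] = (now + resolution_s, result)
        return result
```

A dashboard that refreshes every few seconds against a 15s-resolution query would then hit the TSDB at most once per 15 seconds per distinct query.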

What interviewers expect by level

Junior: Describe metrics collection from agents, storage, and basic alerting. Know what a time-series DB is.
Mid: Kafka buffer for ingestion, Gorilla compression, alert evaluation cron, rollup tables for query performance.
Senior: TSDB internals, cardinality problem, distributed alert evaluation, multi-tenant isolation.
Staff: Global scale (Datadog: 20T metrics/day), cost optimization with intelligent tiering, SLO/SLA monitoring, AIOps anomaly detection.
