📊 Design Ad Click Aggregation — System Design Interview Guide

Hard · Data Pipelines & Analytics

Design a real-time ad click aggregation system that counts ad clicks per minute/hour, detects fraud, and provides query capabilities for billing and analytics at scale.

Open the interactive Ad Click Aggregation design on PrepGrind → Drag load balancers, caches, databases, and queues onto a canvas, run a live traffic simulation to watch latency and bottlenecks under load, and follow the full interview walkthrough below — free, in your browser.

Functional requirements

Record every ad click (adId, userId, timestamp, source)
Aggregate click counts per adId per minute and per hour
Query: "how many clicks did ad X get in the last N minutes?"
Detect fraudulent clicks (same user, same ad, < 1 min apart)
Generate billing report: total clicks per advertiser per day
Support filtering by device type, country, ad campaign

Non-functional requirements & scale

10B ad clicks per day (~115K clicks/sec at peak)
Aggregation latency: counts available within 1 minute of click
Query latency for last 1h aggregation < 1 second
Exactly-once counting — no double-billing
Data retention: raw clicks 7 days; aggregates 1 year
System must handle late-arriving events (up to 5 min late)

Capacity estimation

115K clicks/sec = high write throughput. Raw click events must be ingested and aggregated. Two paths: hot path (real-time, Kafka + Flink) for near-real-time aggregates; cold path (batch, S3 + Spark) for accurate billing. Reconcile hot and cold path results for billing.

Core entities

ClickEvent — clickId, adId, userId, campaignId, deviceType, country, timestamp, ipHash
ClickAggregate — adId, windowStart, windowType (minute|hour|day), clickCount, uniqueUsers
FraudFlag — clickId, reason, detectedAt

API design

POST /api/v1/clicks — Record click. Body: { adId, metadata }. Fire-and-forget to Kafka.
GET /api/v1/ads/:adId/clicks?start=&end=&granularity=minute — Query aggregated click counts with time range.
GET /api/v1/campaigns/:id/report — Billing report for a campaign.

High-level design

Click → Kafka → Hot Path (Flink windowed aggregation → Redis/OLAP) + Cold Path (S3 → Spark batch → Data Warehouse). Query API reads from OLAP DB. Fraud detection as separate Flink job.

Deep dives

🌊 Flink Windowed Aggregation

Tumbling window: fixed 1-minute non-overlapping buckets. Sliding window: "last 5 minutes" updated every 30 seconds. Session window: group clicks until 30s of inactivity. Flink counts per (adId, window). Late events: allow 5-min grace period — extend window until watermark passes. Checkpoint state to HDFS/S3 for fault tolerance. Exactly-once via Kafka offset commits + 2-phase commit.

🕵️ Fraud Detection

Real-time: Flink stateful function per (userId, adId). Track last click timestamp per pair. If same user, same ad, < 60s: mark as fraud. IP-based: > 20 clicks/min from same IP = bot. Behavioral: consistent click intervals = scripted bot. Pattern matching via CEP (Complex Event Processing) in Flink. Fraudulent clicks counted but flagged — excluded from billing.

🔄 Lambda Architecture

Hot path: Kafka → Flink → ClickHouse. Available in ~30 seconds. May have small inaccuracies (late events, processing failures). Cold path: S3 raw data → Spark batch job (nightly) → Data Warehouse. 100% accurate. Reconciliation: at billing time, use cold path numbers (authoritative). Hot path for real-time dashboard only. Kappa architecture: replace batch with longer-retention stream — simpler.

💰 Billing Accuracy

Billing requires exactly-once counting. Approach: process raw events in Spark with deduplication (GROUP BY clickId). Any duplicate clickId → count once. Reconcile: billing job reads from S3 (raw, immutable), not from stream aggregates. Idempotent billing run: re-running on same day produces same number. Fraud exclusion: join with fraud flags before counting.

Scaling considerations

Kafka with 100 partitions for click topic — parallel ingestion at 115K/sec
ClickHouse columnar storage for OLAP queries on 10B rows (< 1s aggregation)
Flink auto-scales with Kafka partition count
S3 Parquet files for cold storage — efficient Spark processing
Click dedup: Redis Set with TTL 60s per (userId, adId) for hot path fraud check

What interviewers expect by level

Junior: Describe click event ingestion, Kafka, basic aggregation. Understand why OLAP DB is better than MySQL for analytics.
Mid: Flink windowed aggregation, late event handling, fraud detection, Lambda vs Kappa architecture.
Senior: Exactly-once semantics, reconciliation between hot/cold paths, ClickHouse schema design, billing accuracy.
Staff: Multi-region click collection, privacy (GDPR cookie consent), cost at 10B events/day, A/B test attribution.

Practice more system design case studies

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…