🕷️ Design Web Crawler — System Design Interview Guide

Medium · Data Ingestion

Design a distributed web crawler like Googlebot that systematically discovers and downloads web pages at scale, for purposes like search indexing, archiving, or data mining.

Open the interactive Web Crawler design on PrepGrind → Drag load balancers, caches, databases, and queues onto a canvas, run a live traffic simulation to watch latency and bottlenecks under load, and follow the full interview walkthrough below — free, in your browser.

Functional requirements

Crawl billions of web pages starting from seed URLs
Extract links from crawled pages and enqueue for crawling
Respect robots.txt and crawl-delay directives
De-duplicate: don't crawl the same URL twice
Store page content for indexing or archiving
Handle HTTP redirects and error codes appropriately

Non-functional requirements & scale

Crawl 5B pages per day (~58,000 pages/sec)
URL frontier must hold billions of URLs without OOM
Politeness: no more than 1 request/sec per domain
Fault-tolerant: worker crash doesn't lose crawl progress
Scalable: add workers to increase throughput linearly
Avoid spider traps (infinite URL-generating sites)

Capacity estimation

5B pages/day = 58K/sec. Average page = 100KB HTML. Storage: 5B × 100KB = 500TB/day (HTML alone). Need DNS resolution caching (same domain re-resolved wastes time). Politeness = max 1 req/sec per domain = need per-domain rate limiting.

Core entities

URL — url, normalizedUrl, domain, status, depth, discoveredAt, crawledAt
Page — pageId, url, contentHash, htmlContent (S3 key), httpStatus, crawledAt
Domain — domain, robotsTxt, crawlDelay, lastCrawledAt, isBanned

API design

Internal Scheduler → Worker Queue — URL Scheduler dequeues from frontier and assigns batches to workers.
Internal Worker → Content Store — Worker fetches URL, stores HTML in S3, extracts links, acks queue.
GET /admin/stats — Crawl progress: pages/sec, queue depth, error rates per domain.

High-level design

Seed URLs → URL Frontier (priority queue). Scheduler dequeues, routes to worker by domain hash. Worker fetches page → extracts links → dedup check → push new URLs to frontier → store HTML to S3.

Deep dives

🔄 URL Deduplication

Naive: DB lookup for every URL. At 58K URLs/sec with millions of extracted links, this is too slow. Solution: Bloom filter (probabilistic, no false negatives). 10B URLs × 10 bits/entry = 12GB — fits in RAM. False positive rate ~1% (occasionally skip valid URLs — acceptable). Persistent dedup: store URL hashes in Cassandra for exact check before inserting to frontier.

🤝 Politeness & Robots.txt

Politeness: group URLs by domain. Each domain has its own queue. Crawl at most 1 request/sec per domain (configurable). Robots.txt: fetch and cache on first domain visit. Parse Disallow rules — skip matching URLs. Cache robots.txt in Redis (TTL 24h). Respect Crawl-delay header. Ban domains with no response or suspicious behavior.

🌀 Spider Traps

Infinite URL generators: calendar sites (/cal?date=2024-01-01, /cal?date=2024-01-02...). Solutions: (1) Max crawl depth per domain (e.g., depth 5). (2) URL canonicalization — normalize query params, trailing slashes. (3) Hash URL content — if hash matches already-seen page, skip. (4) Domain URL count limit (e.g., max 1M URLs per domain). (5) Detect pattern (/\d{4}-\d{2}-\d{2}/) and throttle.

📦 URL Frontier Priority

Not all URLs are equal. PageRank-like scoring: pages from high-authority sites (nytimes.com) crawled sooner than unknown blogs. Fresh crawling: pages that change frequently (news sites) need re-crawl every hour; Wikipedia pages every week. Use priority queue with multiple tiers. Scheduler: pop from high-priority tier 80% of time, low-priority 20%.

Scaling considerations

Kafka URL frontier: partitioned by domain hash for politeness grouping
Bloom filter sharded in Redis cluster: each shard handles prefix of URL hash
Worker fleet: auto-scales based on Kafka consumer lag
DNS cache in Redis prevents repeated resolution (huge latency saving)
S3 for HTML: cheap, durable, no single point of failure

What interviewers expect by level

Junior: Describe basic crawl loop: fetch → parse links → enqueue. Know robots.txt purpose.
Mid: Bloom filter dedup, domain-based politeness queues, DNS caching, distributed worker pool.
Senior: Priority-based frontier, spider trap detection, URL normalization, at-least-once crawl semantics.
Staff: Full Google-scale design: 5B pages/day, re-crawl scheduling with freshness signals, link graph for PageRank.

Practice more system design case studies

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…