🕷️ Design Web Crawler — System Design Interview Guide

Medium · Data Ingestion

Design a distributed web crawler like Googlebot that systematically discovers and downloads web pages at scale, for purposes like search indexing, archiving, or data mining.

Open the interactive Web Crawler design on PrepGrind → Drag load balancers, caches, databases, and queues onto a canvas, run a live traffic simulation to watch latency and bottlenecks under load, and follow the full interview walkthrough below — free, in your browser.

Functional requirements

Non-functional requirements & scale

Capacity estimation

5B pages/day = 58K/sec. Average page = 100KB HTML. Storage: 5B × 100KB = 500TB/day (HTML alone). Need DNS resolution caching (same domain re-resolved wastes time). Politeness = max 1 req/sec per domain = need per-domain rate limiting.

Core entities

API design

High-level design

Seed URLs → URL Frontier (priority queue). Scheduler dequeues, routes to worker by domain hash. Worker fetches page → extracts links → dedup check → push new URLs to frontier → store HTML to S3.

Deep dives

🔄 URL Deduplication

Naive: DB lookup for every URL. At 58K URLs/sec with millions of extracted links, this is too slow. Solution: Bloom filter (probabilistic, no false negatives). 10B URLs × 10 bits/entry = 12GB — fits in RAM. False positive rate ~1% (occasionally skip valid URLs — acceptable). Persistent dedup: store URL hashes in Cassandra for exact check before inserting to frontier.

🤝 Politeness & Robots.txt

Politeness: group URLs by domain. Each domain has its own queue. Crawl at most 1 request/sec per domain (configurable). Robots.txt: fetch and cache on first domain visit. Parse Disallow rules — skip matching URLs. Cache robots.txt in Redis (TTL 24h). Respect Crawl-delay header. Ban domains with no response or suspicious behavior.

🌀 Spider Traps

Infinite URL generators: calendar sites (/cal?date=2024-01-01, /cal?date=2024-01-02...). Solutions: (1) Max crawl depth per domain (e.g., depth 5). (2) URL canonicalization — normalize query params, trailing slashes. (3) Hash URL content — if hash matches already-seen page, skip. (4) Domain URL count limit (e.g., max 1M URLs per domain). (5) Detect pattern (/d{4}-d{2}-d{2}/) and throttle.

📦 URL Frontier Priority

Not all URLs are equal. PageRank-like scoring: pages from high-authority sites (nytimes.com) crawled sooner than unknown blogs. Fresh crawling: pages that change frequently (news sites) need re-crawl every hour; Wikipedia pages every week. Use priority queue with multiple tiers. Scheduler: pop from high-priority tier 80% of time, low-priority 20%.

Scaling considerations

What interviewers expect by level

Practice more system design case studies

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…