🕷️ Design Web Crawler — System Design Interview Guide
Medium · Data Ingestion
Design a distributed web crawler like Googlebot that systematically discovers and downloads web pages at scale, for purposes like search indexing, archiving, or data mining.
Functional requirements
- Crawl billions of web pages starting from seed URLs
- Extract links from crawled pages and enqueue for crawling
- Respect robots.txt and crawl-delay directives
- De-duplicate: don't crawl the same URL twice
- Store page content for indexing or archiving
- Handle HTTP redirects and error codes appropriately
Non-functional requirements & scale
- Crawl 5B pages per day (~58,000 pages/sec)
- URL frontier must hold billions of URLs without OOM
- Politeness: no more than 1 request/sec per domain
- Fault-tolerant: worker crash doesn't lose crawl progress
- Scalable: add workers to increase throughput linearly
- Avoid spider traps (infinite URL-generating sites)
Capacity estimation
5B pages/day ≈ 58K pages/sec. At an average of 100KB of HTML per page, storage grows by 5B × 100KB = 500TB/day (HTML alone). DNS resolutions need caching, because re-resolving the same domain on every fetch wastes time. Politeness caps each domain at 1 request/sec, which requires per-domain rate limiting.
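The arithmetic above can be checked with a short script. This is a back-of-the-envelope sketch: the inputs are the figures from this section, decimal units (1 KB = 1,000 bytes) are assumed, and the bandwidth and minimum-domain numbers are derived values that the section does not state itself.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

pages_per_day = 5_000_000_000           # 5B pages/day (from the requirements)
avg_page_bytes = 100 * 1000             # 100 KB average HTML size, decimal units assumed
max_req_per_sec_per_domain = 1          # politeness limit

pages_per_sec = pages_per_day / SECONDS_PER_DAY
storage_tb_per_day = pages_per_day * avg_page_bytes / 1e12

# Derived: sustained download bandwidth for HTML alone.
ingest_gbit_per_sec = pages_per_sec * avg_page_bytes * 8 / 1e9

# Derived: with 1 req/sec per domain, this many distinct domains must be
# in flight at any moment to reach the target throughput.
min_active_domains = math.ceil(pages_per_sec / max_req_per_sec_per_domain)

print(f"{pages_per_sec:,.0f} pages/sec")
print(f"{storage_tb_per_day:,.0f} TB/day")
print(f"{ingest_gbit_per_sec:.1f} Gbit/s")
print(f"{min_active_domains:,} domains crawled concurrently")
```

The last figure is a useful interview point: the politeness limit means throughput comes from crawling many domains in parallel, not from crawling any single domain faster.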
Core entities
- URL — url, normalizedUrl, domain, status, depth, discoveredAt, crawledAt
- Page — pageId, url, contentHash, htmlContent (S3 key), httpStatus, crawledAt
- Domain — domain, robotsTxt, crawlDelay, lastCrawledAt, isBanned
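As a rough sketch, the three entities could be written as Python dataclasses. The field names follow the list above (converted to snake_case); the types, default values, and the example status strings are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class UrlRecord:
    url: str
    normalized_url: str
    domain: str
    status: str                      # assumed values: "pending", "crawled", "failed"
    depth: int                       # link distance from a seed URL
    discovered_at: datetime
    crawled_at: Optional[datetime] = None


@dataclass
class Page:
    page_id: str
    url: str
    content_hash: str                # hash of the HTML body, for content dedup
    html_s3_key: str                 # the HTML lives in S3, only the key is stored
    http_status: int
    crawled_at: datetime


@dataclass
class Domain:
    domain: str
    robots_txt: Optional[str] = None
    crawl_delay: float = 1.0         # seconds between requests (assumed default)
    last_crawled_at: Optional[datetime] = None
    is_banned: bool = False
```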
API design
- Internal Scheduler → Worker Queue: URL Scheduler dequeues from frontier and assigns batches to workers.
- Internal Worker → Content Store: Worker fetches URL, stores HTML in S3, extracts links, acks queue.
- GET /admin/stats: Returns crawl progress (pages/sec, queue depth, error rates per domain).
High-level design
Seed URLs → URL Frontier (priority queue). Scheduler dequeues, routes to worker by domain hash. Worker fetches page → extracts links → dedup check → push new URLs to frontier → store HTML to S3.
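The flow above can be sketched as a single-process loop. This is a toy illustration, not the distributed design: an in-memory deque stands in for the frontier, a set for the dedup check, and a dict for S3. The `fetch` callable is a hypothetical hook (it returns HTML or `None`) so the loop can run without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links against the page they were found on.
    return [urljoin(base_url, href) for href in parser.links]


def crawl(seeds, fetch, max_pages=100):
    frontier = deque(seeds)   # stand-in for the URL frontier
    seen = set(seeds)         # stand-in for the dedup check
    store = {}                # stand-in for the S3 content store
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # fetch failed or URL unknown
            continue
        store[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store
```

Passing a dict's `get` method as `fetch` (for example `crawl(["http://a.test/"], fake_site.get)`) is enough to exercise the loop against a fake site.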
Deep dives
🔄 URL Deduplication
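Naive: DB lookup for every URL. At 58K pages/sec, with each page yielding many extracted links, this is too slow. Solution: Bloom filter (probabilistic, no false negatives). 10B URLs × 10 bits/entry = 100 Gbit ≈ 12.5GB, which fits in RAM. False positive rate ~1% (relying on the filter alone occasionally skips valid URLs, which is acceptable). Persistent dedup: store URL hashes in Cassandra for an exact check before inserting to frontier. A Bloom filter miss always means a new URL, so the exact check is only needed when the filter reports a possible match.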
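A minimal in-memory Bloom filter shows the mechanics. This is a sketch, not a production structure: it uses the standard sizing formulas, and deriving all bit positions from one SHA-256 digest (double hashing) is an implementation choice made here, not something this guide prescribes. For 10B URLs at a 1% false positive rate, the formula gives about 9.6 bits per entry and 7 hash functions, close to the 10 bits/entry estimate above.

```python
import hashlib
import math


def optimal_params(n_items, fp_rate):
    """Return (num_bits, num_hashes) from the standard Bloom filter formulas."""
    num_bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / n_items * math.log(2)))
    return num_bits, num_hashes


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bits, hashes = optimal_params(10_000_000_000, 0.01)
print(f"{bits / 8 / 1e9:.1f} GB, {hashes} hash functions")
```

Usage is `seen = BloomFilter(*optimal_params(expected_urls, 0.01))`, then `if url not in seen: seen.add(url)` before enqueueing.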
🤝 Politeness & Robots.txt
Politeness: group URLs by domain. Each domain has its own queue. Crawl at most 1 request/sec per domain (configurable). Robots.txt: fetch on the first visit to a domain and cache it in Redis (TTL 24h). Parse Disallow rules and skip matching URLs. Respect the Crawl-delay directive in robots.txt (it is a robots.txt field, not an HTTP header). Ban domains that do not respond or that behave suspiciously.
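The rules above can be sketched with Python's standard `urllib.robotparser`. The `PolitenessGate` class, its method names, and the `ExampleBot` user agent are hypothetical; an in-process dict stands in for the Redis cache, and the clock is injectable so the behavior can be tested without sleeping.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    def __init__(self, default_delay=1.0, clock=time.monotonic):
        self.default_delay = default_delay   # 1 request/sec per domain by default
        self.clock = clock
        self.next_allowed = {}               # domain -> earliest time of next fetch
        self.robots = {}                     # domain -> parsed robots.txt (cache stand-in)

    def load_robots(self, domain, robots_txt):
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[domain] = parser

    def delay_for(self, domain, user_agent):
        parser = self.robots.get(domain)
        delay = parser.crawl_delay(user_agent) if parser else None
        if delay is None:
            return self.default_delay
        # Never go faster than our own default, even if the site allows it.
        return max(self.default_delay, float(delay))

    def try_acquire(self, url, user_agent="ExampleBot"):
        """Return "ok", "wait", or "disallowed" for this URL right now."""
        domain = urlparse(url).netloc
        parser = self.robots.get(domain)
        if parser and not parser.can_fetch(user_agent, url):
            return "disallowed"
        now = self.clock()
        if now < self.next_allowed.get(domain, 0.0):
            return "wait"
        self.next_allowed[domain] = now + self.delay_for(domain, user_agent)
        return "ok"
```

A worker would call `try_acquire` before each fetch and requeue the URL on `"wait"`. In the distributed design the `next_allowed` map would have to live with whichever component owns the domain's queue.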
🌀 Spider Traps
Infinite URL generators: calendar sites that link to a new page for every date (/cal?date=2024-01-01, /cal?date=2024-01-02, ...). Solutions: (1) Max crawl depth per domain (e.g., depth 5). (2) URL canonicalization: normalize query params and trailing slashes. (3) Hash page content: if the hash matches an already-seen page, skip. (4) Domain URL count limit (e.g., max 1M URLs per domain). (5) Detect date-like patterns (/\d{4}-\d{2}-\d{2}/) and throttle.
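Defenses (1), (2), (4), and (5) can be sketched in a few lines. Everything named here is an assumption for illustration: the `TRACKING_PARAMS` set, the `TrapGuard` class, and the choice to implement "throttle" as a hard per-domain cap on date-patterned URLs. Stripping trailing slashes is also a simplification, since some servers treat `/a` and `/a/` as different resources. Content hashing (3) happens after fetch and is not shown.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}  # assumed list
DEFAULT_PORTS = {"http": 80, "https": 443}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def canonicalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = (parts.hostname or "").lower()
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{netloc}:{parts.port}"          # keep only non-default ports
    path = parts.path.rstrip("/") or "/"
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k not in TRACKING_PARAMS]
    query = urlencode(sorted(params))              # stable parameter order
    return urlunsplit((scheme, netloc, path, query, ""))  # drop the fragment


class TrapGuard:
    def __init__(self, max_depth=5, max_urls_per_domain=1_000_000, max_dated_urls=100):
        self.max_depth = max_depth
        self.max_urls_per_domain = max_urls_per_domain
        self.max_dated_urls = max_dated_urls
        self.url_counts = Counter()                # domain -> URLs accepted so far
        self.dated_counts = Counter()              # domain -> date-patterned URLs accepted

    def should_enqueue(self, url, depth):
        domain = urlsplit(url).hostname or ""
        if depth > self.max_depth:
            return False
        if self.url_counts[domain] >= self.max_urls_per_domain:
            return False
        if DATE_PATTERN.search(url):
            if self.dated_counts[domain] >= self.max_dated_urls:
                return False
            self.dated_counts[domain] += 1
        self.url_counts[domain] += 1
        return True
```

Canonicalizing before the dedup check matters: `HTTP://Example.com:80/a/b/?b=2&a=1#top` and `http://example.com/a/b?a=1&b=2` should count as the same URL.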
📦 URL Frontier Priority
Not all URLs are equal. PageRank-like scoring: pages from high-authority sites (nytimes.com) are crawled sooner than unknown blogs. Freshness: pages that change frequently (news sites) need a re-crawl every hour, while slower-changing pages (Wikipedia) can be re-crawled weekly. Use a priority queue with multiple tiers. Scheduler: pop from the high-priority tier 80% of the time and from the low-priority tier 20% of the time, so low-priority URLs are never starved.
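The 80/20 tier selection can be sketched with two queues and a weighted coin flip. The `TieredFrontier` class is hypothetical, two tiers are assumed for brevity, and falling back to the other tier when the chosen one is empty is a design choice added here so that workers never idle while work exists.

```python
import random
from collections import deque


class TieredFrontier:
    def __init__(self, high_weight=0.8, rng=None):
        self.high = deque()
        self.low = deque()
        self.high_weight = high_weight        # fraction of pops that prefer the high tier
        self.rng = rng or random.Random()

    def push(self, url, high_priority):
        (self.high if high_priority else self.low).append(url)

    def pop(self):
        """Return the next URL, or None if both tiers are empty."""
        if not self.high and not self.low:
            return None
        if self.rng.random() < self.high_weight:
            first, second = self.high, self.low
        else:
            first, second = self.low, self.high
        # Fall back to the other tier if the preferred one is empty.
        queue = first if first else second
        return queue.popleft()
```

How a URL is assigned to a tier (authority score, observed change rate) is the scoring step described above and is left out of this sketch.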
Scaling considerations
- Kafka URL frontier: partitioned by domain hash for politeness grouping
- Bloom filter sharded in Redis cluster: each shard handles a prefix range of the URL hash
- Worker fleet: auto-scales based on Kafka consumer lag
- DNS cache in Redis prevents repeated resolution (huge latency saving)
- S3 for HTML: cheap, durable, no single point of failure
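Partitioning by domain hash can be sketched as follows. The `partition_for` helper and the partition count are hypothetical, and MD5 is used only as a stable, non-cryptographic hash for illustration. Kafka's default partitioner also hashes the record key, so producing with the domain as the key gives a similar grouping without custom code.

```python
import hashlib
from urllib.parse import urlsplit


def partition_for(url, num_partitions):
    """Map a URL to a partition using a stable hash of its domain."""
    domain = (urlsplit(url).hostname or "").lower()
    digest = hashlib.md5(domain.encode("utf-8")).digest()
    # Python's built-in hash() is salted per process, so a digest is used
    # to keep the mapping identical across workers and restarts.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because every URL of a domain lands in the same partition, a single consumer owns that domain's request rate, which keeps the politeness limit local instead of requiring cross-worker coordination.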
What interviewers expect by level
- Junior: Describe basic crawl loop: fetch → parse links → enqueue. Know robots.txt purpose.
- Mid: Bloom filter dedup, domain-based politeness queues, DNS caching, distributed worker pool.
- Senior: Priority-based frontier, spider trap detection, URL normalization, at-least-once crawl semantics.
- Staff: Full Google-scale design: 5B pages/day, re-crawl scheduling with freshness signals, link graph for PageRank.