38 - Design Web Crawler
Series: System Design & Distributed Systems
Previous: 37 - Design Dropbox or Google Drive | Next: 39 - Design Notification System
1. Requirements
Functional Requirements
| Feature | Details |
|---|---|
| Crawl web pages | Fetch HTML content from URLs |
| Discover new URLs | Extract and follow links |
| Store content | Save page content for indexing |
| Respect robots.txt | Honor crawl restrictions |
| Handle duplicates | Detect and skip duplicate content |
| Recrawl scheduling | Refresh pages based on change frequency |
| Scale | Crawl billions of pages |
Non-Functional Requirements
- Scale: Crawl 1B pages/day (Googlebot scale: 20B/day)
- Politeness: Don't overload any single website
- Robustness: Handle malformed HTML, infinite loops, spider traps
- Extensibility: Easy to add new content types (PDF, images)
- Freshness: Prioritize recrawling frequently changing pages
- Distributed: Run across hundreds of machines
2. Capacity Estimation
Target: 1B pages/day
Pages/sec: 1B / 86,400 sec ≈ 11,600
Avg page size: 100KB (HTML)
Download bandwidth: 11,600 x 100KB = 1.16GB/sec = ~9.3Gbps
Storage/day: 1B x 100KB = 100TB
Avg links per page: 50
New URLs discovered: 1B x 50 = 50B URLs/day (most duplicates)
DNS lookups/sec: ~11,600 (with caching, much less)
| Resource | Estimate |
|---|---|
| Crawl rate | ~12K pages/sec |
| Bandwidth | ~10Gbps |
| Storage/day | ~100TB |
| URL frontier size | ~10B URLs |
| Unique pages (total) | ~100B+ |
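These estimates can be sanity-checked with a few lines of arithmetic (constants are the rounded figures from above):

```python
# Back-of-envelope capacity check using the rounded estimates from this section.
PAGES_PER_DAY = 1_000_000_000
AVG_PAGE_KB = 100
LINKS_PER_PAGE = 50

pages_per_sec = PAGES_PER_DAY / 86_400                            # ~11,600
bandwidth_gbps = pages_per_sec * AVG_PAGE_KB * 1_000 * 8 / 1e9    # ~9.3 Gbps
storage_tb_per_day = PAGES_PER_DAY * AVG_PAGE_KB * 1_000 / 1e12   # 100 TB
urls_discovered = PAGES_PER_DAY * LINKS_PER_PAGE                  # 50B/day, mostly dupes
```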
3. High-Level Architecture
+------------------+
| Seed URLs |
+--------+---------+
|
v
+--------+---------+
| URL Frontier |
| (Priority Queue |
| + Politeness |
| Queues) |
+--------+---------+
|
v
+--------+---------+
| DNS Resolver |
| (Cached) |
+--------+---------+
|
v
+--------+---------+
| Fetcher |
| (HTTP Client) |
+--------+---------+
|
v
+--------+---------+
| Content Parser |
| (HTML Parser) |
+--------+---------+
|
+----+----+
| |
v v
+---+---+ +---+--------+
| Dedup | | Link |
| Check | | Extractor |
+---+---+ +---+--------+
| |
v v
+---+------+ |
| Content | |
| Store | +---> URL Frontier (loop back)
+----------+
4. BFS Traversal
Why BFS Over DFS?
BFS (Breadth-First Search):
Level 0: seed URLs (high authority sites)
Level 1: pages linked from seeds
Level 2: pages linked from level 1
...
Advantages:
- Discovers important pages first (closer to seed = higher authority)
- Broader coverage earlier
- Natural priority ordering
DFS (Depth-First Search):
Follows one path deep before backtracking.
Disadvantages:
- May go deep into a single site before exploring others
- Susceptible to spider traps (infinite depth)
- Poor coverage distribution
Modified BFS with Priority
Not pure BFS -- use priority-based BFS:
Priority factors:
1. PageRank / domain authority
2. Update frequency (news sites > static pages)
3. Depth from seed (shallower = higher priority)
4. Content type (HTML > other)
Priority Queue dequeues highest-priority URL next
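A minimal sketch of priority-ordered BFS using a min-heap; the scoring weights are illustrative, not from any production crawler:

```python
import heapq

def priority(depth, pagerank, change_freq):
    """Lower value = dequeued sooner. Weights here are illustrative."""
    return depth - 2.0 * pagerank - change_freq

frontier = []  # min-heap of (priority, url)

def enqueue(url, depth, pagerank=0.0, change_freq=0.0):
    heapq.heappush(frontier, (priority(depth, pagerank, change_freq), url))

def dequeue():
    return heapq.heappop(frontier)[1]

# A shallow, high-authority news page outranks a deep archive page.
enqueue("https://news.example.com/", depth=0, pagerank=0.9, change_freq=0.8)
enqueue("https://blog.example.com/archive/2011/", depth=5, pagerank=0.1)
```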
5. URL Frontier
The URL frontier is the core data structure -- it manages what to crawl next.
Two-Level Architecture
+------------------------------------------+
| URL Frontier |
+------------------------------------------+
| |
| Front Queues (Priority) |
| +------+ +------+ +------+ |
| | High | | Med | | Low | |
| |Prior.| |Prior.| |Prior.| |
| +--+---+ +--+---+ +--+---+ |
| | | | |
| +--------+--------+ |
| | |
| Back Queues (Politeness) |
| +--------+ +--------+ +--------+ |
| |cnn.com | |bbc.com | |wiki.org| |
| |queue | |queue | |queue | |
| +--+-----+ +--+-----+ +--+-----+ |
| | | | |
+------------------------------------------+
| | |
v v v
Fetcher Fetcher Fetcher
Worker 1 Worker 2 Worker 3
Priority Queue (Front)
Assigns priority based on:
- Domain authority / PageRank
- Content freshness signal
- Business priority (e.g., news sites)
- Recrawl urgency
Queues:
High priority: news sites, frequently updated, high PageRank
Medium priority: regular content sites, blogs
Low priority: deep pages, low PageRank, rarely changing
Politeness Queue (Back)
One queue per domain:
Each domain has a minimum crawl interval (e.g., 1 request/second)
Domain-to-queue mapping:
hash(domain) % N_queues --> queue assignment
Each fetcher worker:
1. Pick a queue
2. Check last fetch time for that domain
3. If enough time elapsed: dequeue and fetch
4. Otherwise: wait or pick another queue
This ensures:
- No domain is overwhelmed
- Crawl-delay from robots.txt is respected
- Parallelism across different domains
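The per-domain politeness check (steps 1-4 above) might be sketched like this, assuming a 1-second minimum interval per domain:

```python
import time
from collections import defaultdict, deque

MIN_INTERVAL = 1.0  # seconds between requests to the same domain

last_fetch = {}                    # domain -> timestamp of last fetch
back_queues = defaultdict(deque)   # domain -> pending URLs (back queues)

def ready(domain, now=None):
    """True if the politeness interval for this domain has elapsed."""
    now = time.monotonic() if now is None else now
    return now - last_fetch.get(domain, 0.0) >= MIN_INTERVAL

def next_url(now=None):
    """Dequeue a URL from any domain that is safe to hit right now."""
    now = time.monotonic() if now is None else now
    for domain, q in back_queues.items():
        if q and ready(domain, now):
            last_fetch[domain] = now
            return q.popleft()
    return None  # every domain is cooling down; caller should wait
```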
Interview Tip: The two-level frontier (priority + politeness) is the key architectural insight. It balances "what's important to crawl" with "what's safe to crawl now."
6. DNS Resolver Cache
Problem
Every URL fetch needs a DNS lookup. At ~12K fetches/sec, an uncached resolver would see ~12K queries/sec -- enough to overwhelm DNS infrastructure and add resolution latency to every fetch.
Solution
DNS Resolution Pipeline:
URL: https://www.example.com/page.html
|
v
Local DNS Cache (in-memory, TTL-based)
|
+----+----+
| |
HIT MISS
| |
v v
Return Query DNS Server
cached |
IP v
Cache result (TTL = min(DNS TTL, 24h))
Return IP
Cache Design
| Property | Value |
|---|---|
| Structure | Hash map: domain -> (IP, expiry) |
| Size | ~10M domains = ~500MB |
| Hit rate | ~90%+ (most fetches are repeat domains) |
| TTL | min(DNS record TTL, 24 hours) |
| Refresh | Background refresh before expiry |
Batch DNS Resolution
- Pre-resolve domains from the frontier in batches
- Use multiple DNS servers for redundancy
- Implement DNS-over-HTTPS for reliability
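A TTL-capped cache as described in the table above can be a few lines; `socket.gethostbyname` stands in for a real resolver here, and the `resolver` parameter exists so the lookup can be swapped out:

```python
import socket
import time

MAX_TTL = 24 * 3600  # cap cached entries at 24 hours

_cache = {}  # domain -> (ip, expiry_timestamp)

def resolve(domain, resolver=socket.gethostbyname, ttl=300, now=None):
    """Return the IP for domain, using the cache while it is fresh.

    `ttl` plays the role of the DNS record's TTL; we cap it at MAX_TTL.
    """
    now = time.time() if now is None else now
    hit = _cache.get(domain)
    if hit and hit[1] > now:
        return hit[0]                               # cache hit
    ip = resolver(domain)                           # cache miss: query DNS
    _cache[domain] = (ip, now + min(ttl, MAX_TTL))  # TTL = min(DNS TTL, 24h)
    return ip
```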
7. Content Parser
Raw HTML
|
v
+---+-------------+
| HTML Parser |
| (jsoup, lxml) |
+---+-------------+
|
+------+------+------+
| | | |
v v v v
Text Meta Links Media
Content data URLs URLs
| | | |
v v v v
Store Store Feed Optional
for for back media
index index to crawl
frontier
Parsing Steps
- Encoding detection: Determine charset (UTF-8, ISO-8859-1, etc.)
- HTML parsing: Build DOM tree, handle malformed HTML
- Text extraction: Strip tags, get visible text content
- Metadata extraction: title, description, keywords, language, author
- Link extraction: href attributes, canonical URLs, redirect targets
- URL normalization: resolve relative URLs, remove fragments, lowercase
URL Normalization
Input URLs that are all the same page:
https://Example.COM/path/page.html
https://example.com/path/page.html#section
https://example.com/path/../path/page.html
https://example.com/path/page.html?utm_source=twitter
Normalized:
https://example.com/path/page.html
Rules:
1. Lowercase scheme and host
2. Remove fragment (#...)
3. Resolve relative paths (../)
4. Remove tracking parameters (utm_*)
5. Remove default ports (:80, :443)
6. Remove trailing slash (configurable)
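Rules 1-5 can be sketched with the standard library; the tracking-parameter list below is illustrative (real deny-lists are much longer):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)  # illustrative; real lists are much longer

def normalize(url):
    scheme, netloc, path, query, _frag = urlsplit(url)  # rule 2: drop fragment
    scheme = scheme.lower()                             # rule 1: lowercase
    netloc = netloc.lower()
    # Rule 5: strip default ports.
    if (scheme, netloc.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        netloc = netloc.rsplit(":", 1)[0]
    # Rule 3: resolve ../ and ./ path segments.
    parts = []
    for seg in path.split("/"):
        if seg == "..":
            if parts:
                parts.pop()
        elif seg not in (".", ""):
            parts.append(seg)
    path = "/" + "/".join(parts)
    # Rule 4: drop tracking parameters.
    kept = [(k, v) for k, v in parse_qsl(query)
            if not k.startswith(TRACKING_PREFIXES)]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))
```

All four example inputs above collapse to the same normalized URL.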
8. Duplicate Detection
URL Deduplication (Bloom Filter)
Before adding URL to frontier, check if already seen:
Bloom Filter:
- Probabilistic "set membership" test
- False positives possible (skip some unseen URLs -- acceptable)
- False negatives impossible (never re-crawl seen URL -- guaranteed)
- Space efficient: 1% FP rate for 10B URLs = ~12GB
Insert: hash URL with k hash functions, set k bits
Query: hash URL, check if all k bits set
All set --> probably seen (skip)
Any unset --> definitely not seen (add to frontier)
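A toy Bloom filter illustrating the insert/query logic above; production crawlers use far larger, partitioned bit arrays:

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, url):
        # Derive k "independent" hash functions by seeding SHA-1 with an index.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, url):
        for p in self._positions(url):       # insert: set k bits
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8))   # query: all k bits set?
                   for p in self._positions(url))
```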
Content Deduplication (SimHash)
Problem: Different URLs can serve identical or near-identical content
- Mirror sites: mirror.example.com vs example.com
- Print versions: example.com/article vs example.com/article?print=1
- Pagination: same intro content repeated across consecutive pages
SimHash (Charikar's technique):
1. Extract features (word n-grams) from page
2. Hash each feature to a 64-bit value
3. Weight and combine hashes into a 64-bit fingerprint
4. Compare fingerprints: Hamming distance <= 3 = near-duplicate
Example:
Page A fingerprint: 1010110001... (64 bits)
Page B fingerprint: 1010110011... (64 bits)
Hamming distance: 1 (only 1 bit differs)
--> Near-duplicate detected!
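Steps 1-4 can be sketched in a few lines. This toy version uses single words as features and MD5 as the feature hash, both simplifications of the real technique (which uses weighted n-grams):

```python
import hashlib
from collections import Counter

def simhash(text, nbits=64):
    """64-bit SimHash over word features (unigrams here for brevity)."""
    v = [0] * nbits
    for word, weight in Counter(text.lower().split()).items():
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(nbits):
            v[i] += weight if (h >> i) & 1 else -weight  # vote per bit
    return sum(1 << i for i in range(nbits) if v[i] > 0)  # sign -> fingerprint

def hamming(a, b):
    return bin(a ^ b).count("1")
```

Two pages differing by one word end up with a small Hamming distance; unrelated pages land near the expected distance of ~32 bits.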
Dedup Decision Matrix
| Hamming Distance | Interpretation | Action |
|---|---|---|
| 0 | Exact duplicate | Skip entirely |
| 1-3 | Near-duplicate | Skip or store one canonical |
| 4-10 | Similar but distinct | Crawl, may cluster |
| > 10 | Different content | Crawl normally |
Interview Tip: Know that Bloom filters handle URL dedup (billions of URLs, need space efficiency) and SimHash handles content dedup (detect mirror/duplicate pages).
9. Distributed Architecture
Multi-Worker Design
+------------------+
| Coordinator |
| (URL assignment, |
| monitoring) |
+--------+---------+
|
+----------------+----------------+
| | |
+------+------+ +------+------+ +------+------+
| Worker 1 | | Worker 2 | | Worker N |
| (Region: US)| | (Region: EU)| |(Region:Asia)|
+------+------+ +------+------+ +------+------+
| | |
+------+------+ +------+------+ +------+------+
| Local | | Local | | Local |
| Frontier | | Frontier | | Frontier |
+------+------+ +------+------+ +------+------+
| | |
+------+------+ +------+------+ +------+------+
| DNS Cache | | DNS Cache | | DNS Cache |
+-------------+ +-------------+ +-------------+
URL Partitioning
Strategy: Partition by domain hash
hash("cnn.com") % N_workers = worker_3
hash("bbc.com") % N_workers = worker_1
Benefits:
- Each worker "owns" a set of domains
- Politeness is naturally enforced per worker
- DNS caches are per-worker (high hit rate)
- robots.txt cached per worker
Alternative: Geographic partitioning
- US workers crawl .com, .us domains
- EU workers crawl .co.uk, .de, .fr domains
- Lower latency for fetching
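Domain-hash partitioning is nearly a one-liner, but the hash must be stable across processes (Python's built-in `hash()` is salted per interpreter run, so it cannot be used here):

```python
import hashlib

def worker_for(domain, n_workers):
    """Stable domain -> worker assignment via a cryptographic hash prefix."""
    h = int.from_bytes(hashlib.md5(domain.encode()).digest()[:8], "big")
    return h % n_workers
```

Note that plain modulo reshuffles most domains when `n_workers` changes; consistent hashing limits that churn if workers are added or removed often.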
Scaling
| Component | Scaling Strategy |
|---|---|
| Frontier | Distributed queue (Redis, Kafka) |
| Fetcher | Add more workers (horizontal) |
| Content store | Distributed storage (HDFS, S3) |
| Dedup (Bloom) | Partitioned Bloom filters |
| DNS cache | Per-worker local cache |
10. robots.txt Respect
Caching Strategy
Before crawling any URL on a domain:
1. Check robots.txt cache for domain
2. If cached and fresh: apply rules
3. If not cached or stale:
a. Fetch robots.txt (with timeout)
b. Parse and cache (TTL = 24 hours)
c. Apply rules
robots.txt fetch failures:
- 4xx (not found): assume no restrictions
- 5xx (server error): assume full restriction (conservative)
- Timeout: retry later, skip domain for now
robots.txt Parser
User-agent: *
Disallow: /admin/
Disallow: /api/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
Parsed rules:
{
"user-agent": "*",
"disallowed": ["/admin/", "/api/"],
"crawl_delay": 2,
"sitemaps": ["https://example.com/sitemap.xml"]
}
Before fetching https://example.com/admin/login:
Path "/admin/login" matches disallow "/admin/" --> SKIP
11. Handling Traps and Edge Cases
Spider Traps
Trap: Infinite URL generation
example.com/page/1
example.com/page/1/2
example.com/page/1/2/3
example.com/page/1/2/3/4
... (infinite depth)
Calendar trap:
example.com/calendar/2024/01/01
example.com/calendar/2024/01/02
... (infinite dates)
Session ID trap:
example.com/page?session=abc123
example.com/page?session=def456
... (infinite unique URLs, same content)
Countermeasures
| Trap Type | Defense |
|---|---|
| Infinite depth | Max URL depth limit (e.g., 15 levels) |
| Calendar/date URLs | Detect date patterns, limit range |
| Session IDs | URL normalization (strip session params) |
| Query permutations | Canonical URL extraction |
| Redirect loops | Max redirect count (e.g., 10) |
| Slow servers | Request timeout (30 seconds) |
| Extremely large pages | Max page size limit (10MB) |
URL Depth Limiting
Seed URL: https://example.com/ (depth 0)
|
+-- /about (depth 1)
| +-- /about/team (depth 2)
| +-- /about/team/john (depth 3)
| +-- ... (depth > max_depth: STOP)
|
+-- /products (depth 1)
+-- /products/widget (depth 2)
Domain-Level Throttling
If a domain generates > 10K unique URLs: flag for review
If content fingerprints are all similar: likely a trap
If average page quality score drops: reduce priority
12. Freshness: Recrawl Scheduling
The Challenge
Cannot recrawl all 100B pages every day. Must prioritize.
Recrawl Priority Formula
recrawl_priority = importance x staleness x change_probability
importance: PageRank, domain authority
staleness: time since last crawl
change_probability: historical change frequency of the page
Change Frequency Estimation
Track historical crawl data:
Page A: changed 8 of last 10 crawls
--> change_freq = 0.8 (crawl frequently)
Page B: changed 1 of last 10 crawls
--> change_freq = 0.1 (crawl rarely)
Adaptive scheduling:
Page A: recrawl every 1 hour
Page B: recrawl every 30 days
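The priority formula and change-frequency estimate above fit in a few lines; the 0.5 default for never-crawled pages is an assumption, not from the text:

```python
def change_probability(changed, crawled):
    """Fraction of past crawls in which the page had changed."""
    return changed / crawled if crawled else 0.5  # unknown page: assume 50%

def recrawl_priority(importance, hours_since_crawl, changed, crawled):
    """recrawl_priority = importance x staleness x change_probability."""
    return importance * hours_since_crawl * change_probability(changed, crawled)
```

With equal importance and staleness, Page A (changed 8 of 10 crawls) scores 8x higher than Page B (changed 1 of 10), so the scheduler recrawls it far sooner.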
Freshness Categories
| Category | Recrawl Interval | Examples |
|---|---|---|
| Real-time | Minutes | News front pages, live scores |
| Frequent | Hours | News articles, social media |
| Regular | Days | Blog posts, product pages |
| Rare | Weeks-Months | Static pages, documentation |
| Archive | Never (manual) | Dead pages, historical content |
Signals for Recrawl Urgency
- RSS/Atom feed updates (subscribe to site feeds)
- Sitemap lastmod timestamps
- HTTP Last-Modified / ETag headers
- Social media mentions (trending topics -> recrawl related pages)
- Webmaster tools ping (site owners notify of updates)
13. Scaling to Billions
Performance Optimizations
| Optimization | Impact |
|---|---|
| HTTP keep-alive | Reuse connections per domain, reduce TCP overhead |
| Parallel fetching | 100s of concurrent connections per worker |
| Compressed transfer | Accept gzip/deflate, reduce bandwidth 70% |
| Conditional GET | If-Modified-Since header, skip unchanged pages |
| HTTP/2 multiplexing | Avoid head-of-line blocking |
| Connection pooling | Pool per domain, limit connections |
Storage Architecture
Crawled Content Pipeline:
Fetcher --> Raw HTML --> Content Store (S3/HDFS)
|
v
+-----+------+
| MapReduce |
| Processing |
+-----+------+
|
+----------+----------+
| | |
Parse & Build Compute
Extract Inverted PageRank
Text Index
Monitoring and Health
Key metrics to track:
- Pages crawled per second (throughput)
- Error rate by type (4xx, 5xx, timeout, DNS failure)
- Average fetch latency per domain
- Frontier size (growing = discovery, shrinking = exhaustion)
- Duplicate rate (too high = wasted work)
- robots.txt block rate per domain
- Content freshness score (how stale is our index?)
14. Complete System Diagram
+-------------+
| Seed URLs |
+------+------+
|
v
+------+------+------+
| URL Frontier |
| +-------+ +------+ |
| |Priority| |Polite| |
| |Queues | |Queues| |
| +-------+ +------+ |
+------+--------------+
|
v
+------+------+ +------------------+
| URL Dedup | | robots.txt Cache |
| (Bloom | | (per domain) |
| Filter) | +--------+---------+
+------+------+ |
| |
v v
+------+------+------+------+------+--------+---------+
| Fetcher Workers (distributed) |
| +--------+ +--------+ +--------+ +--------+ |
| |Worker 1| |Worker 2| |Worker 3| |Worker N| |
| +---+----+ +---+----+ +---+----+ +---+----+ |
+------+------------+------------+------------+--------+
| | | |
+------+-----+-----+-----+ |
| | |
+------+------+ | +------+------+
| DNS Cache | | | HTTP |
| (in-memory) | | | Connection |
+-------------+ | | Pool |
| +-------------+
v
+----------+----------+
| Content Parser |
| +------+ +-------+ |
| |Text | |Link | |
| |Extract| |Extract| |
| +------+ +-------+ |
+----------+----------+
| |
+-----+ +----+
| |
v v
+------+---+ +---+--------+
| Content | | URL |
| Dedup | | Normalizer |
| (SimHash)| +---+--------+
+------+---+ |
| v
v Back to URL Frontier
+------+------+
| Content |
| Store |
| (S3 / HDFS) |
+------+------+
|
v
+------+------+
| Index |
| Pipeline |
| (MapReduce) |
+-------------+
15. Interview Tips
- URL frontier is THE core: Two-level design (priority + politeness) is essential
- BFS not DFS: Explain why breadth-first discovers important pages faster
- Bloom filter for URL dedup: Know the math -- 10B URLs, 1% FP = ~12GB
- SimHash for content dedup: Fingerprint comparison, Hamming distance threshold
- Politeness is not optional: One request per domain per second is the baseline
- robots.txt: Cache per domain, handle failure modes gracefully
- Spider traps: Depth limits, URL normalization, domain-level anomaly detection
- Freshness is a scheduling problem: Prioritize recrawl based on change frequency
- Scale numbers matter: Know 1B pages/day = ~12K pages/sec = ~10Gbps
- Don't forget DNS caching: It's a hidden bottleneck at crawl scale
16. Resources
- Alex Xu - System Design Interview Vol. 1, Chapter 9: Design a Web Crawler
- "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Brin & Page, 1998)
- "IRLbot: Scaling to 6 Billion Pages and Beyond" (Lee et al.)
- "Web Crawling" chapter in Introduction to Information Retrieval (Manning et al.)
- Mercator: A Scalable, Extensible Web Crawler (Heydon & Najork, 1999)
- Bloom Filter tutorial (brilliant.org/wiki/bloom-filter)
- SimHash paper: "Detecting Near-Duplicates for Web Crawling" (Manku et al., 2007)
- Martin Kleppmann - Designing Data-Intensive Applications, Chapter 10 (Batch Processing)