38 - Design Web Crawler

Series: System Design & Distributed Systems Previous: 37 - Design Dropbox or Google Drive | Next: 39 - Design Notification System


1. Requirements

Functional Requirements

Feature              Details
Crawl web pages      Fetch HTML content from URLs
Discover new URLs    Extract and follow links
Store content        Save page content for indexing
Respect robots.txt   Honor crawl restrictions
Handle duplicates    Detect and skip duplicate content
Recrawl scheduling   Refresh pages based on change frequency
Scale                Crawl billions of pages

Non-Functional Requirements

  • Scale: Crawl 1B pages/day (Googlebot scale: 20B/day)
  • Politeness: Don't overload any single website
  • Robustness: Handle malformed HTML, infinite loops, spider traps
  • Extensibility: Easy to add new content types (PDF, images)
  • Freshness: Prioritize recrawling frequently changing pages
  • Distributed: Run across hundreds of machines

2. Capacity Estimation

Target:                 1B pages/day
Pages/sec:              ~11,600 pages/sec
Avg page size:          100KB (HTML)
Download bandwidth:     11,600 x 100KB = 1.16GB/sec = ~9.3Gbps
Storage/day:            1B x 100KB = 100TB
Avg links per page:     50
New URLs discovered:    1B x 50 = 50B URLs/day (mostly duplicates)
DNS lookups/sec:        ~11,600 (with caching, much less)
Resource               Estimate
Crawl rate             ~12K pages/sec
Bandwidth              ~10Gbps
Storage/day            ~100TB
URL frontier size      ~10B URLs
Unique pages (total)   ~100B+
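These estimates are simple arithmetic; a quick sanity check in Python (the constants are the assumptions stated above):

```python
SECONDS_PER_DAY = 86_400
pages_per_day = 1_000_000_000
avg_page_kb = 100
links_per_page = 50

pages_per_sec = pages_per_day / SECONDS_PER_DAY          # ~11,574
bandwidth_gbps = pages_per_sec * avg_page_kb * 8 / 1e6   # KB/s -> Gbps, ~9.3
storage_tb_per_day = pages_per_day * avg_page_kb / 1e9   # ~100 TB
urls_discovered = pages_per_day * links_per_page         # 50B/day, mostly dupes

print(f"{pages_per_sec:,.0f} pages/sec, {bandwidth_gbps:.1f} Gbps, "
      f"{storage_tb_per_day:.0f} TB/day")
```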

3. High-Level Architecture

+------------------+
|   Seed URLs      |
+--------+---------+
         |
         v
+--------+---------+
|   URL Frontier   |
| (Priority Queue  |
|  + Politeness    |
|    Queues)       |
+--------+---------+
         |
         v
+--------+---------+
|   DNS Resolver   |
|   (Cached)       |
+--------+---------+
         |
         v
+--------+---------+
|   Fetcher        |
|   (HTTP Client)  |
+--------+---------+
         |
         v
+--------+---------+
|   Content Parser |
|   (HTML Parser)  |
+--------+---------+
         |
    +----+----+
    |         |
    v         v
+---+---+ +---+--------+
| Dedup | | Link       |
| Check | | Extractor  |
+---+---+ +---+--------+
    |         |
    v         v
+---+------+  |
| Content  |  |
| Store    |  +---> URL Frontier (loop back)
+----------+

4. BFS Traversal

Why BFS Over DFS?

BFS (Breadth-First Search):
  Level 0: seed URLs (high authority sites)
  Level 1: pages linked from seeds
  Level 2: pages linked from level 1
  ...

  Advantages:
  - Discovers important pages first (closer to seed = higher authority)
  - Broader coverage earlier
  - Natural priority ordering

DFS (Depth-First Search):
  Follows one path deep before backtracking.

  Disadvantages:
  - May go deep into a single site before exploring others
  - Susceptible to spider traps (infinite depth)
  - Poor coverage distribution

Modified BFS with Priority

Not pure BFS -- use priority-based BFS:

  Priority factors:
  1. PageRank / domain authority
  2. Update frequency (news sites > static pages)
  3. Depth from seed (shallower = higher priority)
  4. Content type (HTML > other)

  Priority Queue dequeues highest-priority URL next
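A minimal priority frontier can be sketched with a binary heap. The scoring weights below are illustrative, not prescribed by the design:

```python
import heapq
import itertools

class PriorityFrontier:
    """Min-heap on the negated score, so the highest-priority URL pops first."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # stable order for equal scores

    def push(self, url, pagerank, depth, changes_often):
        # Illustrative score: authority, plus a freshness bonus, minus a depth penalty
        score = pagerank + (2.0 if changes_often else 0.0) - 0.5 * depth
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

frontier = PriorityFrontier()
frontier.push("https://news.example/home", pagerank=8.0, depth=0, changes_often=True)
frontier.push("https://blog.example/post", pagerank=3.0, depth=2, changes_often=False)
frontier.push("https://example.com/deep/page", pagerank=3.0, depth=9, changes_often=False)
order = [frontier.pop() for _ in range(3)]  # shallow, fresh, high-authority first
```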

5. URL Frontier

The URL frontier is the core data structure -- it manages what to crawl next.

Two-Level Architecture

+------------------------------------------+
|            URL Frontier                   |
+------------------------------------------+
|                                          |
|   Front Queues (Priority)                |
|   +------+ +------+ +------+            |
|   | High | | Med  | | Low  |            |
|   |Prior.| |Prior.| |Prior.|            |
|   +--+---+ +--+---+ +--+---+            |
|      |        |        |                 |
|      +--------+--------+                 |
|               |                          |
|   Back Queues (Politeness)               |
|   +--------+ +--------+ +--------+      |
|   |cnn.com | |bbc.com | |wiki.org|      |
|   |queue   | |queue   | |queue   |      |
|   +--+-----+ +--+-----+ +--+-----+      |
|      |           |           |           |
+------------------------------------------+
       |           |           |
       v           v           v
    Fetcher     Fetcher     Fetcher
    Worker 1    Worker 2    Worker 3

Priority Queue (Front)

Assigns priority based on:
  - Domain authority / PageRank
  - Content freshness signal
  - Business priority (e.g., news sites)
  - Recrawl urgency

Queues:
  High priority: news sites, frequently updated, high PageRank
  Medium priority: regular content sites, blogs
  Low priority: deep pages, low PageRank, rarely changing

Politeness Queue (Back)

One queue per domain:
  Each domain has a minimum crawl interval (e.g., 1 request/second)

Domain-to-queue mapping:
  hash(domain) % N_queues --> queue assignment

Each fetcher worker:
  1. Pick a queue
  2. Check last fetch time for that domain
  3. If enough time elapsed: dequeue and fetch
  4. Otherwise: wait or pick another queue

This ensures:
  - No domain is overwhelmed
  - Crawl-delay from robots.txt is respected
  - Parallelism across different domains
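The worker loop above reduces to a per-domain rate gate. A single-threaded sketch (domain names and the 1-second interval are illustrative):

```python
import time

class PolitenessGate:
    """Tracks the last fetch time per domain and enforces a minimum interval."""

    def __init__(self, min_interval_sec=1.0):
        self.min_interval = min_interval_sec
        self.last_fetch = {}  # domain -> timestamp of last successful acquire

    def try_acquire(self, domain, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_fetch.get(domain)
        if last is not None and now - last < self.min_interval:
            return False  # too soon -- worker should pick another queue
        self.last_fetch[domain] = now
        return True

gate = PolitenessGate(min_interval_sec=1.0)
gate.try_acquire("cnn.com", now=0.0)  # True: first request to the domain
gate.try_acquire("cnn.com", now=0.3)  # False: only 0.3s elapsed
gate.try_acquire("bbc.com", now=0.3)  # True: a different domain
gate.try_acquire("cnn.com", now=1.5)  # True: interval elapsed
```

A real frontier would load `min_interval_sec` from the domain's robots.txt Crawl-delay when present.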

Interview Tip: The two-level frontier (priority + politeness) is the key architectural insight. It balances "what's important to crawl" with "what's safe to crawl now."


6. DNS Resolver Cache

Problem

Each URL fetch requires a DNS lookup. At ~12K fetches/sec that is ~12K DNS queries/sec; uncached lookups add tens of milliseconds of latency per fetch and can overwhelm upstream resolvers.

Solution

DNS Resolution Pipeline:

  URL: https://www.example.com/page.html
         |
         v
  Local DNS Cache (in-memory, TTL-based)
         |
    +----+----+
    |         |
  HIT       MISS
    |         |
    v         v
  Return   Query DNS Server
  cached     |
  IP         v
           Cache result (TTL = min(DNS TTL, 24h))
           Return IP

Cache Design

Property    Value
Structure   Hash map: domain -> (IP, expiry)
Size        ~10M domains = ~500MB
Hit rate    ~90%+ (most fetches are repeat domains)
TTL         min(DNS record TTL, 24 hours)
Refresh     Background refresh before expiry
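A sketch of the cache described in the table, with the resolver injected so it can be stubbed (the IP and TTL values in the usage are made up):

```python
import time

class DNSCache:
    """In-memory map of domain -> (IP, expiry); TTL capped at 24 hours."""

    MAX_TTL = 24 * 3600

    def __init__(self, resolve_fn):
        self._resolve = resolve_fn  # e.g. a wrapper around socket.getaddrinfo
        self._cache = {}

    def lookup(self, domain, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get(domain)
        if entry is not None and entry[1] > now:
            return entry[0]  # hit: return cached IP
        ip, record_ttl = self._resolve(domain)  # miss or stale: query DNS
        self._cache[domain] = (ip, now + min(record_ttl, self.MAX_TTL))
        return ip

calls = []
def fake_resolve(domain):
    calls.append(domain)
    return "93.184.216.34", 300  # (IP, record TTL in seconds) -- made up

cache = DNSCache(fake_resolve)
cache.lookup("example.com", now=0)    # miss: resolver queried
cache.lookup("example.com", now=100)  # hit: served from memory
cache.lookup("example.com", now=400)  # stale (TTL 300s): resolver queried again
```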

Batch DNS Resolution

  • Pre-resolve domains from the frontier in batches
  • Use multiple DNS servers for redundancy
  • Implement DNS-over-HTTPS for reliability

7. Content Parser

Raw HTML
    |
    v
+---+-------------+
| HTML Parser     |
| (jsoup, lxml)   |
+---+-------------+
    |
    +------+------+------+
    |      |      |      |
    v      v      v      v
  Text   Meta   Links  Media
  Content data   URLs   URLs
    |      |      |      |
    v      v      v      v
  Store  Store  Feed   Optional
  for    for    back   media
  index  index  to     crawl
               frontier

Parsing Steps

  1. Encoding detection: Determine charset (UTF-8, ISO-8859-1, etc.)
  2. HTML parsing: Build DOM tree, handle malformed HTML
  3. Text extraction: Strip tags, get visible text content
  4. Metadata extraction: title, description, keywords, language, author
  5. Link extraction: href attributes, canonical URLs, redirect targets
  6. URL normalization: resolve relative URLs, remove fragments, lowercase

URL Normalization

Input URLs that are all the same page:
  https://Example.COM/path/page.html
  https://example.com/path/page.html#section
  https://example.com/path/../path/page.html
  https://example.com/path/page.html?utm_source=twitter

Normalized:
  https://example.com/path/page.html

Rules:
  1. Lowercase scheme and host
  2. Remove fragment (#...)
  3. Resolve relative paths (../)
  4. Remove tracking parameters (utm_*)
  5. Remove default ports (:80, :443)
  6. Remove trailing slash (configurable)
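The rules above map almost directly onto Python's urllib.parse. A sketch (only utm_* parameters are stripped here, and trailing-slash removal is left to normpath's default behavior):

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()          # rule 1: lowercase scheme
    host = (parts.hostname or "").lower()  # rule 1: lowercase host
    port = parts.port                      # rule 5: drop default ports
    if port and port != {"http": 80, "https": 443}.get(scheme):
        host = f"{host}:{port}"
    # rules 3 and 6: resolve ../ segments, drop trailing slash
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # rule 4: strip tracking parameters
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                       if not k.startswith("utm_")])
    # rule 2: drop the fragment entirely
    return urlunsplit((scheme, host, path, query, ""))
```

All four example URLs above collapse to `https://example.com/path/page.html` under this function.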

8. Duplicate Detection

URL Deduplication (Bloom Filter)

Before adding URL to frontier, check if already seen:

Bloom Filter:
  - Probabilistic "set membership" test
  - False positives possible (skip some unseen URLs -- acceptable)
  - False negatives impossible (never re-crawl seen URL -- guaranteed)
  - Space efficient: 1% FP rate for 10B URLs = ~12GB

  Insert: hash URL with k hash functions, set k bits
  Query:  hash URL, check if all k bits set
    All set --> probably seen (skip)
    Any unset --> definitely not seen (add to frontier)
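A compact Bloom filter sketch using double hashing to derive the k indices (a standard approximation). Sizing follows the optimal formulas m = -n ln(p) / (ln 2)^2 and k = (m/n) ln 2, which give roughly 96G bits = ~12GB for 10B URLs at a 1% false-positive rate, matching the figure above:

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n_items, fp_rate):
        # Optimal sizing: m bits and k hash functions for the target FP rate
        self.m = int(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, url):
        # Double hashing: derive k indices from two 64-bit base hashes
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(url))

bf = BloomFilter(n_items=1000, fp_rate=0.01)
bf.add("https://example.com/page")
"https://example.com/page" in bf         # True: seen
"https://never-crawled.example/x" in bf  # almost certainly False
```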

Content Deduplication (SimHash)

Problem: Different URLs can serve identical or near-identical content
  - Mirror sites: mirror.example.com vs example.com
  - Print versions: example.com/article vs example.com/article?print=1
  - Pagination overlap: page 2 repeating the intro content of page 1

SimHash (Charikar's technique):
  1. Extract features (word n-grams) from page
  2. Hash each feature to a 64-bit value
  3. Weight and combine hashes into a 64-bit fingerprint
  4. Compare fingerprints: Hamming distance <= 3 = near-duplicate

Example:
  Page A fingerprint: 1010110001...  (64 bits)
  Page B fingerprint: 1010110011...  (64 bits)
  Hamming distance: 1 (only 1 bit differs)
  --> Near-duplicate detected!
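A toy SimHash over whitespace tokens with unit weights (real crawlers use weighted word n-grams, per Manku et al.):

```python
import hashlib

def simhash(text, bits=64):
    # Weighted bit-vote over hashed features; here: unigram tokens, weight 1
    vote = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vote[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, v in enumerate(vote):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    return bin(a ^ b).count("1")

doc  = "the quick brown fox jumps over the lazy dog"
near = "the quick brown fox jumped over the lazy dog"
dist = hamming(simhash(doc), simhash(near))  # small: most tokens shared
```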

Dedup Decision Matrix

Hamming Distance   Interpretation         Action
0                  Exact duplicate        Skip entirely
1-3                Near-duplicate         Skip or store one canonical
4-10               Similar but distinct   Crawl, may cluster
> 10               Different content      Crawl normally

Interview Tip: Know that Bloom filters handle URL dedup (billions of URLs, need space efficiency) and SimHash handles content dedup (detect mirror/duplicate pages).


9. Distributed Architecture

Multi-Worker Design

                    +------------------+
                    |   Coordinator    |
                    | (URL assignment, |
                    |  monitoring)     |
                    +--------+---------+
                             |
            +----------------+----------------+
            |                |                |
     +------+------+  +------+------+  +------+------+
     | Worker 1    |  | Worker 2    |  | Worker N    |
      | (Region: US)|  | (Region: EU)|  |(Region:Asia)|
     +------+------+  +------+------+  +------+------+
            |                |                |
     +------+------+  +------+------+  +------+------+
     | Local       |  | Local       |  | Local       |
     | Frontier    |  | Frontier    |  | Frontier    |
     +------+------+  +------+------+  +------+------+
            |                |                |
     +------+------+  +------+------+  +------+------+
     | DNS Cache   |  | DNS Cache   |  | DNS Cache   |
     +-------------+  +-------------+  +-------------+

URL Partitioning

Strategy: Partition by domain hash

  hash("cnn.com") % N_workers = worker_3
  hash("bbc.com") % N_workers = worker_1

Benefits:
  - Each worker "owns" a set of domains
  - Politeness is naturally enforced per worker
  - DNS caches are per-worker (high hit rate)
  - robots.txt cached per worker

Alternative: Geographic partitioning
  - US workers crawl .com, .us domains
  - EU workers crawl .co.uk, .de, .fr domains
  - Lower latency for fetching
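The domain-hash mapping must use a stable digest rather than Python's built-in hash(), which is salted per process. A sketch (the worker count is illustrative):

```python
import hashlib

N_WORKERS = 64  # illustrative cluster size

def worker_for(domain, n_workers=N_WORKERS):
    # Digest-based hash: built-in hash() would map the same domain to
    # different workers on different machines
    h = int.from_bytes(hashlib.sha1(domain.encode()).digest()[:8], "big")
    return h % n_workers
```

Because every URL from one domain lands on the same worker, politeness state, the DNS cache entry, and the robots.txt rules for that domain never need cross-worker coordination.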

Scaling

Component       Scaling Strategy
Frontier        Distributed queue (Redis, Kafka)
Fetcher         Add more workers (horizontal)
Content store   Distributed storage (HDFS, S3)
Dedup (Bloom)   Partitioned Bloom filters
DNS cache       Per-worker local cache

10. robots.txt Respect

Caching Strategy

Before crawling any URL on a domain:
  1. Check robots.txt cache for domain
  2. If cached and fresh: apply rules
  3. If not cached or stale:
     a. Fetch robots.txt (with timeout)
     b. Parse and cache (TTL = 24 hours)
     c. Apply rules

robots.txt fetch failures:
  - 4xx (not found): assume no restrictions
  - 5xx (server error): assume full restriction (conservative)
  - Timeout: retry later, skip domain for now

robots.txt Parser

User-agent: *
Disallow: /admin/
Disallow: /api/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml

Parsed rules:
  {
    "user-agent": "*",
    "disallowed": ["/admin/", "/api/"],
    "crawl_delay": 2,
    "sitemaps": ["https://example.com/sitemap.xml"]
  }

Before fetching https://example.com/admin/login:
  Path "/admin/login" matches disallow "/admin/" --> SKIP
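Python's standard library already implements this parsing and matching; a sketch against the example file above (the crawler name is illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In production: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse("""\
User-agent: *
Disallow: /admin/
Disallow: /api/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
""".splitlines())

rp.can_fetch("MyCrawler", "https://example.com/admin/login")  # False
rp.can_fetch("MyCrawler", "https://example.com/products")     # True
rp.crawl_delay("MyCrawler")                                   # 2
```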

11. Handling Traps and Edge Cases

Spider Traps

Trap: Infinite URL generation

  example.com/page/1
  example.com/page/1/2
  example.com/page/1/2/3
  example.com/page/1/2/3/4
  ... (infinite depth)

Calendar trap:
  example.com/calendar/2024/01/01
  example.com/calendar/2024/01/02
  ... (infinite dates)

Session ID trap:
  example.com/page?session=abc123
  example.com/page?session=def456
  ... (infinite unique URLs, same content)

Countermeasures

Trap Type               Defense
Infinite depth          Max URL depth limit (e.g., 15 levels)
Calendar/date URLs      Detect date patterns, limit range
Session IDs             URL normalization (strip session params)
Query permutations      Canonical URL extraction
Redirect loops          Max redirect count (e.g., 10)
Slow servers            Request timeout (30 seconds)
Extremely large pages   Max page size limit (10MB)

URL Depth Limiting

Seed URL: https://example.com/ (depth 0)
  |
  +-- /about (depth 1)
  |     +-- /about/team (depth 2)
  |           +-- /about/team/john (depth 3)
  |                 +-- ... (depth > max_depth: STOP)
  |
  +-- /products (depth 1)
        +-- /products/widget (depth 2)
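Depth limiting falls out naturally if the frontier stores (url, depth) pairs. A BFS sketch against a simulated infinite trap (`fetch_links` stands in for fetch + parse):

```python
from collections import deque

MAX_DEPTH = 15  # cap from the countermeasures table

def crawl_bfs(seed, fetch_links, max_depth=MAX_DEPTH):
    # BFS that carries link-depth alongside each URL; pages at the cap are
    # fetched but their outlinks are not followed
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand deeper
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Simulated spider trap: every page links one level deeper, forever
seen = crawl_bfs("https://trap.example/page", lambda url: [url + "/1"], max_depth=3)
# The crawl terminates instead of descending forever
```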

Domain-Level Throttling

If a domain generates > 10K unique URLs: flag for review
If content fingerprints are all similar: likely a trap
If average page quality score drops: reduce priority

12. Freshness: Recrawl Scheduling

The Challenge

Cannot recrawl all 100B pages every day. Must prioritize.

Recrawl Priority Formula

recrawl_priority = importance x staleness x change_probability

importance:        PageRank, domain authority
staleness:         time since last crawl
change_probability: historical change frequency of the page

Change Frequency Estimation

Track historical crawl data:

  Page A: changed 8 of last 10 crawls
    --> change_freq = 0.8 (crawl frequently)

  Page B: changed 1 of last 10 crawls
    --> change_freq = 0.1 (crawl rarely)

Adaptive scheduling:
  Page A: recrawl every 1 hour
  Page B: recrawl every 30 days
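The formula and the adaptive example above, as a sketch (the weights and units are illustrative; staleness is measured in hours here):

```python
def change_frequency(history):
    # Fraction of recent crawls on which the page content had changed
    return sum(history) / len(history)

def recrawl_priority(importance, hours_since_crawl, history):
    # recrawl_priority = importance x staleness x change_probability
    return importance * hours_since_crawl * change_frequency(history)

# Page A changed on 8 of its last 10 crawls; Page B on 1 of 10
page_a = recrawl_priority(importance=1.0, hours_since_crawl=4, history=[1] * 8 + [0] * 2)
page_b = recrawl_priority(importance=1.0, hours_since_crawl=4, history=[1] * 1 + [0] * 9)
# Page A scores 8x higher, so the scheduler recrawls it far sooner
```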

Freshness Categories

Category    Recrawl Interval   Examples
Real-time   Minutes            News front pages, live scores
Frequent    Hours              News articles, social media
Regular     Days               Blog posts, product pages
Rare        Weeks-Months       Static pages, documentation
Archive     Never (manual)     Dead pages, historical content

Signals for Recrawl Urgency

  • RSS/Atom feed updates (subscribe to site feeds)
  • Sitemap lastmod timestamps
  • HTTP Last-Modified / ETag headers
  • Social media mentions (trending topics -> recrawl related pages)
  • Webmaster tools ping (site owners notify of updates)

13. Scaling to Billions

Performance Optimizations

Optimization                      Impact
HTTP keep-alive                   Reuse connections per domain, reduce TCP overhead
Parallel fetching                 100s of concurrent connections per worker
Compressed transfer               Accept gzip/deflate, reduce bandwidth ~70%
Conditional GET                   If-Modified-Since header, skip unchanged pages
Head-of-line blocking avoidance   HTTP/2 multiplexing
Connection pooling                Pool per domain, limit connections
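A conditional GET sketch with urllib (the header values in the usage are made up; a real crawler stores the ETag and Last-Modified seen on the previous fetch of each URL):

```python
import urllib.request

def conditional_request(url, etag=None, last_modified=None):
    """Build a GET that lets the server answer 304 Not Modified."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

# On fetch, urllib raises HTTPError with code 304 when the page is unchanged:
# try:
#     with urllib.request.urlopen(req, timeout=30) as resp:
#         body = resp.read()    # page changed: re-parse and re-index
# except urllib.error.HTTPError as e:
#     if e.code != 304:
#         raise                 # 304 means unchanged: skip re-processing
```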

Storage Architecture

Crawled Content Pipeline:

  Fetcher --> Raw HTML --> Content Store (S3/HDFS)
                              |
                              v
                        +-----+------+
                        | MapReduce  |
                        | Processing |
                        +-----+------+
                              |
                   +----------+----------+
                   |          |          |
              Parse &    Build      Compute
              Extract    Inverted   PageRank
              Text       Index

Monitoring and Health

Key metrics to track:
  - Pages crawled per second (throughput)
  - Error rate by type (4xx, 5xx, timeout, DNS failure)
  - Average fetch latency per domain
  - Frontier size (growing = discovery, shrinking = exhaustion)
  - Duplicate rate (too high = wasted work)
  - robots.txt block rate per domain
  - Content freshness score (how stale is our index?)

14. Complete System Diagram

+-------------+
|  Seed URLs  |
+------+------+
       |
       v
+------+------+------+
|    URL Frontier     |
| +-------+ +------+ |
| |Priority| |Polite| |
| |Queues  | |Queues| |
| +-------+ +------+ |
+------+--------------+
       |
       v
+------+------+                    +------------------+
| URL Dedup   |                    | robots.txt Cache |
| (Bloom      |                    | (per domain)     |
| Filter)     |                    +--------+---------+
+------+------+                             |
       |                                    |
       v                                    v
+------+------+------+------+------+--------+---------+
|                Fetcher Workers (distributed)         |
|  +--------+  +--------+  +--------+  +--------+     |
|  |Worker 1|  |Worker 2|  |Worker 3|  |Worker N|     |
|  +---+----+  +---+----+  +---+----+  +---+----+     |
+------+------------+------------+------------+--------+
       |            |            |            |
       +------+-----+-----+-----+            |
              |           |                   |
       +------+------+    |            +------+------+
       | DNS Cache   |    |            | HTTP        |
       | (in-memory) |    |            | Connection  |
       +-------------+    |            | Pool        |
                          |            +-------------+
                          v
               +----------+----------+
               |   Content Parser    |
               | +------+ +-------+ |
               | |Text  | |Link   | |
               | |Extract| |Extract| |
               | +------+ +-------+ |
               +----------+----------+
                    |          |
              +-----+    +----+
              |          |
              v          v
       +------+---+ +---+--------+
       | Content  | | URL        |
       | Dedup    | | Normalizer |
       | (SimHash)| +---+--------+
       +------+---+     |
              |          v
              v     Back to URL Frontier
       +------+------+
       | Content     |
       | Store       |
       | (S3 / HDFS) |
       +------+------+
              |
              v
       +------+------+
       | Index       |
       | Pipeline    |
       | (MapReduce) |
       +-------------+

15. Interview Tips

  1. URL frontier is THE core: Two-level design (priority + politeness) is essential
  2. BFS not DFS: Explain why breadth-first discovers important pages faster
  3. Bloom filter for URL dedup: Know the math -- 10B URLs, 1% FP = ~12GB
  4. SimHash for content dedup: Fingerprint comparison, Hamming distance threshold
  5. Politeness is not optional: One request per domain per second is the baseline
  6. robots.txt: Cache per domain, handle failure modes gracefully
  7. Spider traps: Depth limits, URL normalization, domain-level anomaly detection
  8. Freshness is a scheduling problem: Prioritize recrawl based on change frequency
  9. Scale numbers matter: Know 1B pages/day = ~12K pages/sec = ~10Gbps
  10. Don't forget DNS caching: It's a hidden bottleneck at crawl scale

16. Resources

  • Alex Xu - System Design Interview Vol. 1, Chapter 9: Design a Web Crawler
  • "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Brin & Page, 1998)
  • "IRLbot: Scaling to 6 Billion Pages and Beyond" (Lee et al.)
  • "Web Crawling" chapter in Introduction to Information Retrieval (Manning et al.)
  • Mercator: A Scalable, Extensible Web Crawler (Heydon & Najork, 1999)
  • Bloom Filter tutorial (brilliant.org/wiki/bloom-filter)
  • SimHash paper: "Detecting Near-Duplicates for Web Crawling" (Manku et al., 2007)
  • Martin Kleppmann - Designing Data-Intensive Applications, Chapter 10 (Batch Processing)
