38 - Design Web Crawler

Series: System Design & Distributed Systems Previous: 37 - Design Dropbox or Google Drive | Next: 39 - Design Notification System


1. Requirements

Functional Requirements

Feature              Details
Crawl web pages      Fetch HTML content from URLs
Discover new URLs    Extract and follow links
Store content        Save page content for indexing
Respect robots.txt   Honor crawl restrictions
Handle duplicates    Detect and skip duplicate content
Recrawl scheduling   Refresh pages based on change frequency
Scale                Crawl billions of pages

Non-Functional Requirements

  • Scale: Crawl 1B pages/day (Googlebot scale: 20B/day)
  • Politeness: Don't overload any single website
  • Robustness: Handle malformed HTML, infinite loops, spider traps
  • Extensibility: Easy to add new content types (PDF, images)
  • Freshness: Prioritize recrawling frequently changing pages
  • Distributed: Run across hundreds of machines

2. Capacity Estimation

Target:                 1B pages/day
Pages/sec:              ~11,600 pages/sec
Avg page size:          100KB (HTML)
Download bandwidth:     11,600 x 100KB = 1.16GB/sec = ~9.3Gbps
Storage/day:            1B x 100KB = 100TB
Avg links per page:     50
New URLs discovered:    1B x 50 = 50B URLs/day (mostly duplicates)
DNS lookups/sec:        ~11,600 (with caching, much less)
Resource               Estimate
Crawl rate             ~12K pages/sec
Bandwidth              ~10Gbps
Storage/day            ~100TB
URL frontier size      ~10B URLs
Unique pages (total)   ~100B+
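These estimates are simple arithmetic; a quick sanity check in Python (the constants are the assumptions stated above):

```python
SECONDS_PER_DAY = 86_400
pages_per_day = 1_000_000_000
avg_page_kb = 100
links_per_page = 50

pages_per_sec = pages_per_day / SECONDS_PER_DAY          # ~11,574
bandwidth_gbps = pages_per_sec * avg_page_kb * 8 / 1e6   # KB/s -> Gbps, ~9.3
storage_tb_per_day = pages_per_day * avg_page_kb / 1e9   # ~100 TB
urls_discovered = pages_per_day * links_per_page         # 50B/day, mostly dupes

print(f"{pages_per_sec:,.0f} pages/sec, {bandwidth_gbps:.1f} Gbps, "
      f"{storage_tb_per_day:.0f} TB/day")
```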

3. High-Level Architecture

+------------------+
|   Seed URLs      |
+--------+---------+
         |
         v
+--------+---------+
|   URL Frontier   |
| (Priority Queue  |
|  + Politeness    |
|    Queues)       |
+--------+---------+
         |
         v
+--------+---------+
|   DNS Resolver   |
|   (Cached)       |
+--------+---------+
         |
         v
+--------+---------+
|   Fetcher        |
|   (HTTP Client)  |
+--------+---------+
         |
         v
+--------+---------+
|   Content Parser |
|   (HTML Parser)  |
+--------+---------+
         |
    +----+----+
    |         |
    v         v
+---+---+ +---+--------+
| Dedup | | Link       |
| Check | | Extractor  |
+---+---+ +---+--------+
    |         |
    v         v
+---+------+  |
| Content  |  |
| Store    |  +---> URL Frontier (loop back)
+----------+

4. BFS Traversal

Why BFS Over DFS?

BFS (Breadth-First Search):
  Level 0: seed URLs (high authority sites)
  Level 1: pages linked from seeds
  Level 2: pages linked from level 1
  ...

  Advantages:
  - Discovers important pages first (closer to seed = higher authority)
  - Broader coverage earlier
  - Natural priority ordering

DFS (Depth-First Search):
  Follows one path deep before backtracking.

  Disadvantages:
  - May go deep into a single site before exploring others
  - Susceptible to spider traps (infinite depth)
  - Poor coverage distribution

Modified BFS with Priority

Not pure BFS -- use priority-based BFS:

  Priority factors:
  1. PageRank / domain authority
  2. Update frequency (news sites > static pages)
  3. Depth from seed (shallower = higher priority)
  4. Content type (HTML > other)

  Priority Queue dequeues highest-priority URL next
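A minimal priority frontier can be sketched with a binary heap. The scoring weights below are illustrative, not prescribed by the design:

```python
import heapq
import itertools

class PriorityFrontier:
    """Min-heap on the negated score, so the highest-priority URL pops first."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # stable order for equal scores

    def push(self, url, pagerank, depth, changes_often):
        # Illustrative score: authority, plus a freshness bonus, minus a depth penalty
        score = pagerank + (2.0 if changes_often else 0.0) - 0.5 * depth
        heapq.heappush(self._heap, (-score, next(self._tie), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

frontier = PriorityFrontier()
frontier.push("https://news.example/home", pagerank=8.0, depth=0, changes_often=True)
frontier.push("https://blog.example/post", pagerank=3.0, depth=2, changes_often=False)
frontier.push("https://example.com/deep/page", pagerank=3.0, depth=9, changes_often=False)
order = [frontier.pop() for _ in range(3)]  # shallow, fresh, high-authority first
```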

5. URL Frontier

The URL frontier is the core data structure -- it manages what to crawl next.

Two-Level Architecture

+------------------------------------------+
|            URL Frontier                   |
+------------------------------------------+
|                                          |
|   Front Queues (Priority)                |
|   +------+ +------+ +------+            |
|   | High | | Med  | | Low  |            |
|   |Prior.| |Prior.| |Prior.|            |
|   +--+---+ +--+---+ +--+---+            |
|      |        |        |                 |
|      +--------+--------+                 |
|               |                          |
|   Back Queues (Politeness)               |
|   +--------+ +--------+ +--------+      |
|   |cnn.com | |bbc.com | |wiki.org|      |
|   |queue   | |queue   | |queue   |      |
|   +--+-----+ +--+-----+ +--+-----+      |
|      |           |           |           |
+------------------------------------------+
       |           |           |
       v           v           v
    Fetcher     Fetcher     Fetcher
    Worker 1    Worker 2    Worker 3

Priority Queue (Front)

Assigns priority based on:
  - Domain authority / PageRank
  - Content freshness signal
  - Business priority (e.g., news sites)
  - Recrawl urgency

Queues:
  High priority: news sites, frequently updated, high PageRank
  Medium priority: regular content sites, blogs
  Low priority: deep pages, low PageRank, rarely changing

Politeness Queue (Back)

One queue per domain:
  Each domain has a minimum crawl interval (e.g., 1 request/second)

Domain-to-queue mapping:
  hash(domain) % N_queues --> queue assignment

Each fetcher worker:
  1. Pick a queue
  2. Check last fetch time for that domain
  3. If enough time elapsed: dequeue and fetch
  4. Otherwise: wait or pick another queue

This ensures:
  - No domain is overwhelmed
  - Crawl-delay from robots.txt is respected
  - Parallelism across different domains
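The worker loop above reduces to a per-domain rate gate. A single-threaded sketch (domain names and the 1-second interval are illustrative):

```python
import time

class PolitenessGate:
    """Tracks the last fetch time per domain and enforces a minimum interval."""

    def __init__(self, min_interval_sec=1.0):
        self.min_interval = min_interval_sec
        self.last_fetch = {}  # domain -> timestamp of last successful acquire

    def try_acquire(self, domain, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_fetch.get(domain)
        if last is not None and now - last < self.min_interval:
            return False  # too soon -- worker should pick another queue
        self.last_fetch[domain] = now
        return True

gate = PolitenessGate(min_interval_sec=1.0)
gate.try_acquire("cnn.com", now=0.0)  # True: first request to the domain
gate.try_acquire("cnn.com", now=0.3)  # False: only 0.3s elapsed
gate.try_acquire("bbc.com", now=0.3)  # True: a different domain
gate.try_acquire("cnn.com", now=1.5)  # True: interval elapsed
```

A real frontier would load `min_interval_sec` from the domain's robots.txt Crawl-delay when present.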

Interview Tip: The two-level frontier (priority + politeness) is the key architectural insight. It balances "what's important to crawl" with "what's safe to crawl now."


6. DNS Resolver Cache

Problem

Each URL fetch requires a DNS lookup. At ~12K fetches/sec that is ~12K DNS queries/sec; uncached lookups add tens of milliseconds of latency per fetch and can overwhelm upstream resolvers.

Solution

DNS Resolution Pipeline:

  URL: https://www.example.com/page.html
         |
         v
  Local DNS Cache (in-memory, TTL-based)
         |
    +----+----+
    |         |
  HIT       MISS
    |         |
    v         v
  Return   Query DNS Server
  cached     |
  IP         v
           Cache result (TTL = min(DNS TTL, 24h))
           Return IP

Cache Design

Property    Value
Structure   Hash map: domain -> (IP, expiry)
Size        ~10M domains = ~500MB
Hit rate    ~90%+ (most fetches are repeat domains)
TTL         min(DNS record TTL, 24 hours)
Refresh     Background refresh before expiry
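A sketch of the cache described in the table, with the resolver injected so it can be stubbed (the IP and TTL values in the usage are made up):

```python
import time

class DNSCache:
    """In-memory map of domain -> (IP, expiry); TTL capped at 24 hours."""

    MAX_TTL = 24 * 3600

    def __init__(self, resolve_fn):
        self._resolve = resolve_fn  # e.g. a wrapper around socket.getaddrinfo
        self._cache = {}

    def lookup(self, domain, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get(domain)
        if entry is not None and entry[1] > now:
            return entry[0]  # hit: return cached IP
        ip, record_ttl = self._resolve(domain)  # miss or stale: query DNS
        self._cache[domain] = (ip, now + min(record_ttl, self.MAX_TTL))
        return ip

calls = []
def fake_resolve(domain):
    calls.append(domain)
    return "93.184.216.34", 300  # (IP, record TTL in seconds) -- made up

cache = DNSCache(fake_resolve)
cache.lookup("example.com", now=0)    # miss: resolver queried
cache.lookup("example.com", now=100)  # hit: served from memory
cache.lookup("example.com", now=400)  # stale (TTL 300s): resolver queried again
```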

Batch DNS Resolution

  • Pre-resolve domains from the frontier in batches
  • Use multiple DNS servers for redundancy
  • Implement DNS-over-HTTPS for reliability

7. Content Parser

Raw HTML
    |
    v
+---+-------------+
| HTML Parser     |
| (jsoup, lxml)   |
+---+-------------+
    |
    +------+------+------+
    |      |      |      |
    v      v      v      v
  Text   Meta   Links  Media
  Content data   URLs   URLs
    |      |      |      |
    v      v      v      v
  Store  Store  Feed   Optional
  for    for    back   media
  index  index  to     crawl
               frontier

Parsing Steps

  1. Encoding detection: Determine charset (UTF-8, ISO-8859-1, etc.)
  2. HTML parsing: Build DOM tree, handle malformed HTML
  3. Text extraction: Strip tags, get visible text content
  4. Metadata extraction: title, description, keywords, language, author
  5. Link extraction: href attributes, canonical URLs, redirect targets
  6. URL normalization: resolve relative URLs, remove fragments, lowercase

URL Normalization

Input URLs that are all the same page:
  https://Example.COM/path/page.html
  https://example.com/path/page.html#section
  https://example.com/path/../path/page.html
  https://example.com/path/page.html?utm_source=twitter

Normalized:
  https://example.com/path/page.html

Rules:
  1. Lowercase scheme and host
  2. Remove fragment (#...)
  3. Resolve relative paths (../)
  4. Remove tracking parameters (utm_*)
  5. Remove default ports (:80, :443)
  6. Remove trailing slash (configurable)
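The rules above map almost directly onto Python's urllib.parse. A sketch (only utm_* parameters are stripped here, and trailing-slash removal is left to normpath's default behavior):

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()          # rule 1: lowercase scheme
    host = (parts.hostname or "").lower()  # rule 1: lowercase host
    port = parts.port                      # rule 5: drop default ports
    if port and port != {"http": 80, "https": 443}.get(scheme):
        host = f"{host}:{port}"
    # rules 3 and 6: resolve ../ segments, drop trailing slash
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # rule 4: strip tracking parameters
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                       if not k.startswith("utm_")])
    # rule 2: drop the fragment entirely
    return urlunsplit((scheme, host, path, query, ""))
```

All four example URLs above collapse to `https://example.com/path/page.html` under this function.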

8. Duplicate Detection

URL Deduplication (Bloom Filter)

Before adding URL to frontier, check if already seen:

Bloom Filter:
  - Probabilistic "set membership" test
  - False positives possible (skip some unseen URLs -- acceptable)
  - False negatives impossible (never re-crawl seen URL -- guaranteed)
  - Space efficient: 1% FP rate for 10B URLs = ~12GB

  Insert: hash URL with k hash functions, set k bits
  Query:  hash URL, check if all k bits set
    All set --> probably seen (skip)
    Any unset --> definitely not seen (add to frontier)
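A compact Bloom filter sketch using double hashing to derive the k indices (a standard approximation). Sizing follows the optimal formulas m = -n ln(p) / (ln 2)^2 and k = (m/n) ln 2, which give roughly 96G bits = ~12GB for 10B URLs at a 1% false-positive rate, matching the figure above:

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n_items, fp_rate):
        # Optimal sizing: m bits and k hash functions for the target FP rate
        self.m = int(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, url):
        # Double hashing: derive k indices from two 64-bit base hashes
        digest = hashlib.sha256(url.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(url))

bf = BloomFilter(n_items=1000, fp_rate=0.01)
bf.add("https://example.com/page")
"https://example.com/page" in bf         # True: seen
"https://never-crawled.example/x" in bf  # almost certainly False
```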

Content Deduplication (SimHash)

Problem: Different URLs can serve identical or near-identical content
  - Mirror sites: mirror.example.com vs example.com
  - Print versions: example.com/article vs example.com/article?print=1
  - Pagination overlap: page 2 repeating the intro content of page 1

SimHash (Charikar's technique):
  1. Extract features (word n-grams) from page
  2. Hash each feature to a 64-bit value
  3. Weight and combine hashes into a 64-bit fingerprint
  4. Compare fingerprints: Hamming distance <= 3 = near-duplicate

Example:
  Page A fingerprint: 1010110001...  (64 bits)
  Page B fingerprint: 1010110011...  (64 bits)
  Hamming distance: 1 (only 1 bit differs)
  --> Near-duplicate detected!
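A toy SimHash over whitespace tokens with unit weights (real crawlers use weighted word n-grams, per Manku et al.):

```python
import hashlib

def simhash(text, bits=64):
    # Weighted bit-vote over hashed features; here: unigram tokens, weight 1
    vote = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vote[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, v in enumerate(vote):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    return bin(a ^ b).count("1")

doc  = "the quick brown fox jumps over the lazy dog"
near = "the quick brown fox jumped over the lazy dog"
dist = hamming(simhash(doc), simhash(near))  # small: most tokens shared
```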

Dedup Decision Matrix

Hamming Distance   Interpretation         Action
0                  Exact duplicate        Skip entirely
1-3                Near-duplicate         Skip or store one canonical
4-10               Similar but distinct   Crawl, may cluster
> 10               Different content      Crawl normally

Interview Tip: Know that Bloom filters handle URL dedup (billions of URLs, need space efficiency) and SimHash handles content dedup (detect mirror/duplicate pages).


9. Distributed Architecture

Multi-Worker Design

                    +------------------+
                    |   Coordinator    |
                    | (URL assignment, |
                    |  monitoring)     |
                    +--------+---------+
                             |
            +----------------+----------------+
            |                |                |
     +------+------+  +------+------+  +------+------+
     | Worker 1    |  | Worker 2    |  | Worker N    |
      | (Region: US)|  | (Region: EU)|  |(Region:Asia)|
     +------+------+  +------+------+  +------+------+
            |                |                |
     +------+------+  +------+------+  +------+------+
     | Local       |  | Local       |  | Local       |
     | Frontier    |  | Frontier    |  | Frontier    |
     +------+------+  +------+------+  +------+------+
            |                |                |
     +------+------+  +------+------+  +------+------+
     | DNS Cache   |  | DNS Cache   |  | DNS Cache   |
     +-------------+  +-------------+  +-------------+

URL Partitioning

Strategy: Partition by domain hash

  hash("cnn.com") % N_workers = worker_3
  hash("bbc.com") % N_workers = worker_1

Benefits:
  - Each worker "owns" a set of domains
  - Politeness is naturally enforced per worker
  - DNS caches are per-worker (high hit rate)
  - robots.txt cached per worker

Alternative: Geographic partitioning
  - US workers crawl .com, .us domains
  - EU workers crawl .co.uk, .de, .fr domains
  - Lower latency for fetching
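The domain-hash mapping must use a stable digest rather than Python's built-in hash(), which is salted per process. A sketch (the worker count is illustrative):

```python
import hashlib

N_WORKERS = 64  # illustrative cluster size

def worker_for(domain, n_workers=N_WORKERS):
    # Digest-based hash: built-in hash() would map the same domain to
    # different workers on different machines
    h = int.from_bytes(hashlib.sha1(domain.encode()).digest()[:8], "big")
    return h % n_workers
```

Because every URL from one domain lands on the same worker, politeness state, the DNS cache entry, and the robots.txt rules for that domain never need cross-worker coordination.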

Scaling

Component       Scaling Strategy
Frontier        Distributed queue (Redis, Kafka)
Fetcher         Add more workers (horizontal)
Content store   Distributed storage (HDFS, S3)
Dedup (Bloom)   Partitioned Bloom filters
DNS cache       Per-worker local cache

10. robots.txt Respect

Caching Strategy

Before crawling any URL on a domain:
  1. Check robots.txt cache for domain
  2. If cached and fresh: apply rules
  3. If not cached or stale:
     a. Fetch robots.txt (with timeout)
     b. Parse and cache (TTL = 24 hours)
     c. Apply rules

robots.txt fetch failures:
  - 4xx (not found): assume no restrictions
  - 5xx (server error): assume full restriction (conservative)
  - Timeout: retry later, skip domain for now

robots.txt Parser

User-agent: *
Disallow: /admin/
Disallow: /api/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml

Parsed rules:
  {
    "user-agent": "*",
    "disallowed": ["/admin/", "/api/"],
    "crawl_delay": 2,
    "sitemaps": ["https://example.com/sitemap.xml"]
  }

Before fetching https://example.com/admin/login:
  Path "/admin/login" matches disallow "/admin/" --> SKIP
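Python's standard library already implements this parsing and matching; a sketch against the example file above (the crawler name is illustrative):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In production: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse("""\
User-agent: *
Disallow: /admin/
Disallow: /api/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
""".splitlines())

rp.can_fetch("MyCrawler", "https://example.com/admin/login")  # False
rp.can_fetch("MyCrawler", "https://example.com/products")     # True
rp.crawl_delay("MyCrawler")                                   # 2
```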

11. Handling Traps and Edge Cases

Spider Traps

Trap: Infinite URL generation

  example.com/page/1
  example.com/page/1/2
  example.com/page/1/2/3
  example.com/page/1/2/3/4
  ... (infinite depth)

Calendar trap:
  example.com/calendar/2024/01/01
  example.com/calendar/2024/01/02
  ... (infinite dates)

Session ID trap:
  example.com/page?session=abc123
  example.com/page?session=def456
  ... (infinite unique URLs, same content)

Countermeasures

Trap Type               Defense
Infinite depth          Max URL depth limit (e.g., 15 levels)
Calendar/date URLs      Detect date patterns, limit range
Session IDs             URL normalization (strip session params)
Query permutations      Canonical URL extraction
Redirect loops          Max redirect count (e.g., 10)
Slow servers            Request timeout (30 seconds)
Extremely large pages   Max page size limit (10MB)

URL Depth Limiting

Seed URL: https://example.com/ (depth 0)
  |
  +-- /about (depth 1)
  |     +-- /about/team (depth 2)
  |           +-- /about/team/john (depth 3)
  |                 +-- ... (depth > max_depth: STOP)
  |
  +-- /products (depth 1)
        +-- /products/widget (depth 2)
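Depth limiting falls out naturally if the frontier stores (url, depth) pairs. A BFS sketch against a simulated infinite trap (`fetch_links` stands in for fetch + parse):

```python
from collections import deque

MAX_DEPTH = 15  # cap from the countermeasures table

def crawl_bfs(seed, fetch_links, max_depth=MAX_DEPTH):
    # BFS that carries link-depth alongside each URL; pages at the cap are
    # fetched but their outlinks are not followed
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand deeper
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Simulated spider trap: every page links one level deeper, forever
seen = crawl_bfs("https://trap.example/page", lambda url: [url + "/1"], max_depth=3)
# The crawl terminates instead of descending forever
```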

Domain-Level Throttling

If a domain generates > 10K unique URLs: flag for review
If content fingerprints are all similar: likely a trap
If average page quality score drops: reduce priority

12. Freshness: Recrawl Scheduling

The Challenge

Cannot recrawl all 100B pages every day. Must prioritize.

Recrawl Priority Formula

recrawl_priority = importance x staleness x change_probability

importance:        PageRank, domain authority
staleness:         time since last crawl
change_probability: historical change frequency of the page

Change Frequency Estimation

Track historical crawl data:

  Page A: changed 8 of last 10 crawls
    --> change_freq = 0.8 (crawl frequently)

  Page B: changed 1 of last 10 crawls
    --> change_freq = 0.1 (crawl rarely)

Adaptive scheduling:
  Page A: recrawl every 1 hour
  Page B: recrawl every 30 days
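The formula and the adaptive example above, as a sketch (the weights and units are illustrative; staleness is measured in hours here):

```python
def change_frequency(history):
    # Fraction of recent crawls on which the page content had changed
    return sum(history) / len(history)

def recrawl_priority(importance, hours_since_crawl, history):
    # recrawl_priority = importance x staleness x change_probability
    return importance * hours_since_crawl * change_frequency(history)

# Page A changed on 8 of its last 10 crawls; Page B on 1 of 10
page_a = recrawl_priority(importance=1.0, hours_since_crawl=4, history=[1] * 8 + [0] * 2)
page_b = recrawl_priority(importance=1.0, hours_since_crawl=4, history=[1] * 1 + [0] * 9)
# Page A scores 8x higher, so the scheduler recrawls it far sooner
```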

Freshness Categories

Category    Recrawl Interval   Examples
Real-time   Minutes            News front pages, live scores
Frequent    Hours              News articles, social media
Regular     Days               Blog posts, product pages
Rare        Weeks-Months       Static pages, documentation
Archive     Never (manual)     Dead pages, historical content

Signals for Recrawl Urgency

  • RSS/Atom feed updates (subscribe to site feeds)
  • Sitemap lastmod timestamps
  • HTTP Last-Modified / ETag headers
  • Social media mentions (trending topics -> recrawl related pages)
  • Webmaster tools ping (site owners notify of updates)

13. Scaling to Billions

Performance Optimizations

Optimization                      Impact
HTTP keep-alive                   Reuse connections per domain, reduce TCP overhead
Parallel fetching                 100s of concurrent connections per worker
Compressed transfer               Accept gzip/deflate, reduce bandwidth ~70%
Conditional GET                   If-Modified-Since header, skip unchanged pages
Head-of-line blocking avoidance   HTTP/2 multiplexing
Connection pooling                Pool per domain, limit connections
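A conditional GET sketch with urllib (the header values in the usage are made up; a real crawler stores the ETag and Last-Modified seen on the previous fetch of each URL):

```python
import urllib.request

def conditional_request(url, etag=None, last_modified=None):
    """Build a GET that lets the server answer 304 Not Modified."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req

# On fetch, urllib raises HTTPError with code 304 when the page is unchanged:
# try:
#     with urllib.request.urlopen(req, timeout=30) as resp:
#         body = resp.read()    # page changed: re-parse and re-index
# except urllib.error.HTTPError as e:
#     if e.code != 304:
#         raise                 # 304 means unchanged: skip re-processing
```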

Storage Architecture

Crawled Content Pipeline:

  Fetcher --> Raw HTML --> Content Store (S3/HDFS)
                              |
                              v
                        +-----+------+
                        | MapReduce  |
                        | Processing |
                        +-----+------+
                              |
                   +----------+----------+
                   |          |          |
              Parse &    Build      Compute
              Extract    Inverted   PageRank
              Text       Index

Monitoring and Health

Key metrics to track:
  - Pages crawled per second (throughput)
  - Error rate by type (4xx, 5xx, timeout, DNS failure)
  - Average fetch latency per domain
  - Frontier size (growing = discovery, shrinking = exhaustion)
  - Duplicate rate (too high = wasted work)
  - robots.txt block rate per domain
  - Content freshness score (how stale is our index?)

14. Complete System Diagram

+-------------+
|  Seed URLs  |
+------+------+
       |
       v
+------+------+------+
|    URL Frontier     |
| +-------+ +------+ |
| |Priority| |Polite| |
| |Queues  | |Queues| |
| +-------+ +------+ |
+------+--------------+
       |
       v
+------+------+                    +------------------+
| URL Dedup   |                    | robots.txt Cache |
| (Bloom      |                    | (per domain)     |
| Filter)     |                    +--------+---------+
+------+------+                             |
       |                                    |
       v                                    v
+------+------+------+------+------+--------+---------+
|                Fetcher Workers (distributed)         |
|  +--------+  +--------+  +--------+  +--------+     |
|  |Worker 1|  |Worker 2|  |Worker 3|  |Worker N|     |
|  +---+----+  +---+----+  +---+----+  +---+----+     |
+------+------------+------------+------------+--------+
       |            |            |            |
       +------+-----+-----+-----+            |
              |           |                   |
       +------+------+    |            +------+------+
       | DNS Cache   |    |            | HTTP        |
       | (in-memory) |    |            | Connection  |
       +-------------+    |            | Pool        |
                          |            +-------------+
                          v
               +----------+----------+
               |   Content Parser    |
               | +------+ +-------+ |
               | |Text  | |Link   | |
               | |Extract| |Extract| |
               | +------+ +-------+ |
               +----------+----------+
                    |          |
              +-----+    +----+
              |          |
              v          v
       +------+---+ +---+--------+
       | Content  | | URL        |
       | Dedup    | | Normalizer |
       | (SimHash)| +---+--------+
       +------+---+     |
              |          v
              v     Back to URL Frontier
       +------+------+
       | Content     |
       | Store       |
       | (S3 / HDFS) |
       +------+------+
              |
              v
       +------+------+
       | Index       |
       | Pipeline    |
       | (MapReduce) |
       +-------------+

15. Interview Tips

  1. URL frontier is THE core: Two-level design (priority + politeness) is essential
  2. BFS not DFS: Explain why breadth-first discovers important pages faster
  3. Bloom filter for URL dedup: Know the math -- 10B URLs, 1% FP = ~12GB
  4. SimHash for content dedup: Fingerprint comparison, Hamming distance threshold
  5. Politeness is not optional: One request per domain per second is the baseline
  6. robots.txt: Cache per domain, handle failure modes gracefully
  7. Spider traps: Depth limits, URL normalization, domain-level anomaly detection
  8. Freshness is a scheduling problem: Prioritize recrawl based on change frequency
  9. Scale numbers matter: Know 1B pages/day = ~12K pages/sec = ~10Gbps
  10. Don't forget DNS caching: It's a hidden bottleneck at crawl scale

16. Resources

  • Alex Xu - System Design Interview Vol. 1, Chapter 9: Design a Web Crawler
  • "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Brin & Page, 1998)
  • "IRLbot: Scaling to 6 Billion Pages and Beyond" (Lee et al.)
  • "Web Crawling" chapter in Introduction to Information Retrieval (Manning et al.)
  • Mercator: A Scalable, Extensible Web Crawler (Heydon & Najork, 1999)
  • Bloom Filter tutorial (brilliant.org/wiki/bloom-filter)
  • SimHash paper: "Detecting Near-Duplicates for Web Crawling" (Manku et al., 2007)
  • Martin Kleppmann - Designing Data-Intensive Applications, Chapter 10 (Batch Processing)
