34 - Design Google Search

Series: System Design & Distributed Systems Previous: 33 - Design YouTube or Netflix | Next: 35 - Design Uber or Lyft


1. Requirements

Functional Requirements

| Feature            | Details                                 |
|--------------------|-----------------------------------------|
| Web search         | Return ranked results for text queries  |
| Autocomplete       | Suggest queries as user types           |
| Web crawling       | Discover and index the web continuously |
| Spell correction   | "Did you mean..." suggestions           |
| Knowledge panel    | Structured answers for entities         |
| Image/video search | Multimedia results                      |
| Ads                | Sponsored results alongside organic     |

Non-Functional Requirements

  • Latency: < 200ms end-to-end for search results
  • Availability: 99.999%
  • Scale: 8.5B searches/day, index 100B+ web pages
  • Freshness: Breaking news indexed within minutes
  • Relevance: Most useful results on the first page

2. Capacity Estimation

Searches/day:           8.5B
Search QPS:             ~100K QPS (avg), ~200K peak
Index size:             100B+ pages
Avg page size:          100KB (raw HTML)
Raw web storage:        100B x 100KB = 10PB
Inverted index size:    ~1PB (compressed)
Crawl rate:             ~1B pages/day recrawl

| Resource               | Estimate                   |
|------------------------|----------------------------|
| Search QPS             | ~100K                      |
| Index size             | ~1PB compressed            |
| Crawl bandwidth        | ~100TB/day                 |
| Serving infrastructure | 1M+ servers (Google scale) |

3. System Overview: Three Major Subsystems

+-------------------+    +-------------------+    +-------------------+
|   WEB CRAWLER     |    |   INDEXER         |    |   SEARCH SERVING  |
|                   |    |                   |    |                   |
| Discover & fetch  |--->| Parse, analyze,   |--->| Query processing, |
| web pages         |    | build inverted    |    | ranking, serving  |
|                   |    | index             |    | results           |
+-------------------+    +-------------------+    +-------------------+

4. Web Crawler

BFS Crawling Strategy

Seed URLs
   |
   v
+--+-----------+
| URL Frontier |  (priority queue of URLs to crawl)
+--+-----------+
   |
   v
+--+-----------+
| DNS Resolver |  (cached, batched lookups)
+--+-----------+
   |
   v
+--+-----------+
| Fetcher      |  (HTTP GET, respect robots.txt)
+--+-----------+
   |
   v
+--+-----------+
| Content      |  (extract text, links, metadata)
| Parser       |
+--+-----------+
   |
   +-----------+------------------+
   |           |                  |
   v           v                  v
+--+---+  +---+----------+  +----+-------+
|Dedup |  |Link Extractor|  |Content     |
|Check |  |              |  |Store       |
+--+---+  +---+----------+  +------------+
   |           |
   v           v
 Skip if    Add new URLs
 already    to Frontier
 seen

URL Frontier: Priority + Politeness

                    URL Frontier
                         |
        +----------------+----------------+
        |                                 |
  Priority Queues                   Politeness Queues
  (by page importance)              (per domain rate limit)
        |                                 |
  +-----+-----+                 +---------+---------+
  |     |     |                 |         |         |
  High  Med   Low             cnn.com  bbc.com  wiki.org
  (news (blog (deep            (1 req/  (1 req/  (1 req/
   home  posts links)           sec)     sec)     sec)
   pages)

robots.txt Compliance

# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2

User-agent: Googlebot
Allow: /
Sitemap: https://example.com/sitemap.xml
  • Cache robots.txt per domain (refresh daily)
  • Respect Crawl-delay directive
  • Parse sitemaps for discovery hints

Deduplication

| Method         | Purpose                         | Technique                                      |
|----------------|---------------------------------|------------------------------------------------|
| URL dedup      | Avoid re-fetching same URL      | Bloom filter (approximate), hash table (exact) |
| Content dedup  | Detect mirror/duplicate pages   | SimHash or MinHash fingerprinting              |
| Near-duplicate | Similar but not identical pages | SimHash with Hamming distance threshold        |

Interview Tip: Mention that a Bloom filter with a 1% false-positive rate for 10B URLs needs only ~12GB of memory (about 9.6 bits per element) -- it fits on a single machine.
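
That ~12GB figure falls out of the standard Bloom filter sizing formulas; a quick sketch to verify it (the formulas themselves, not any production implementation):

```python
import math

def bloom_filter_size(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Optimal Bloom filter sizing.

    bits:   m = -n * ln(p) / (ln 2)^2
    hashes: k = (m / n) * ln 2
    """
    m_bits = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
    k_hashes = round(m_bits / n_items * math.log(2))
    return m_bits, k_hashes

bits, hashes = bloom_filter_size(10_000_000_000, 0.01)
print(f"{bits / 8 / 1e9:.1f} GB, {hashes} hash functions")  # 12.0 GB, 7 hash functions
```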


5. Indexer: Inverted Index Construction

What Is an Inverted Index?

Forward Index (document -> words):
  Doc1: "the quick brown fox"
  Doc2: "the lazy brown dog"
  Doc3: "quick fox jumps"

Inverted Index (word -> documents):
  "the"   -> [Doc1:pos0, Doc2:pos0]
  "quick" -> [Doc1:pos1, Doc3:pos0]
  "brown" -> [Doc1:pos2, Doc2:pos2]
  "fox"   -> [Doc1:pos3, Doc3:pos1]
  "lazy"  -> [Doc2:pos1]
  "dog"   -> [Doc2:pos3]
  "jumps" -> [Doc3:pos2]

Index Entry Structure

term -> {
  doc_freq: 3,                    // number of documents containing term
  postings: [
    { doc_id: 1, tf: 2, positions: [0, 15] },
    { doc_id: 7, tf: 1, positions: [42] },
    { doc_id: 23, tf: 5, positions: [3, 8, 19, 33, 55] }
  ]
}
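
The entry structure above can be produced from raw documents in a few lines; a minimal in-memory sketch using the example corpus from this section (real indexers stream to disk and compress the posting lists):

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[dict]]:
    """Map each term to its posting list: (doc_id, term frequency, positions)."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append(
                {"doc_id": doc_id, "tf": len(pos_list), "positions": pos_list}
            )
    return dict(index)

docs = {1: "the quick brown fox", 2: "the lazy brown dog", 3: "quick fox jumps"}
index = build_inverted_index(docs)
# index["quick"] -> [{'doc_id': 1, 'tf': 1, 'positions': [1]},
#                    {'doc_id': 3, 'tf': 1, 'positions': [0]}]
```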

MapReduce for Index Building

Phase 1: MAP
  Input: Raw HTML pages
  Output: (term, {doc_id, position, metadata})

Phase 2: SHUFFLE
  Group by term

Phase 3: REDUCE
  Input: (term, [{doc_id, position}...])
  Output: (term, sorted posting list)

Phase 4: MERGE
  Combine reducer outputs into final index shards
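
The four phases can be mimicked on a single machine; a toy sketch in which `sorted` + `groupby` stand in for the distributed shuffle (real MapReduce runs the mappers and reducers across thousands of workers):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id: int, text: str):
    """MAP: emit (term, (doc_id, position)) for every token in the page."""
    for pos, term in enumerate(text.lower().split()):
        yield term, (doc_id, pos)

def reduce_phase(term: str, postings: list) -> tuple:
    """REDUCE: sort one term's postings by doc_id into a final posting list."""
    return term, sorted(postings)

docs = {1: "the quick brown fox", 3: "quick fox jumps"}

# SHUFFLE: group all mapper output by term (a distributed sort in reality).
emitted = sorted(pair for d, t in docs.items() for pair in map_phase(d, t))
index = dict(
    reduce_phase(term, [posting for _, posting in group])
    for term, group in groupby(emitted, key=itemgetter(0))
)
# index["quick"] -> [(1, 1), (3, 0)]   i.e. doc 1 pos 1, doc 3 pos 0
```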

6. PageRank Algorithm

Concept

A page is important if important pages link to it.

Page A (PR=0.5) -----> Page C
Page B (PR=0.3) -----> Page C
Page D (PR=0.8) -----> Page C

Page C receives PageRank from A, B, D:
PR(C) = d * (PR(A)/outlinks(A) + PR(B)/outlinks(B) + PR(D)/outlinks(D)) + (1-d)/N

where d = damping factor (0.85), N = total pages

Iterative Computation

Initialize: All pages PR = 1/N
Repeat until convergence:
  For each page P:
    PR(P) = (1-d)/N + d * SUM(PR(Q)/L(Q)) for all Q linking to P
    where L(Q) = number of outgoing links from Q
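
The iteration above translates directly to Python; a minimal power-iteration sketch on a made-up four-page graph (it also spreads the rank of dangling pages uniformly, a detail the pseudocode omits):

```python
def pagerank(links: dict[str, list[str]], d: float = 0.85,
             iters: int = 50) -> dict[str, float]:
    """Iterative PageRank. links[p] lists the pages that p links out to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1 / n for p in pages}          # initialize: PR = 1/N
    for _ in range(iters):
        # Rank held by dangling pages (no outlinks) is redistributed uniformly.
        dangling = sum(pr[p] for p in pages if not links[p])
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * (incoming + dangling / n)
        pr = new
    return pr

graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
ranks = pagerank(graph)
# C ends up with the highest rank: three pages link to it, including A.
```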

Modern Ranking Signals (Beyond PageRank)

| Signal              | Weight | Description                       |
|---------------------|--------|-----------------------------------|
| PageRank            | Medium | Link-based authority              |
| Content relevance   | High   | BM25 / TF-IDF match to query      |
| Click-through rate  | High   | How often users click this result |
| Freshness           | Medium | Recency of content                |
| Mobile friendliness | Medium | Responsive design                 |
| Page speed          | Medium | Core Web Vitals                   |
| Domain authority    | Medium | Overall domain trust              |
| HTTPS               | Low    | Secure connection                 |
| User engagement     | High   | Dwell time, bounce rate           |
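
The "content relevance" signal is typically BM25; a sketch of the per-term scoring formula with the standard parameters (k1 = 1.2, b = 0.75 -- the corpus numbers below are invented for illustration):

```python
import math

def bm25_score(tf: int, doc_len: int, avg_doc_len: float,
               n_docs: int, doc_freq: int,
               k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of one query term to one document's relevance score."""
    # Rarer terms get higher inverse document frequency (IDF).
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    # Term frequency saturates (k1) and is normalized by document length (b).
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# A rare term contributes far more than a common one at the same term
# frequency -- the IDF component dominates.
rare = bm25_score(tf=3, doc_len=100, avg_doc_len=120, n_docs=1_000_000, doc_freq=50)
common = bm25_score(tf=3, doc_len=100, avg_doc_len=120, n_docs=1_000_000, doc_freq=500_000)
```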

7. Query Processing

Query Flow

User: "best coffee shops in NYC"
       |
       v
+------+--------+
| Query Parser  |
| - Tokenize    |
| - Remove stop |
|   words       |
| - Stem/lemma  |
+------+--------+
       |
       v
  ["best", "coffee", "shop", "NYC"]
       |
       v
+------+--------+
| Spell Check   |
| "coffe" ->    |
|  "coffee"     |
+------+--------+
       |
       v
+------+--------+
| Query         |
| Expansion     |
| "NYC" ->      |
| "New York     |
|  City"        |
+------+--------+
       |
       v
+------+----------+
| Index Lookup    |
| (parallel across|
|  shards)        |
+------+----------+
       |
       v
+------+----------+
| Scoring &       |
| Ranking         |
| (BM25 + signals)|
+------+----------+
       |
       v
+------+----------+
| Re-ranking      |
| (ML model,      |
|  personalization|
|  diversity)     |
+------+----------+
       |
       v
  Top 10 results + snippets + knowledge panel

Spell Correction

Approach 1: Edit Distance
  "cofee" -> candidates within edit distance 1-2
  -> "coffee" (1 edit), "cone" (2 edits)
  -> Pick most likely based on query frequency

Approach 2: Noisy Channel Model
  P(correction | query) ∝ P(query | correction) * P(correction)   (Bayes' rule; P(query) is a constant)
  - P(correction): language model (how common is the word)
  - P(query | correction): error model (how likely is this typo)
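
Approach 1 can be sketched in a few lines -- a simplified Norvig-style corrector that generates all edit-distance-1 candidates and picks the most frequent one (the frequency table here is illustrative):

```python
def edit_distance1(word: str) -> set[str]:
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return deletes | transposes | replaces | inserts

def correct(query: str, freq: dict[str, int]) -> str:
    """Pick the known candidate with the highest query-log frequency."""
    if query in freq:
        return query
    candidates = edit_distance1(query) & freq.keys()
    return max(candidates, key=freq.get, default=query)

freq = {"coffee": 50_000, "cone": 8_000, "code": 20_000}
print(correct("cofee", freq))  # coffee
```

A full noisy channel model would replace the bare frequency with P(query | correction) * P(correction), weighting candidates by how likely the specific typo is.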

8. Search Serving Architecture

Index Sharding

100B documents split across shards:

Document Space Partitioning:
  Shard 1: docs 1 - 10B
  Shard 2: docs 10B - 20B
  ...
  Shard 10: docs 90B - 100B

Each shard has replicas for availability:
  Shard 1: [Replica A, Replica B, Replica C]

Scatter-Gather Pattern

Query Router
    |
    +-- Query Shard 1, Replica A --> Top 100 results
    +-- Query Shard 2, Replica B --> Top 100 results
    +-- Query Shard 3, Replica C --> Top 100 results
    ...
    +-- Query Shard 10, Replica A -> Top 100 results
    |
    v
Merge & Re-rank top 1000 --> Return top 10
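
The merge step is a k-way top-k selection over the per-shard results; a sketch using `heapq.nlargest` (the scores and doc ids below are made up):

```python
import heapq
from itertools import chain

def scatter_gather(shard_results: list[list[tuple[float, str]]],
                   k: int = 10) -> list[tuple[float, str]]:
    """Gather each shard's top (score, doc_id) hits and keep the global top-k."""
    return heapq.nlargest(k, chain.from_iterable(shard_results), key=lambda r: r[0])

shard1 = [(0.92, "doc17"), (0.80, "doc3")]
shard2 = [(0.95, "doc42"), (0.60, "doc9")]
print(scatter_gather([shard1, shard2], k=3))
# [(0.95, 'doc42'), (0.92, 'doc17'), (0.8, 'doc3')]
```

In production the router also handles shard timeouts (return partial results rather than block) and feeds the merged candidates to the ML re-ranker.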

Caching Layers

| Cache Level        | What's Cached                 | Hit Rate |
|--------------------|-------------------------------|----------|
| Result cache       | Full SERP for popular queries | ~30%     |
| Posting list cache | Frequently accessed terms     | ~80%     |
| Document cache     | Popular document metadata     | ~60%     |
| DNS cache          | Crawler DNS lookups           | ~95%     |

Interview Tip: Mention that ~30% of queries are repeated (head queries). Caching full results for these gives massive savings.


9. Autocomplete

User types: "how to ma"
       |
       v
+------+----------+
| Trie / Prefix   |
| Index           |
| (in-memory)     |
+------+----------+
       |
       v
Candidates:
  "how to make pancakes"     (freq: 50K/day)
  "how to make money online" (freq: 45K/day)
  "how to make a resume"     (freq: 30K/day)
       |
       v
+------+----------+
| Ranking         |
| - Query freq    |
| - Freshness     |
| - Personalized  |
+------+----------+
       |
       v
Top 5 suggestions

Data Structure: Trie with Top-K

Root
 |-- h
 |   |-- o
 |   |   |-- w
 |   |   |   |-- [top5: "how to...", "how old...", ...]
 |   |   |   |-- _ (space)
 |   |   |       |-- t
 |   |   |           |-- o
 |   |   |               |-- [top5: "how to make...", ...]
  • Pre-compute top 5 suggestions at each trie node
  • Update asynchronously from query logs
  • Shard by prefix range for distribution
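
A minimal sketch of the pre-computed top-K trie (K = 5 as above; a real system shards this by prefix and rebuilds the lists asynchronously from query logs):

```python
class TrieNode:
    __slots__ = ("children", "top")

    def __init__(self):
        self.children: dict[str, "TrieNode"] = {}
        self.top: list[tuple[int, str]] = []  # pre-computed (freq, query), best first

class AutocompleteTrie:
    K = 5

    def __init__(self):
        self.root = TrieNode()

    def insert(self, query: str, freq: int) -> None:
        """Walk the prefix path, keeping every node's top-K list up to date."""
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top = sorted(node.top + [(freq, query)], reverse=True)[: self.K]

    def suggest(self, prefix: str) -> list[str]:
        """O(len(prefix)) lookup: read the pre-computed list at the prefix node."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for _, q in node.top]

trie = AutocompleteTrie()
trie.insert("how to make pancakes", 50_000)
trie.insert("how to make money online", 45_000)
trie.insert("how old is the earth", 20_000)
print(trie.suggest("how to ma"))
# ['how to make pancakes', 'how to make money online']
```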

10. Knowledge Graph Panel

Query: "Albert Einstein"
       |
       v
Entity Recognition: identifies "Albert Einstein" as a Person entity
       |
       v
Knowledge Graph Lookup:
  {
    name: "Albert Einstein",
    type: "Person",
    born: "1879-03-14",
    died: "1955-04-18",
    fields: ["Physics", "Mathematics"],
    known_for: ["Theory of Relativity", "E=mc^2"],
    image: "einstein.jpg",
    related: ["Niels Bohr", "Max Planck"]
  }
       |
       v
Render as structured panel alongside search results
  • Knowledge Graph: billions of entities and relationships
  • Sources: Wikipedia, Freebase, structured web data
  • Entity linking: map query terms to canonical entities

11. Freshness: Real-Time Indexing

Two Indexing Pipelines

Batch Pipeline (hours-to-days):
  Crawl --> Parse --> MapReduce --> Full Index Rebuild

Real-Time Pipeline (seconds-to-minutes):
  Breaking news / trending content
       |
       v
  Priority Crawl Queue --> Fast Fetch --> Incremental Indexer
       |                                        |
       v                                        v
  Change Detection                        Real-Time Index
  (RSS feeds, Twitter,                    (small, frequently
   news sitemaps)                          merged with main)

Serving Both Indexes

Query --> Search Router
              |
    +---------+---------+
    |                   |
Main Index          Real-Time Index
(100B docs,         (1M docs,
 updated daily)      updated per minute)
    |                   |
    +---------+---------+
              |
         Merge results, re-rank

12. Ads Integration (Brief)

Search Results Page:
  [Ad] Best Coffee Shops - Visit CoffeeGuide.com
  [Ad] NYC Coffee Tour - Book Now
  ---
  1. Organic result: "Top 10 Coffee Shops in NYC - TimeOut"
  2. Organic result: "Best Coffee Near Me - Yelp"
  ...
  • Ad auction runs in parallel with organic search
  • Advertisers bid on keywords
  • Ranking: bid amount x quality score x expected CTR
  • Ads are clearly labeled, separate from organic results
  • Revenue model: cost-per-click (CPC)
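
The ranking formula in action -- a toy auction showing why the highest bid does not automatically win the top slot (all bids, quality scores, and CTRs here are invented):

```python
def ad_rank(bid_cpc: float, quality_score: float, expected_ctr: float) -> float:
    """Ad rank = bid x quality score x expected CTR (illustrative weighting)."""
    return bid_cpc * quality_score * expected_ctr

ads = {
    "CoffeeGuide.com": ad_rank(bid_cpc=2.50, quality_score=0.9, expected_ctr=0.04),
    "NYCCoffeeTour":   ad_rank(bid_cpc=4.00, quality_score=0.6, expected_ctr=0.03),
}
winner = max(ads, key=ads.get)
# CoffeeGuide.com wins (0.09 vs 0.072) despite the lower bid: its quality
# score and expected CTR more than make up the difference.
```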

13. Complete System Diagram

+-------------+
|   Users     |
+------+------+
       |
+------+------+
| DNS / CDN   |
+------+------+
       |
+------+--------+
| Load Balancer |
+------+--------+
       |
+------+--------+--------+--------+
|               |        |        |
v               v        v        v
+--------+ +--------+ +-----+ +------+
|Search  | |Auto-   | |Ads  | |Image |
|Frontend| |complete| |Svc  | |Search|
+---+----+ +---+----+ +--+--+ +------+
    |          |          |
    v          v          |
+---+----+ +--+------+   |
|Query   | |Trie     |   |
|Process.| |Index    |   |
+---+----+ +---------+   |
    |                     |
    v                     |
+---+-----------+         |
| Index Serving |<--------+
| (Scatter-     |
|  Gather)      |
+---+-----------+
    |
+---+---+---+---+---+
|   |   |   |   |   |
S1  S2  S3  S4 ... SN    (Index Shards, each with replicas)

=== OFFLINE PIPELINE ===

+-----------+     +----------+     +-----------+
| Web       |---->| Content  |---->| Indexer   |
| Crawler   |     | Store    |     | (MapReduce|
| (BFS +    |     | (raw     |     |  + increm)|
|  frontier)|     |  pages)  |     +-----+-----+
+-----------+     +----------+           |
                                   +-----+-----+
                                   | Inverted  |
                                   | Index     |
                                   | Shards    |
                                   +-----------+

14. Interview Tips

  1. Three subsystems: Crawler, Indexer, Search Serving -- structure your answer this way
  2. Inverted index is the core: Be ready to explain it with a concrete example
  3. PageRank is just one signal: Modern search uses hundreds of ranking features
  4. Sharding strategy: Document partitioning vs term partitioning (document is standard)
  5. Freshness matters: Explain the dual pipeline (batch + real-time)
  6. Autocomplete is a separate system: Trie-based, pre-computed, updated asynchronously
  7. Don't forget caching: Result cache for head queries saves enormous compute
  8. Politeness in crawling: Rate limiting per domain, robots.txt -- shows maturity

15. Resources

  • Alex Xu - System Design Interview Vol. 2, Chapter 2: Design Google Search
  • "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Brin & Page, 1998)
  • Google "How Search Works" (developers.google.com/search/docs/fundamentals)
  • "Web Search for a Planet: The Google Cluster Architecture" (Barroso et al., 2003)
  • "Introduction to Information Retrieval" (Manning, Raghavan, Schütze) - free online
  • Google Caffeine: Real-time web indexing architecture
  • Martin Kleppmann - Designing Data-Intensive Applications, Chapter 3 (Storage and Retrieval)
