34 - Design Google Search

Series: System Design & Distributed Systems Previous: 33 - Design YouTube or Netflix | Next: 35 - Design Uber or Lyft


1. Requirements

Functional Requirements

| Feature            | Details                                 |
|--------------------|-----------------------------------------|
| Web search         | Return ranked results for text queries  |
| Autocomplete       | Suggest queries as user types           |
| Web crawling       | Discover and index the web continuously |
| Spell correction   | "Did you mean..." suggestions           |
| Knowledge panel    | Structured answers for entities         |
| Image/video search | Multimedia results                      |
| Ads                | Sponsored results alongside organic     |

Non-Functional Requirements

  • Latency: < 200ms end-to-end for search results
  • Availability: 99.999%
  • Scale: 8.5B searches/day, index 100B+ web pages
  • Freshness: Breaking news indexed within minutes
  • Relevance: Most useful results on the first page

2. Capacity Estimation

Searches/day:           8.5B
Search QPS:             ~100K QPS (avg), ~200K peak
Index size:             100B+ pages
Avg page size:          100KB (raw HTML)
Raw web storage:        100B x 100KB = 10PB
Inverted index size:    ~1PB (compressed)
Crawl rate:             ~1B pages/day recrawl

| Resource               | Estimate                   |
|------------------------|----------------------------|
| Search QPS             | ~100K                      |
| Index size             | ~1PB compressed            |
| Crawl bandwidth        | ~100TB/day                 |
| Serving infrastructure | 1M+ servers (Google scale) |

3. System Overview: Three Major Subsystems

+-------------------+    +-------------------+    +-------------------+
|   WEB CRAWLER     |    |   INDEXER         |    |   SEARCH SERVING  |
|                   |    |                   |    |                   |
| Discover & fetch  |--->| Parse, analyze,   |--->| Query processing, |
| web pages         |    | build inverted    |    | ranking, serving  |
|                   |    | index             |    | results           |
+-------------------+    +-------------------+    +-------------------+

4. Web Crawler

BFS Crawling Strategy

Seed URLs
   |
   v
+--+-----------+
| URL Frontier |  (priority queue of URLs to crawl)
+--+-----------+
   |
   v
+--+-----------+
| DNS Resolver |  (cached, batched lookups)
+--+-----------+
   |
   v
+--+-----------+
| Fetcher      |  (HTTP GET, respect robots.txt)
+--+-----------+
   |
   v
+--+-----------+
| Content      |  (extract text, links, metadata)
| Parser       |
+--+-----------+
   |
   +-----------+------------------+
   |           |                  |
   v           v                  v
+--+---+  +---+----------+  +----+-------+
|Dedup |  |Link Extractor|  |Content     |
|Check |  |              |  |Store       |
+--+---+  +---+----------+  +------------+
   |           |
   v           v
 Skip if    Add new URLs
 already    to Frontier
 seen

URL Frontier: Priority + Politeness

                    URL Frontier
                         |
        +----------------+----------------+
        |                                 |
  Priority Queues                   Politeness Queues
  (by page importance)              (per domain rate limit)
        |                                 |
  +-----+-----+                 +---------+---------+
  |     |     |                 |         |         |
  High  Med   Low             cnn.com  bbc.com  wiki.org
  (news (blog (deep            (1 req/  (1 req/  (1 req/
   home  posts links)           sec)     sec)     sec)
   pages)

robots.txt Compliance

# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2

User-agent: Googlebot
Allow: /
Sitemap: https://example.com/sitemap.xml
  • Cache robots.txt per domain (refresh daily)
  • Respect Crawl-delay directive
  • Parse sitemaps for discovery hints

Deduplication

| Method         | Purpose                         | Technique                                      |
|----------------|---------------------------------|------------------------------------------------|
| URL dedup      | Avoid re-fetching same URL      | Bloom filter (approximate), hash table (exact) |
| Content dedup  | Detect mirror/duplicate pages   | SimHash or MinHash fingerprinting              |
| Near-duplicate | Similar but not identical pages | SimHash with Hamming distance threshold        |

Interview Tip: Mention that a Bloom filter with a 1% false-positive rate for 10B URLs needs only ~12GB of memory (about 9.6 bits per element) -- it fits on a single machine.
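
That ~12GB figure falls out of the standard Bloom filter sizing formulas; a quick sketch to verify it (the formulas themselves, not any production implementation):

```python
import math

def bloom_filter_size(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Optimal Bloom filter sizing.

    bits:   m = -n * ln(p) / (ln 2)^2
    hashes: k = (m / n) * ln 2
    """
    m_bits = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
    k_hashes = round(m_bits / n_items * math.log(2))
    return m_bits, k_hashes

bits, hashes = bloom_filter_size(10_000_000_000, 0.01)
print(f"{bits / 8 / 1e9:.1f} GB, {hashes} hash functions")  # 12.0 GB, 7 hash functions
```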


5. Indexer: Inverted Index Construction

What Is an Inverted Index?

Forward Index (document -> words):
  Doc1: "the quick brown fox"
  Doc2: "the lazy brown dog"
  Doc3: "quick fox jumps"

Inverted Index (word -> documents):
  "the"   -> [Doc1:pos0, Doc2:pos0]
  "quick" -> [Doc1:pos1, Doc3:pos0]
  "brown" -> [Doc1:pos2, Doc2:pos2]
  "fox"   -> [Doc1:pos3, Doc3:pos1]
  "lazy"  -> [Doc2:pos1]
  "dog"   -> [Doc2:pos3]
  "jumps" -> [Doc3:pos2]

Index Entry Structure

term -> {
  doc_freq: 3,                    // number of documents containing term
  postings: [
    { doc_id: 1, tf: 2, positions: [0, 15] },
    { doc_id: 7, tf: 1, positions: [42] },
    { doc_id: 23, tf: 5, positions: [3, 8, 19, 33, 55] }
  ]
}
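
The entry structure above can be produced from raw documents in a few lines; a minimal in-memory sketch using the example corpus from this section (real indexers stream to disk and compress the posting lists):

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[dict]]:
    """Map each term to its posting list: (doc_id, term frequency, positions)."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append(
                {"doc_id": doc_id, "tf": len(pos_list), "positions": pos_list}
            )
    return dict(index)

docs = {1: "the quick brown fox", 2: "the lazy brown dog", 3: "quick fox jumps"}
index = build_inverted_index(docs)
# index["quick"] -> [{'doc_id': 1, 'tf': 1, 'positions': [1]},
#                    {'doc_id': 3, 'tf': 1, 'positions': [0]}]
```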

MapReduce for Index Building

Phase 1: MAP
  Input: Raw HTML pages
  Output: (term, {doc_id, position, metadata})

Phase 2: SHUFFLE
  Group by term

Phase 3: REDUCE
  Input: (term, [{doc_id, position}...])
  Output: (term, sorted posting list)

Phase 4: MERGE
  Combine reducer outputs into final index shards
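
The four phases can be mimicked on a single machine; a toy sketch in which `sorted` + `groupby` stand in for the distributed shuffle (real MapReduce runs the mappers and reducers across thousands of workers):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc_id: int, text: str):
    """MAP: emit (term, (doc_id, position)) for every token in the page."""
    for pos, term in enumerate(text.lower().split()):
        yield term, (doc_id, pos)

def reduce_phase(term: str, postings: list) -> tuple:
    """REDUCE: sort one term's postings by doc_id into a final posting list."""
    return term, sorted(postings)

docs = {1: "the quick brown fox", 3: "quick fox jumps"}

# SHUFFLE: group all mapper output by term (a distributed sort in reality).
emitted = sorted(pair for d, t in docs.items() for pair in map_phase(d, t))
index = dict(
    reduce_phase(term, [posting for _, posting in group])
    for term, group in groupby(emitted, key=itemgetter(0))
)
# index["quick"] -> [(1, 1), (3, 0)]   i.e. doc 1 pos 1, doc 3 pos 0
```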

6. PageRank Algorithm

Concept

A page is important if important pages link to it.

Page A (PR=0.5) -----> Page C
Page B (PR=0.3) -----> Page C
Page D (PR=0.8) -----> Page C

Page C receives PageRank from A, B, D:
PR(C) = d * (PR(A)/outlinks(A) + PR(B)/outlinks(B) + PR(D)/outlinks(D)) + (1-d)/N

where d = damping factor (0.85), N = total pages

Iterative Computation

Initialize: All pages PR = 1/N
Repeat until convergence:
  For each page P:
    PR(P) = (1-d)/N + d * SUM(PR(Q)/L(Q)) for all Q linking to P
    where L(Q) = number of outgoing links from Q
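
The iteration above translates directly to Python; a minimal power-iteration sketch on a made-up four-page graph (it also spreads the rank of dangling pages uniformly, a detail the pseudocode omits):

```python
def pagerank(links: dict[str, list[str]], d: float = 0.85,
             iters: int = 50) -> dict[str, float]:
    """Iterative PageRank. links[p] lists the pages that p links out to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1 / n for p in pages}          # initialize: PR = 1/N
    for _ in range(iters):
        # Rank held by dangling pages (no outlinks) is redistributed uniformly.
        dangling = sum(pr[p] for p in pages if not links[p])
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * (incoming + dangling / n)
        pr = new
    return pr

graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
ranks = pagerank(graph)
# C ends up with the highest rank: three pages link to it, including A.
```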

Modern Ranking Signals (Beyond PageRank)

| Signal              | Weight | Description                       |
|---------------------|--------|-----------------------------------|
| PageRank            | Medium | Link-based authority              |
| Content relevance   | High   | BM25 / TF-IDF match to query      |
| Click-through rate  | High   | How often users click this result |
| Freshness           | Medium | Recency of content                |
| Mobile friendliness | Medium | Responsive design                 |
| Page speed          | Medium | Core Web Vitals                   |
| Domain authority    | Medium | Overall domain trust              |
| HTTPS               | Low    | Secure connection                 |
| User engagement     | High   | Dwell time, bounce rate           |
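
The "content relevance" signal is typically BM25; a sketch of the per-term scoring formula with the standard parameters (k1 = 1.2, b = 0.75 -- the corpus numbers below are invented for illustration):

```python
import math

def bm25_score(tf: int, doc_len: int, avg_doc_len: float,
               n_docs: int, doc_freq: int,
               k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of one query term to one document's relevance score."""
    # Rarer terms get higher inverse document frequency (IDF).
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    # Term frequency saturates (k1) and is normalized by document length (b).
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# A rare term contributes far more than a common one at the same term
# frequency -- the IDF component dominates.
rare = bm25_score(tf=3, doc_len=100, avg_doc_len=120, n_docs=1_000_000, doc_freq=50)
common = bm25_score(tf=3, doc_len=100, avg_doc_len=120, n_docs=1_000_000, doc_freq=500_000)
```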

7. Query Processing

Query Flow

User: "best coffee shops in NYC"
       |
       v
+------+--------+
| Query Parser  |
| - Tokenize    |
| - Remove stop |
|   words       |
| - Stem/lemma  |
+------+--------+
       |
       v
  ["best", "coffee", "shop", "NYC"]
       |
       v
+------+--------+
| Spell Check   |
| "coffe" ->    |
|  "coffee"     |
+------+--------+
       |
       v
+------+--------+
| Query         |
| Expansion     |
| "NYC" ->      |
| "New York     |
|  City"        |
+------+--------+
       |
       v
+------+----------+
| Index Lookup    |
| (parallel across|
|  shards)        |
+------+----------+
       |
       v
+------+----------+
| Scoring &       |
| Ranking         |
| (BM25 + signals)|
+------+----------+
       |
       v
+------+----------+
| Re-ranking      |
| (ML model,      |
|  personalization|
|  diversity)     |
+------+----------+
       |
       v
  Top 10 results + snippets + knowledge panel

Spell Correction

Approach 1: Edit Distance
  "cofee" -> candidates within edit distance 1-2
  -> "coffee" (1 edit), "cone" (2 edits)
  -> Pick most likely based on query frequency

Approach 2: Noisy Channel Model
  P(correction | query) ∝ P(query | correction) * P(correction)   (Bayes' rule; P(query) is a constant)
  - P(correction): language model (how common is the word)
  - P(query | correction): error model (how likely is this typo)
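
Approach 1 can be sketched in a few lines -- a simplified Norvig-style corrector that generates all edit-distance-1 candidates and picks the most frequent one (the frequency table here is illustrative):

```python
def edit_distance1(word: str) -> set[str]:
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return deletes | transposes | replaces | inserts

def correct(query: str, freq: dict[str, int]) -> str:
    """Pick the known candidate with the highest query-log frequency."""
    if query in freq:
        return query
    candidates = edit_distance1(query) & freq.keys()
    return max(candidates, key=freq.get, default=query)

freq = {"coffee": 50_000, "cone": 8_000, "code": 20_000}
print(correct("cofee", freq))  # coffee
```

A full noisy channel model would replace the bare frequency with P(query | correction) * P(correction), weighting candidates by how likely the specific typo is.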

8. Search Serving Architecture

Index Sharding

100B documents split across shards:

Document Space Partitioning:
  Shard 1: docs 1 - 10B
  Shard 2: docs 10B - 20B
  ...
  Shard 10: docs 90B - 100B

Each shard has replicas for availability:
  Shard 1: [Replica A, Replica B, Replica C]

Scatter-Gather Pattern

Query Router
    |
    +-- Query Shard 1, Replica A --> Top 100 results
    +-- Query Shard 2, Replica B --> Top 100 results
    +-- Query Shard 3, Replica C --> Top 100 results
    ...
    +-- Query Shard 10, Replica A -> Top 100 results
    |
    v
Merge & Re-rank top 1000 --> Return top 10
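
The merge step is a k-way top-k selection over the per-shard results; a sketch using `heapq.nlargest` (the scores and doc ids below are made up):

```python
import heapq
from itertools import chain

def scatter_gather(shard_results: list[list[tuple[float, str]]],
                   k: int = 10) -> list[tuple[float, str]]:
    """Gather each shard's top (score, doc_id) hits and keep the global top-k."""
    return heapq.nlargest(k, chain.from_iterable(shard_results), key=lambda r: r[0])

shard1 = [(0.92, "doc17"), (0.80, "doc3")]
shard2 = [(0.95, "doc42"), (0.60, "doc9")]
print(scatter_gather([shard1, shard2], k=3))
# [(0.95, 'doc42'), (0.92, 'doc17'), (0.8, 'doc3')]
```

In production the router also handles shard timeouts (return partial results rather than block) and feeds the merged candidates to the ML re-ranker.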

Caching Layers

| Cache Level        | What's Cached                 | Hit Rate |
|--------------------|-------------------------------|----------|
| Result cache       | Full SERP for popular queries | ~30%     |
| Posting list cache | Frequently accessed terms     | ~80%     |
| Document cache     | Popular document metadata     | ~60%     |
| DNS cache          | Crawler DNS lookups           | ~95%     |

Interview Tip: Mention that ~30% of queries are repeated (head queries). Caching full results for these gives massive savings.


9. Autocomplete

User types: "how to ma"
       |
       v
+------+----------+
| Trie / Prefix   |
| Index           |
| (in-memory)     |
+------+----------+
       |
       v
Candidates:
  "how to make pancakes"     (freq: 50K/day)
  "how to make money online" (freq: 45K/day)
  "how to make a resume"     (freq: 30K/day)
       |
       v
+------+----------+
| Ranking         |
| - Query freq    |
| - Freshness     |
| - Personalized  |
+------+----------+
       |
       v
Top 5 suggestions

Data Structure: Trie with Top-K

Root
 |-- h
 |   |-- o
 |   |   |-- w
 |   |   |   |-- [top5: "how to...", "how old...", ...]
 |   |   |   |-- _ (space)
 |   |   |       |-- t
 |   |   |           |-- o
 |   |   |               |-- [top5: "how to make...", ...]
  • Pre-compute top 5 suggestions at each trie node
  • Update asynchronously from query logs
  • Shard by prefix range for distribution
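
A minimal sketch of the pre-computed top-K trie (K = 5 as above; a real system shards this by prefix and rebuilds the lists asynchronously from query logs):

```python
class TrieNode:
    __slots__ = ("children", "top")

    def __init__(self):
        self.children: dict[str, "TrieNode"] = {}
        self.top: list[tuple[int, str]] = []  # pre-computed (freq, query), best first

class AutocompleteTrie:
    K = 5

    def __init__(self):
        self.root = TrieNode()

    def insert(self, query: str, freq: int) -> None:
        """Walk the prefix path, keeping every node's top-K list up to date."""
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top = sorted(node.top + [(freq, query)], reverse=True)[: self.K]

    def suggest(self, prefix: str) -> list[str]:
        """O(len(prefix)) lookup: read the pre-computed list at the prefix node."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for _, q in node.top]

trie = AutocompleteTrie()
trie.insert("how to make pancakes", 50_000)
trie.insert("how to make money online", 45_000)
trie.insert("how old is the earth", 20_000)
print(trie.suggest("how to ma"))
# ['how to make pancakes', 'how to make money online']
```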

10. Knowledge Graph Panel

Query: "Albert Einstein"
       |
       v
Entity Recognition: identifies "Albert Einstein" as a Person entity
       |
       v
Knowledge Graph Lookup:
  {
    name: "Albert Einstein",
    type: "Person",
    born: "1879-03-14",
    died: "1955-04-18",
    fields: ["Physics", "Mathematics"],
    known_for: ["Theory of Relativity", "E=mc^2"],
    image: "einstein.jpg",
    related: ["Niels Bohr", "Max Planck"]
  }
       |
       v
Render as structured panel alongside search results
  • Knowledge Graph: billions of entities and relationships
  • Sources: Wikipedia, Freebase, structured web data
  • Entity linking: map query terms to canonical entities

11. Freshness: Real-Time Indexing

Two Indexing Pipelines

Batch Pipeline (hours-to-days):
  Crawl --> Parse --> MapReduce --> Full Index Rebuild

Real-Time Pipeline (seconds-to-minutes):
  Breaking news / trending content
       |
       v
  Priority Crawl Queue --> Fast Fetch --> Incremental Indexer
       |                                        |
       v                                        v
  Change Detection                        Real-Time Index
  (RSS feeds, Twitter,                    (small, frequently
   news sitemaps)                          merged with main)

Serving Both Indexes

Query --> Search Router
              |
    +---------+---------+
    |                   |
Main Index          Real-Time Index
(100B docs,         (1M docs,
 updated daily)      updated per minute)
    |                   |
    +---------+---------+
              |
         Merge results, re-rank

12. Ads Integration (Brief)

Search Results Page:
  [Ad] Best Coffee Shops - Visit CoffeeGuide.com
  [Ad] NYC Coffee Tour - Book Now
  ---
  1. Organic result: "Top 10 Coffee Shops in NYC - TimeOut"
  2. Organic result: "Best Coffee Near Me - Yelp"
  ...
  • Ad auction runs in parallel with organic search
  • Advertisers bid on keywords
  • Ranking: bid amount x quality score x expected CTR
  • Ads are clearly labeled, separate from organic results
  • Revenue model: cost-per-click (CPC)
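
The ranking formula in action -- a toy auction showing why the highest bid does not automatically win the top slot (all bids, quality scores, and CTRs here are invented):

```python
def ad_rank(bid_cpc: float, quality_score: float, expected_ctr: float) -> float:
    """Ad rank = bid x quality score x expected CTR (illustrative weighting)."""
    return bid_cpc * quality_score * expected_ctr

ads = {
    "CoffeeGuide.com": ad_rank(bid_cpc=2.50, quality_score=0.9, expected_ctr=0.04),
    "NYCCoffeeTour":   ad_rank(bid_cpc=4.00, quality_score=0.6, expected_ctr=0.03),
}
winner = max(ads, key=ads.get)
# CoffeeGuide.com wins (0.09 vs 0.072) despite the lower bid: its quality
# score and expected CTR more than make up the difference.
```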

13. Complete System Diagram

+-------------+
|   Users     |
+------+------+
       |
+------+------+
| DNS / CDN   |
+------+------+
       |
+------+--------+
| Load Balancer |
+------+--------+
       |
+------+--------+--------+--------+
|               |        |        |
v               v        v        v
+--------+ +--------+ +-----+ +------+
|Search  | |Auto-   | |Ads  | |Image |
|Frontend| |complete| |Svc  | |Search|
+---+----+ +---+----+ +--+--+ +------+
    |          |          |
    v          v          |
+---+----+ +--+------+   |
|Query   | |Trie     |   |
|Process.| |Index    |   |
+---+----+ +---------+   |
    |                     |
    v                     |
+---+-----------+         |
| Index Serving |<--------+
| (Scatter-     |
|  Gather)      |
+---+-----------+
    |
+---+---+---+---+---+
|   |   |   |   |   |
S1  S2  S3  S4 ... SN    (Index Shards, each with replicas)

=== OFFLINE PIPELINE ===

+-----------+     +----------+     +-----------+
| Web       |---->| Content  |---->| Indexer   |
| Crawler   |     | Store    |     | (MapReduce|
| (BFS +    |     | (raw     |     |  + increm)|
|  frontier)|     |  pages)  |     +-----+-----+
+-----------+     +----------+           |
                                   +-----+-----+
                                   | Inverted  |
                                   | Index     |
                                   | Shards    |
                                   +-----------+

14. Interview Tips

  1. Three subsystems: Crawler, Indexer, Search Serving -- structure your answer this way
  2. Inverted index is the core: Be ready to explain it with a concrete example
  3. PageRank is just one signal: Modern search uses hundreds of ranking features
  4. Sharding strategy: Document partitioning vs term partitioning (document is standard)
  5. Freshness matters: Explain the dual pipeline (batch + real-time)
  6. Autocomplete is a separate system: Trie-based, pre-computed, updated asynchronously
  7. Don't forget caching: Result cache for head queries saves enormous compute
  8. Politeness in crawling: Rate limiting per domain, robots.txt -- shows maturity

15. Resources

  • Alex Xu - System Design Interview Vol. 2, Chapter 2: Design Google Search
  • "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Brin & Page, 1998)
  • Google "How Search Works" (developers.google.com/search/docs/fundamentals)
  • "Web Search for a Planet: The Google Cluster Architecture" (Barroso et al., 2003)
  • "Introduction to Information Retrieval" (Manning, Raghavan, Schütze) - free online
  • Google Caffeine: Real-time web indexing architecture
  • Martin Kleppmann - Designing Data-Intensive Applications, Chapter 3 (Storage and Retrieval)
