34 - Design Google Search
Series: System Design & Distributed Systems Previous: 33 - Design YouTube or Netflix | Next: 35 - Design Uber or Lyft
1. Requirements
Functional Requirements
| Feature | Details |
|---|---|
| Web search | Return ranked results for text queries |
| Autocomplete | Suggest queries as user types |
| Web crawling | Discover and index the web continuously |
| Spell correction | "Did you mean..." suggestions |
| Knowledge panel | Structured answers for entities |
| Image/video search | Multimedia results |
| Ads | Sponsored results alongside organic |
Non-Functional Requirements
- Latency: < 200ms end-to-end for search results
- Availability: 99.999%
- Scale: 8.5B searches/day, index 100B+ web pages
- Freshness: Breaking news indexed within minutes
- Relevance: Most useful results on the first page
2. Capacity Estimation
Searches/day: 8.5B
Search QPS: ~100K QPS (avg), ~200K peak
Index size: 100B+ pages
Avg page size: 100KB (raw HTML)
Raw web storage: 100B x 100KB = 10PB
Inverted index size: ~1PB (compressed)
Crawl rate: ~1B pages/day recrawl
| Resource | Estimate |
|---|---|
| Search QPS | ~100K |
| Index size | ~1PB compressed |
| Crawl bandwidth | ~100TB/day |
| Serving infrastructure | 1M+ servers (Google scale) |
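The headline numbers above follow from simple arithmetic; a quick back-of-envelope check in Python (same assumptions as the table):

```python
searches_per_day = 8.5e9
avg_qps = searches_per_day / 86_400            # seconds in a day -> ~98K average QPS
raw_storage_bytes = 100e9 * 100e3              # 100B pages x 100KB raw HTML
print(f"~{avg_qps / 1e3:.0f}K avg QPS, {raw_storage_bytes / 1e15:.0f} PB raw")
# -> ~98K avg QPS, 10 PB raw
```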
3. System Overview: Three Major Subsystems
+-------------------+ +-------------------+ +-------------------+
| WEB CRAWLER | | INDEXER | | SEARCH SERVING |
| | | | | |
| Discover & fetch |--->| Parse, analyze, |--->| Query processing, |
| web pages | | build inverted | | ranking, serving |
| | | index | | results |
+-------------------+ +-------------------+ +-------------------+
4. Web Crawler
BFS Crawling Strategy
Seed URLs
|
v
+--+-----------+
| URL Frontier | (priority queue of URLs to crawl)
+--+-----------+
|
v
+--+-----------+
| DNS Resolver | (cached, batched lookups)
+--+-----------+
|
v
+--+-----------+
| Fetcher | (HTTP GET, respect robots.txt)
+--+-----------+
|
v
+--+-----------+
| Content | (extract text, links, metadata)
| Parser |
+--+-----------+
|
+-----------+------------------+
| | |
v v v
+--+---+ +---+----------+ +----+-------+
|Dedup | |Link Extractor| |Content |
|Check | | | |Store |
+--+---+ +---+----------+ +------------+
| |
v v
Skip if Add new URLs
already to Frontier
seen
URL Frontier: Priority + Politeness
URL Frontier
|
+----------------+----------------+
| |
Priority Queues Politeness Queues
(by page importance) (per domain rate limit)
| |
+-----+-----+ +---------+---------+
| | | | | |
High Med Low cnn.com bbc.com wiki.org
(news (blog (deep (1 req/ (1 req/ (1 req/
home posts links) sec) sec) sec)
pages)
robots.txt Compliance
# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 2
User-agent: Googlebot
Allow: /
Sitemap: https://example.com/sitemap.xml
- Cache robots.txt per domain (refresh daily)
- Respect the Crawl-delay directive
- Parse sitemaps for discovery hints
Deduplication
| Method | Purpose | Technique |
|---|---|---|
| URL dedup | Avoid re-fetching same URL | Bloom filter (approximate), hash table (exact) |
| Content dedup | Detect mirror/duplicate pages | SimHash or MinHash fingerprinting |
| Near-duplicate | Similar but not identical pages | SimHash with Hamming distance threshold |
Interview Tip: Mention that a Bloom filter with a 1% false-positive rate for 10B URLs needs only ~12GB of memory (about 9.6 bits per URL) -- small enough to hold in RAM on a single machine.
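The ~12GB figure comes from the standard Bloom filter sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m/n) ln 2. A minimal sketch to verify:

```python
import math

def bloom_filter_size(n_items, fp_rate):
    """Return (bits, hash_count) for a Bloom filter with the given
    capacity and target false-positive rate."""
    # Optimal bit count: m = -n * ln(p) / (ln 2)^2
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    # Optimal number of hash functions: k = (m / n) * ln 2
    k = round(m / n_items * math.log(2))
    return m, k

bits, hashes = bloom_filter_size(10_000_000_000, 0.01)
print(f"{bits / 8 / 1e9:.1f} GB, {hashes} hash functions")
# -> 12.0 GB, 7 hash functions
```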
5. Indexer: Inverted Index Construction
What Is an Inverted Index?
Forward Index (document -> words):
Doc1: "the quick brown fox"
Doc2: "the lazy brown dog"
Doc3: "quick fox jumps"
Inverted Index (word -> documents):
"the" -> [Doc1:pos0, Doc2:pos0]
"quick" -> [Doc1:pos1, Doc3:pos0]
"brown" -> [Doc1:pos2, Doc2:pos2]
"fox" -> [Doc1:pos3, Doc3:pos1]
"lazy" -> [Doc2:pos1]
"dog" -> [Doc2:pos3]
"jumps" -> [Doc3:pos2]
Index Entry Structure
term -> {
doc_freq: 3, // number of documents containing term
postings: [
{ doc_id: 1, tf: 2, positions: [0, 15] },
{ doc_id: 7, tf: 1, positions: [42] },
{ doc_id: 23, tf: 5, positions: [3, 8, 19, 33, 55] }
]
}
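A toy positional index matching the entry structure above can be built in a few lines. This is a sketch; real indexers also compress posting lists (delta encoding, varints) and skip tokenization subtleties:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: doc_id -> text. Returns term -> {doc_freq, postings}."""
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            positions[term][doc_id].append(pos)
    return {
        term: {
            "doc_freq": len(by_doc),
            "postings": [{"doc_id": d, "tf": len(p), "positions": p}
                         for d, p in sorted(by_doc.items())],
        }
        for term, by_doc in positions.items()
    }

docs = {1: "the quick brown fox", 2: "the lazy brown dog", 3: "quick fox jumps"}
idx = build_inverted_index(docs)
print(idx["quick"]["doc_freq"])   # 2 -- "quick" appears in Doc1 and Doc3
```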
MapReduce for Index Building
Phase 1: MAP
Input: Raw HTML pages
Output: (term, {doc_id, position, metadata})
Phase 2: SHUFFLE
Group by term
Phase 3: REDUCE
Input: (term, [{doc_id, position}...])
Output: (term, sorted posting list)
Phase 4: MERGE
Combine reducer outputs into final index shards
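The four phases can be simulated in-process to make the data flow concrete. This is an illustrative stand-in for a real MapReduce job, not a distributed implementation; `map_phase` and the single-machine shuffle are assumptions:

```python
from itertools import groupby
from operator import itemgetter

# MAP: emit (term, (doc_id, position)) pairs for each document
def map_phase(doc_id, text):
    for pos, term in enumerate(text.lower().split()):
        yield term, (doc_id, pos)

docs = {1: "quick brown fox", 2: "lazy brown dog"}
pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]

# SHUFFLE: group all emitted pairs by term
pairs.sort(key=itemgetter(0))

# REDUCE: one sorted posting list per term
index = {term: sorted(v for _, v in group)
         for term, group in groupby(pairs, key=itemgetter(0))}
print(index["brown"])  # [(1, 1), (2, 1)]
```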
6. PageRank Algorithm
Concept
A page is important if important pages link to it.
Page A (PR=0.5) -----> Page C
Page B (PR=0.3) -----> Page C
Page D (PR=0.8) -----> Page C
Page C receives PageRank from A, B, D:
PR(C) = d * (PR(A)/outlinks(A) + PR(B)/outlinks(B) + PR(D)/outlinks(D)) + (1-d)/N
where d = damping factor (0.85), N = total pages
Iterative Computation
Initialize: All pages PR = 1/N
Repeat until convergence:
For each page P:
PR(P) = (1-d)/N + d * SUM(PR(Q)/L(Q)) for all Q linking to P
where L(Q) = number of outgoing links from Q
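The iteration above translates almost directly to code. A minimal sketch (the adjacency-list input and fixed iteration count are simplifications; production systems check convergence and handle dangling pages explicitly):

```python
def pagerank(links, d=0.85, iters=50):
    """Iterative PageRank. links: page -> list of pages it links to.
    Dangling pages (no outlinks) simply leak rank in this sketch."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # initialize: PR = 1/N
    for _ in range(iters):
        nxt = {p: (1 - d) / n for p in pages}  # teleport term
        for q, outs in links.items():
            if outs:
                share = pr[q] / len(outs)      # PR(Q) / L(Q)
                for p in outs:
                    nxt[p] += d * share
        pr = nxt
    return pr

# C is linked by both A and B, so it ends up with the highest rank
pr = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```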
Modern Ranking Signals (Beyond PageRank)
| Signal | Weight | Description |
|---|---|---|
| PageRank | Medium | Link-based authority |
| Content relevance | High | BM25 / TF-IDF match to query |
| Click-through rate | High | How often users click this result |
| Freshness | Medium | Recency of content |
| Mobile friendliness | Medium | Responsive design |
| Page speed | Medium | Core Web Vitals |
| Domain authority | Medium | Overall domain trust |
| HTTPS | Low | Secure connection |
| User engagement | High | Dwell time, bounce rate |
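Content relevance in the table is typically a BM25 score. A minimal sketch of the formula, using the common defaults k1=1.5 and b=0.75 (the parameter names and the Lucene-style IDF variant here are assumptions, not the production formula):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len,
               doc_freq, n_docs, k1=1.5, b=0.75):
    """BM25 score of one document for a query.
    doc_tf: term -> count in this doc; doc_freq: term -> # docs containing it."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # absent terms contribute nothing
        df = doc_freq[term]
        # Lucene-style IDF: rare terms weigh more, always non-negative
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation with document-length normalization
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score

doc_tf = {"coffee": 3, "shop": 1}
s = bm25_score(["coffee", "shop", "nyc"], doc_tf, doc_len=120, avg_doc_len=100,
               doc_freq={"coffee": 40, "shop": 80, "nyc": 10}, n_docs=1000)
```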
7. Query Processing
Query Flow
User: "best coffee shops in NYC"
|
v
+------+--------+
| Query Parser |
| - Tokenize |
| - Remove stop |
| words |
| - Stem/lemma |
+------+--------+
|
v
["best", "coffee", "shop", "NYC"]
|
v
+------+--------+
| Spell Check |
| "coffe" -> |
| "coffee" |
+------+--------+
|
v
+------+--------+
| Query |
| Expansion |
| "NYC" -> |
| "New York |
| City" |
+------+--------+
|
v
+------+----------+
| Index Lookup |
| (parallel across|
| shards) |
+------+----------+
|
v
+------+----------+
| Scoring & |
| Ranking |
| (BM25 + signals)|
+------+----------+
|
v
+------+----------+
| Re-ranking |
| (ML model, |
| personalization|
| diversity) |
+------+----------+
|
v
Top 10 results + snippets + knowledge panel
Spell Correction
Approach 1: Edit Distance
"cofee" -> candidates within edit distance 1-2
-> "coffee" (1 edit), "cone" (2 edits)
-> Pick most likely based on query frequency
Approach 2: Noisy Channel Model
P(correction | query) ∝ P(query | correction) * P(correction)   (Bayes' rule; P(query) is the same for all candidates)
- P(correction): language model (how common is the word)
- P(query | correction): error model (how likely is this typo)
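The two approaches combine naturally: generate edit-distance-1 candidates, then pick the one with the highest prior. A Norvig-style sketch (the `query_freq` table is a made-up stand-in for real query-log counts):

```python
def edits1(word):
    """All strings at edit distance 1: deletes, transposes, replaces, inserts."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts    = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

# P(correction) approximated by query-log frequency (illustrative numbers)
query_freq = {"coffee": 500_000, "cone": 80_000, "code": 300_000}

def correct(word):
    candidates = edits1(word) & query_freq.keys() or {word}
    return max(candidates, key=lambda w: query_freq.get(w, 0))

print(correct("cofee"))  # -> coffee
```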
8. Search Serving Architecture
Index Sharding
100B documents split across shards:
Document Space Partitioning:
Shard 1: docs 1 - 10B
Shard 2: docs 10B - 20B
...
Shard 10: docs 90B - 100B
Each shard has replicas for availability:
Shard 1: [Replica A, Replica B, Replica C]
Scatter-Gather Pattern
Query Router
|
+-- Query Shard 1, Replica A --> Top 100 results
+-- Query Shard 2, Replica B --> Top 100 results
+-- Query Shard 3, Replica C --> Top 100 results
...
+-- Query Shard 10, Replica A -> Top 100 results
|
v
Merge & Re-rank top 1000 --> Return top 10
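The gather step is just a k-way merge of per-shard top lists. A minimal sketch using (score, doc_id) tuples (the shard result shape is an assumption for illustration):

```python
import heapq
from itertools import chain

def gather(shard_results, k=10):
    """Merge each shard's local top results into a global top-k by score."""
    return heapq.nlargest(k, chain.from_iterable(shard_results),
                          key=lambda hit: hit[0])

shard1 = [(0.95, "doc_a"), (0.80, "doc_b")]
shard2 = [(0.90, "doc_c"), (0.70, "doc_d")]
print(gather([shard1, shard2], k=3))
# [(0.95, 'doc_a'), (0.9, 'doc_c'), (0.8, 'doc_b')]
```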
Caching Layers
| Cache Level | What's Cached | Hit Rate |
|---|---|---|
| Result cache | Full SERP for popular queries | ~30% |
| Posting list cache | Frequently accessed terms | ~80% |
| Document cache | Popular document metadata | ~60% |
| DNS cache | Crawler DNS lookups | ~95% |
Interview Tip: Mention that ~30% of queries are repeated (head queries). Caching full results for these gives massive savings.
9. Autocomplete
User types: "how to ma"
|
v
+------+----------+
| Trie / Prefix |
| Index |
| (in-memory) |
+------+----------+
|
v
Candidates:
"how to make pancakes" (freq: 50K/day)
"how to make money online" (freq: 45K/day)
"how to make a resume" (freq: 30K/day)
|
v
+------+----------+
| Ranking |
| - Query freq |
| - Freshness |
| - Personalized |
+------+----------+
|
v
Top 5 suggestions
Data Structure: Trie with Top-K
Root
|-- h
| |-- o
| | |-- w
| | | |-- [top5: "how to...", "how old...", ...]
| | | |-- _ (space)
| | | |-- t
| | | |-- o
| | | |-- [top5: "how to make...", ...]
- Pre-compute top 5 suggestions at each trie node
- Update asynchronously from query logs
- Shard by prefix range for distribution
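The bullets above can be sketched as a trie that maintains a small top-k heap at every node on the insert path, so lookups never traverse the subtree (class and method names are illustrative):

```python
import heapq

class TrieNode:
    __slots__ = ("children", "top")
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.top = []        # min-heap of (freq, query): the k best completions

class AutocompleteTrie:
    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k

    def insert(self, query, freq):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            # Maintain the k most frequent completions at every node on the path
            heapq.heappush(node.top, (freq, query))
            if len(node.top) > self.k:
                heapq.heappop(node.top)

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Pre-computed answer: O(prefix length) walk, no subtree traversal
        return [q for _, q in sorted(node.top, reverse=True)]

trie = AutocompleteTrie()
trie.insert("how to make pancakes", 50_000)
trie.insert("how to make money online", 45_000)
trie.insert("how old is the earth", 20_000)
print(trie.suggest("how to ma"))
# ['how to make pancakes', 'how to make money online']
```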
10. Knowledge Graph Panel
Query: "Albert Einstein"
|
v
Entity Recognition: identifies "Albert Einstein" as a Person entity
|
v
Knowledge Graph Lookup:
{
name: "Albert Einstein",
type: "Person",
born: "1879-03-14",
died: "1955-04-18",
fields: ["Physics", "Mathematics"],
known_for: ["Theory of Relativity", "E=mc^2"],
image: "einstein.jpg",
related: ["Niels Bohr", "Max Planck"]
}
|
v
Render as structured panel alongside search results
- Knowledge Graph: billions of entities and relationships
- Sources: Wikipedia, Freebase, structured web data
- Entity linking: map query terms to canonical entities
11. Freshness: Real-Time Indexing
Two Indexing Pipelines
Batch Pipeline (hours-to-days):
Crawl --> Parse --> MapReduce --> Full Index Rebuild
Real-Time Pipeline (seconds-to-minutes):
Breaking news / trending content
|
v
Priority Crawl Queue --> Fast Fetch --> Incremental Indexer
| |
v v
Change Detection Real-Time Index
(RSS feeds, Twitter, (small, frequently
news sitemaps) merged with main)
Serving Both Indexes
Query --> Search Router
|
+---------+---------+
| |
Main Index Real-Time Index
(100B docs, (1M docs,
updated daily) updated per minute)
| |
+---------+---------+
|
Merge results, re-rank
12. Ads Integration (Brief)
Search Results Page:
[Ad] Best Coffee Shops - Visit CoffeeGuide.com
[Ad] NYC Coffee Tour - Book Now
---
1. Organic result: "Top 10 Coffee Shops in NYC - TimeOut"
2. Organic result: "Best Coffee Near Me - Yelp"
...
- Ad auction runs in parallel with organic search
- Advertisers bid on keywords
- Ranking: bid amount x quality score x expected CTR
- Ads are clearly labeled, separate from organic results
- Revenue model: cost-per-click (CPC)
13. Complete System Diagram
+-------------+
| Users |
+------+------+
|
+------+------+
| DNS / CDN |
+------+------+
|
+------+--------+
| Load Balancer |
+------+--------+
|
+------+--------+--------+--------+
| | | |
v v v v
+--------+ +--------+ +-----+ +------+
|Search | |Auto- | |Ads | |Image |
|Frontend| |complete| |Svc | |Search|
+---+----+ +---+----+ +--+--+ +------+
| | |
v v |
+---+----+ +--+------+ |
|Query | |Trie | |
|Process.| |Index | |
+---+----+ +---------+ |
| |
v |
+---+-----------+ |
| Index Serving |<--------+
| (Scatter- |
| Gather) |
+---+-----------+
|
+---+---+---+---+---+
| | | | | |
S1 S2 S3 S4 ... SN (Index Shards, each with replicas)
=== OFFLINE PIPELINE ===
+-----------+ +----------+ +-----------+
| Web |---->| Content |---->| Indexer |
| Crawler | | Store | | (MapReduce|
| (BFS + | | (raw | | + increm)|
| frontier)| | pages) | +-----+-----+
+-----------+ +----------+ |
+-----+-----+
| Inverted |
| Index |
| Shards |
+-----------+
14. Interview Tips
- Three subsystems: Crawler, Indexer, Search Serving -- structure your answer this way
- Inverted index is the core: Be ready to explain it with a concrete example
- PageRank is just one signal: Modern search uses hundreds of ranking features
- Sharding strategy: Document partitioning vs term partitioning (document is standard)
- Freshness matters: Explain the dual pipeline (batch + real-time)
- Autocomplete is a separate system: Trie-based, pre-computed, updated asynchronously
- Don't forget caching: Result cache for head queries saves enormous compute
- Politeness in crawling: Rate limiting per domain, robots.txt -- shows maturity
15. Resources
- Alex Xu - System Design Interview Vol. 2, Chapter 2: Design Google Search
- "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Brin & Page, 1998)
- Google "How Search Works" (developers.google.com/search/docs/fundamentals)
- "Web Search for a Planet: The Google Cluster Architecture" (Barroso et al., 2003)
- "Introduction to Information Retrieval" (Manning, Raghavan, Schutze) - free online
- Google Caffeine: Real-time web indexing architecture
- Martin Kleppmann - Designing Data-Intensive Applications, Chapter 3 (Storage and Retrieval)
Previous: 33 - Design YouTube or Netflix | Next: 35 - Design Uber or Lyft