21 - Search Systems
Previous: 20 - Object Storage & File Systems | Next: 22 - Time-Series & Analytics Databases
Why Search Matters in System Design
Search is one of the most common requirements in system design interviews. Almost every product needs it -- e-commerce product search, social media post search, log search, autocomplete. The underlying principles (inverted indexes, relevance scoring, distributed search) appear repeatedly.
The Inverted Index
The core data structure behind full-text search. Instead of mapping documents to words, it maps words to documents.
Forward Index vs Inverted Index
Forward Index (traditional):
Doc1 --> ["the", "quick", "brown", "fox"]
Doc2 --> ["the", "lazy", "brown", "dog"]
Doc3 --> ["quick", "fox", "jumps"]
Inverted Index:
"the" --> [Doc1:pos0, Doc2:pos0]
"quick" --> [Doc1:pos1, Doc3:pos0]
"brown" --> [Doc1:pos2, Doc2:pos2]
"fox" --> [Doc1:pos3, Doc3:pos1]
"lazy" --> [Doc2:pos1]
"dog" --> [Doc2:pos3]
"jumps" --> [Doc3:pos2]
Each term's entry in the inverted index is called a posting list. Each posting in the list typically stores:
- Document ID
- Term frequency (how many times the term appears in this doc)
- Position(s) within the document (for phrase queries)
- Field information (title vs body)
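As a concrete illustration, here is a minimal Python sketch (not production code) that builds positional postings from pre-tokenized documents, reproducing the example above:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list: {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {
    "Doc1": ["the", "quick", "brown", "fox"],
    "Doc2": ["the", "lazy", "brown", "dog"],
    "Doc3": ["quick", "fox", "jumps"],
}
index = build_inverted_index(docs)
# index["quick"] holds postings for Doc1 (position 1) and Doc3 (position 0)
```

A real engine additionally stores term frequencies and field information per posting, and compresses posting lists (for example with delta encoding).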
Building an Inverted Index
Raw Document
|
v
+------------------+
| Tokenizer | "The Quick Brown Fox!" --> ["The", "Quick", "Brown", "Fox"]
+------------------+
|
v
+------------------+
| Lowercasing | "The" --> "the"
+------------------+
|
v
+------------------+
| Stop Words | Remove "the", "a", "is", "and", etc.
+------------------+
|
v
+------------------+
| Stemming | "running" --> "run", "foxes" --> "fox"
+------------------+
|
v
+------------------+
| Synonym Expansion| "quick" --> also index "fast"
+------------------+
|
v
Inverted Index Entry
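The stages above can be chained in a few lines. This is a toy sketch: the stop-word list, synonym table, and suffix-stripping stemmer are simplified stand-ins for real analyzers (Porter/Snowball stemmers, curated synonym sets):

```python
import re

STOP_WORDS = {"the", "a", "is", "and"}
SYNONYMS = {"quick": ["fast"]}

def toy_stem(token):
    # Toy suffix stripping; real systems use Porter/Snowball stemmers.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = re.findall(r"[a-zA-Z0-9]+", text)           # tokenize
    tokens = [t.lower() for t in tokens]                 # lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words
    tokens = [toy_stem(t) for t in tokens]               # stem
    out = []
    for t in tokens:                                     # synonym expansion
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out
```

For example, `analyze("The Quick Brown Fox!")` yields `["quick", "fast", "brown", "fox"]`: "the" is dropped and "fast" is indexed alongside "quick".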
Tokenization Strategies
| Strategy | Input | Output | Use Case |
|---|---|---|---|
| Whitespace | "New York" | ["New", "York"] | Simple, fast |
| Standard | "user@email.com" | ["user", "email.com"] | General text |
| N-gram | "search" | ["sea", "ear", "arc", "rch"] | Fuzzy/partial match |
| Edge N-gram | "search" | ["s", "se", "sea", "sear"] | Autocomplete |
| Language-specific | "laufen" (German) | ["lauf"] | Multilingual |
Stemming vs Lemmatization
| Approach | "running" | "better" | "mice" | Quality | Speed |
|---|---|---|---|---|---|
| Stemming (Porter) | "run" | "better" | "mice" | Approximate | Fast |
| Lemmatization | "run" | "good" | "mouse" | Accurate | Slower |
Stemming is usually sufficient for search. Lemmatization is better for NLP tasks.
Search Pipeline
Every search system follows this general pipeline:
User Query
|
v
+------------------+
| 1. ANALYZE | Tokenize, normalize, stem, remove stop words
+------------------+
|
v
+------------------+
| 2. QUERY PARSE | Boolean logic, phrase detection, filters
+------------------+
|
v
+------------------+
| 3. RETRIEVE | Look up inverted index, get candidate docs
+------------------+
|
v
+------------------+
| 4. RANK/SCORE | TF-IDF, BM25, or learned ranking
+------------------+
|
v
+------------------+
| 5. POST-PROCESS | Highlight snippets, facets, pagination
+------------------+
|
v
Results
Query Parsing Example
Query: "distributed systems" AND author:kleppmann NOT draft
Parsed:
MUST: term("distributed"), term("systems"), phrase match
FILTER: author = "kleppmann"
MUST_NOT: term("draft")
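A sketch of evaluating such a parsed query against a candidate document (phrase matching is omitted for brevity; the document shape here is illustrative):

```python
def evaluate(doc, parsed):
    """doc: {'tokens': set of terms, 'fields': {field: value}}."""
    if not all(t in doc["tokens"] for t in parsed["must"]):
        return False                        # a required term is missing
    if any(t in doc["tokens"] for t in parsed["must_not"]):
        return False                        # an excluded term is present
    for field, value in parsed["filter"].items():
        if doc["fields"].get(field) != value:
            return False                    # structured filter mismatch
    return True

parsed = {
    "must": ["distributed", "systems"],
    "filter": {"author": "kleppmann"},
    "must_not": ["draft"],
}
doc = {"tokens": {"distributed", "systems", "databases"},
       "fields": {"author": "kleppmann"}}
```

Note that FILTER clauses are yes/no conditions that do not contribute to relevance scoring, which is why engines keep them separate from MUST clauses.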
Scoring: TF-IDF and BM25
TF-IDF (Term Frequency - Inverse Document Frequency)
TF-IDF(t, d) = TF(t, d) * IDF(t)
TF(t, d) = count of term t in document d / total terms in d
IDF(t) = log(N / df(t))
where N = total documents, df(t) = documents containing term t
Example:
Query: "distributed"
Doc A (100 words): "distributed" appears 5 times
Doc B (500 words): "distributed" appears 5 times
Total docs: 10,000. Docs with "distributed": 100
TF(A) = 5/100 = 0.05
TF(B) = 5/500 = 0.01
IDF = log10(10000/100) = 2.0
TF-IDF(A) = 0.05 * 2.0 = 0.10
TF-IDF(B) = 0.01 * 2.0 = 0.02
--> Doc A ranks higher (higher term density)
Intuition:
- High TF: The term appears frequently in this document (relevant)
- High IDF: The term is rare across all documents (discriminating)
- "the" has high TF but very low IDF (appears everywhere, contributes nothing)
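The worked example above, reproduced in Python (using a base-10 log, which gives the IDF of 2.0 shown):

```python
import math

def tf_idf(term_count, doc_len, total_docs, docs_with_term):
    tf = term_count / doc_len                      # term density in the document
    idf = math.log10(total_docs / docs_with_term)  # rarity across the corpus
    return tf * idf

score_a = tf_idf(5, 100, 10_000, 100)  # Doc A: 0.05 * 2.0
score_b = tf_idf(5, 500, 10_000, 100)  # Doc B: 0.01 * 2.0
```

Doc A scores 0.10 and Doc B scores 0.02, so Doc A ranks higher despite both containing the term five times.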
BM25 (Best Matching 25)
The industry standard. Elasticsearch uses BM25 by default since v5.0.
BM25(t, d) = IDF(t) * (TF(t,d) * (k1 + 1)) / (TF(t,d) + k1 * (1 - b + b * |d|/avgdl))
Where:
k1 = 1.2 (term frequency saturation parameter)
b = 0.75 (document length normalization)
|d| = document length
avgdl = average document length
Key improvements over TF-IDF:
- TF saturation: after a term appears many times, additional occurrences yield diminishing returns
- Length normalization: longer documents are not unfairly boosted just because they contain more term occurrences
TF saturation visualization:
Score
^
|        BM25 (saturates)
|     .-----------
|    /
|   /     TF-IDF (linear, keeps growing)
|  /     /
| /    /
|/   /
+--------------------> Term Frequency
Scoring Comparison
| Feature | TF-IDF | BM25 |
|---|---|---|
| TF behavior | Linear (unbounded) | Saturates (logarithmic) |
| Length normalization | Basic | Configurable (b parameter) |
| Practical use | Academic baseline | Industry standard |
| Tuning knobs | None | k1 (saturation), b (length norm) |
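A sketch of the per-term BM25 formula above, showing the saturation behavior (IDF is passed in precomputed; k1 and b use the defaults from the text):

```python
def bm25_term(tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    """BM25 contribution of one term to one document's score."""
    length_norm = 1 - b + b * (doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# With idf = 2.0 and an average-length document, the score approaches
# but never exceeds idf * (k1 + 1) = 4.4, however often the term repeats.
scores = [bm25_term(tf, 100, 100, 2.0) for tf in (1, 5, 50)]
```

Going from 1 to 5 occurrences adds much more score than going from 5 to 50, which is exactly the saturation curve drawn above.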
Elasticsearch Architecture
Elasticsearch is the most widely used search engine in production. Understanding its architecture is critical for interviews.
Cluster Topology
+--------------------------------------------------+
| Elasticsearch Cluster |
| |
| +----------+ +----------+ +----------+ |
| | Node 1 | | Node 2 | | Node 3 | |
| | (Master) | | (Data) | | (Data) | |
| | | | | | | |
| | Shard 0P | | Shard 0R | | Shard 1R | |
| | Shard 1P | | Shard 2P | | Shard 2R | |
| +----------+ +----------+ +----------+ |
+--------------------------------------------------+
P = Primary shard R = Replica shard
Core Concepts
| Concept | Description | Analogy |
|---|---|---|
| Index | Collection of documents with similar structure | Database table |
| Document | A JSON object stored and indexed | Table row |
| Shard | A partition of an index (a Lucene index) | Table partition |
| Replica | Copy of a primary shard for HA + read throughput | Read replica |
| Segment | Immutable file within a shard (Lucene segment) | SSTable |
| Node | A single Elasticsearch instance | Server |
| Cluster | Group of nodes working together | Database cluster |
Near-Real-Time Search
Elasticsearch is near-real-time, not real-time:
Write --> In-memory buffer + translog (WAL)
|
Refresh (every 1s default)
|
New immutable Lucene segment (now searchable)
|
Background segment merge (compaction)
|
Larger, fewer segments
Why segments are immutable:
- No locks needed for reads (concurrent search is fast)
- Efficient caching (segments never change)
- Deletes are tombstone markers, cleaned up during merge
Search Relevance Tuning
Boosting Strategies
```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":       { "query": "running shoes", "boost": 3 } } },
        { "match": { "description": { "query": "running shoes", "boost": 1 } } },
        { "match": { "brand":       { "query": "running shoes", "boost": 2 } } }
      ]
    }
  }
}
```
Function Score (Combining Relevance + Business Signals)
Final score = text_relevance * popularity_boost * recency_decay * personalization
Example factors:
- Sales velocity (popular items rank higher)
- Recency (newer content ranks higher)
- User affinity (items matching user preferences)
- Availability (in-stock items boosted)
Common Relevance Problems
| Problem | Symptom | Fix |
|---|---|---|
| Short queries underperform | "shoes" returns irrelevant results | Add field boosting, use synonyms |
| Long documents dominate | Wikipedia articles always rank first | Adjust BM25 b parameter |
| Exact matches buried | "Nike Air Max" ranks partial matches higher | Add exact-match clause with high boost |
| Typos return nothing | "runnign shoes" returns 0 results | Enable fuzzy matching (edit distance) |
| Stale content ranks high | Old articles outrank fresh ones | Add recency decay function |
Autocomplete / Typeahead
Implementation Approaches
| Approach | Latency | Complexity | Best For |
|---|---|---|---|
| Prefix query | Medium | Low | Simple prefix matching |
| Edge n-gram | Fast | Medium | Prefix matching at index time |
| Completion suggester | Fastest | Medium | Dedicated autocomplete |
| Trie (in-memory) | Fastest | High | Custom implementations |
Edge N-Gram Tokenizer
Index-time tokenization that creates prefixes:
Input: "elasticsearch"
Edge n-grams (min=2, max=10):
"el", "ela", "elas", "elast", "elasti", "elastic", "elastics",
"elasticse", "elasticsea"
Query for "elas" matches because "elas" is an indexed token.
No wildcard needed at query time --> fast lookup.
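Generating edge n-grams is straightforward; a sketch matching the min=2, max=10 example above:

```python
def edge_ngrams(token, min_gram=2, max_gram=10):
    """All prefixes of `token` between min_gram and max_gram characters."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

grams = edge_ngrams("elasticsearch")
# 9 tokens, "el" through "elasticsea"; a query for "elas" is an exact token match
```

The trade-off is index size: every token expands into up to (max_gram - min_gram + 1) indexed terms, which is why edge n-grams are applied only to fields that need prefix matching.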
Completion Suggester (Elasticsearch)
Uses an in-memory FST (Finite State Transducer) for extremely fast prefix lookups.
Index mapping:
```json
{
  "mappings": {
    "properties": {
      "suggest": { "type": "completion" }
    }
  }
}
```
Query:
```json
{
  "suggest": {
    "product-suggest": {
      "prefix": "elast",
      "completion": { "field": "suggest", "size": 5 }
    }
  }
}
```
Autocomplete Architecture
User keystroke ("run")
|
[Debounce: 100-300ms]
|
[Autocomplete Service]
|
+----+----+
| |
[Trie/ [Analytics]
Cache] (what do users click
after this prefix?)
|
Return top 5-10 suggestions
ranked by: popularity, recency, personalization
Design considerations:
- Debounce requests (do not query on every keystroke)
- Cache popular prefixes aggressively
- Personalize suggestions based on user history
- Update suggestion rankings from search + click analytics logs
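A minimal in-memory trie for typeahead, caching the top-k completions at every node so a lookup is just a walk down the prefix (the scores stand in for the popularity/recency/personalization signals above):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.completions = []   # top-k (score, phrase) cached at this node

class AutocompleteTrie:
    def __init__(self, k=5):
        self.root = TrieNode()
        self.k = k

    def insert(self, phrase, score):
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
            node.completions.append((score, phrase))
            node.completions.sort(reverse=True)   # highest score first
            del node.completions[self.k:]         # keep only top-k

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [phrase for _, phrase in node.completions]

trie = AutocompleteTrie(k=5)
trie.insert("running shoes", 90)   # score = e.g. click popularity
trie.insert("running socks", 50)
trie.insert("rain jacket", 70)
```

`trie.suggest("run")` returns `["running shoes", "running socks"]` without scanning any documents. Caching completions at each node trades memory for O(prefix-length) lookups, which is the same idea the FST-backed completion suggester exploits.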
Faceted Search
Faceted search lets users filter and drill down into results by categories.
Search: "laptop"
Results: 15,234 laptops
Facets (aggregations):
Brand: Apple (3,421) Dell (2,890) Lenovo (2,100) ...
Price: $0-500 (4,200) $500-1000 (6,300) $1000+ (4,734)
RAM: 8GB (5,100) 16GB (7,200) 32GB (2,934)
Rating: 4+ stars (10,200) 3+ stars (13,800)
Elasticsearch implementation:
```json
{
  "query": { "match": { "name": "laptop" } },
  "aggs": {
    "brands": {
      "terms": { "field": "brand.keyword", "size": 10 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 500 },
          { "from": 500, "to": 1000 },
          { "from": 1000 }
        ]
      }
    }
  }
}
```
Performance note: Aggregations are expensive. Use doc_values (columnar storage) for facet fields. Pre-compute facets for high-traffic queries.
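Conceptually, a terms facet is just a value count over the matched documents. A toy sketch (real engines compute this from columnar doc_values rather than materializing each document):

```python
from collections import Counter

def compute_facets(results, fields):
    """Count facet values across the matched documents."""
    return {f: Counter(doc[f] for doc in results if f in doc) for f in fields}

results = [
    {"brand": "Apple", "price_band": "$1000+"},
    {"brand": "Dell",  "price_band": "$500-1000"},
    {"brand": "Dell",  "price_band": "$500-1000"},
]
facets = compute_facets(results, ["brand", "price_band"])
```

This also shows why facets are expensive: they touch every matching document, not just the top page of results.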
Scaling Search
Index Sharding
By hash (most common):
shard_number = hash(doc_id) % num_shards
Pros: Even distribution
Cons: Cannot add shards without reindexing (shard count fixed at index creation)
By time (for logs/events):
Index per time period:
logs-2026-04-01
logs-2026-04-02
logs-2026-04-03
Query: search across all indices matching "logs-2026-04-*"
Delete old data: just drop the old index (instant, no tombstones)
Scatter-Gather Query Pattern
Coordinator Node
|
+-- query --> Shard 0 --> top 10 results
+-- query --> Shard 1 --> top 10 results
+-- query --> Shard 2 --> top 10 results
|
<-- merge + re-rank top 30 --> return top 10
Two-phase process:
- Query phase: Each shard returns doc IDs + scores
- Fetch phase: Coordinator fetches full documents from relevant shards
Deep pagination problem: Page 1000 requires each shard to return 10,000 results. Use search_after or scroll API instead of from + size.
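The query-phase merge at the coordinator is a top-k merge of per-shard results. A sketch, using (score, doc_id) pairs:

```python
import heapq
from itertools import chain

def merge_top_k(shard_results, k=10):
    """Merge each shard's local top results into the global top-k by score."""
    return heapq.nlargest(k, chain.from_iterable(shard_results))

shards = [
    [(0.9, "a"), (0.5, "b")],   # shard 0's local top results
    [(0.8, "c")],               # shard 1
    [(0.7, "d"), (0.6, "e")],   # shard 2
]
top = merge_top_k(shards, k=3)
```

The same picture explains the deep-pagination cost: to serve page p with `from + size`, every shard must hand the coordinator its local top (from + size) candidates before this merge can run.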
Routing
Skip the scatter step by routing to a specific shard:
# Index with routing
PUT /users/_doc/123?routing=user_456
# Query with routing (hits only one shard)
GET /users/_search?routing=user_456
Useful when data naturally partitions by tenant, user, or region.
Shard Sizing Guidelines
| Guideline | Recommendation |
|---|---|
| Shard size | 10-50 GB per shard |
| Shards per node | < 20 per GB of heap |
| Oversharding risk | Each shard has ~10 MB heap overhead |
| Too few shards | Cannot scale reads, creates hot spots |
Real-Time vs Batch Indexing
| Approach | Latency | Throughput | Use Case |
|---|---|---|---|
| Real-time (sync) | Immediate | Lower | E-commerce product updates |
| Near real-time | 1-5 seconds | Good | Social media posts, news |
| Batch | Minutes to hours | Highest | Analytics, full reindex |
Near Real-Time Pipeline
Data Source --> Kafka --> Indexing Consumer --> Elasticsearch
|
[Buffer + batch to ES bulk API]
[Typical: batch of 1000 docs every 5s]
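The buffer-and-batch step can be sketched as below, assuming a `flush_fn` callback (hypothetical) that would send the batch to the bulk API:

```python
import time

class BulkBuffer:
    """Buffer documents and flush in batches, by count or by elapsed time."""
    def __init__(self, flush_fn, max_docs=1000, max_secs=5.0):
        self.flush_fn = flush_fn
        self.max_docs = max_docs
        self.max_secs = max_secs
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, doc):
        self.buffer.append(doc)
        elapsed = time.monotonic() - self.last_flush
        if len(self.buffer) >= self.max_docs or elapsed >= self.max_secs:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)   # e.g. one bulk request per batch
            self.buffer = []
        self.last_flush = time.monotonic()
```

Batching amortizes per-request overhead; the time-based trigger bounds indexing latency when traffic is low.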
Hybrid approach (common in practice):
- Batch index the full dataset nightly
- Real-time index incremental updates during the day
- Periodic full reindex to correct drift
Interview Tips
- Start with the inverted index. If asked "how does search work," begin here. It shows you understand fundamentals, not just Elasticsearch APIs.
- Know BM25. Explain TF saturation and length normalization in one sentence: "BM25 prevents long documents and keyword-stuffed content from dominating results."
- Scatter-gather is the distributed search pattern. Draw the two-phase query + fetch process.
- Autocomplete is its own sub-system. Do not bolt it onto full-text search; explain why it needs a separate data structure (trie, completion suggester, edge n-gram).
- Discuss the write path. Immutable segments, refresh interval, and the trade-off between indexing speed and search freshness.
- Relevance is hard. Pure text relevance is rarely enough; business signals (popularity, recency, personalization) always matter in production.
Resources
- DDIA (Kleppmann) -- Chapter 3: Storage and Retrieval (inverted indexes, full-text search)
- System Design Interview Vol. 2 (Alex Xu) -- Chapter on search autocomplete
- Elasticsearch: The Definitive Guide (Clinton Gormley, Zachary Tong)
- Lucene in Action (McCandless, Hatcher, Gospodnetić)
- "Introduction to Information Retrieval" (Manning, Raghavan, Schütze) -- free online textbook
- Elasticsearch Documentation: BM25 similarity, analysis, aggregations
- Robertson & Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond" (2009)
- Doug Turnbull, "Relevant Search" (Manning Publications)