21 - Search Systems

Previous: 20 - Object Storage & File Systems | Next: 22 - Time-Series & Analytics Databases


Why Search Matters in System Design

Search is one of the most common requirements in system design interviews. Almost every product needs it -- e-commerce product search, social media post search, log search, autocomplete. The underlying principles (inverted indexes, relevance scoring, distributed search) appear repeatedly.


The Inverted Index

The core data structure behind full-text search. Instead of mapping documents to words, it maps words to documents.

Forward Index vs Inverted Index

Forward Index (traditional):
  Doc1 --> ["the", "quick", "brown", "fox"]
  Doc2 --> ["the", "lazy", "brown", "dog"]
  Doc3 --> ["quick", "fox", "jumps"]

Inverted Index:
  "the"    --> [Doc1:pos0, Doc2:pos0]
  "quick"  --> [Doc1:pos1, Doc3:pos0]
  "brown"  --> [Doc1:pos2, Doc2:pos2]
  "fox"    --> [Doc1:pos3, Doc3:pos1]
  "lazy"   --> [Doc2:pos1]
  "dog"    --> [Doc2:pos3]
  "jumps"  --> [Doc3:pos2]

Each entry in the inverted index is called a posting list. Each posting typically stores:

  • Document ID
  • Term frequency (how many times the term appears in this doc)
  • Position(s) within the document (for phrase queries)
  • Field information (title vs body)
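
The posting structure above can be sketched in a few lines of Python — a minimal positional inverted index using plain whitespace tokenization (the full analysis chain is covered next; field information and term frequencies are omitted for brevity):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

docs = {
    "Doc1": "the quick brown fox",
    "Doc2": "the lazy brown dog",
    "Doc3": "quick fox jumps",
}
index = build_inverted_index(docs)
# index["quick"] --> [("Doc1", 1), ("Doc3", 0)], matching the diagram above
```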

Building an Inverted Index

Raw Document
     |
     v
  +-------------------+
  |     Tokenizer     |  "The Quick Brown Fox!" --> ["The", "Quick", "Brown", "Fox"]
  +-------------------+
           |
           v
  +-------------------+
  |    Lowercasing    |  "The" --> "the"
  +-------------------+
           |
           v
  +-------------------+
  |     Stop Words    |  Remove "the", "a", "is", "and", etc.
  +-------------------+
           |
           v
  +-------------------+
  |      Stemming     |  "running" --> "run", "foxes" --> "fox"
  +-------------------+
           |
           v
  +-------------------+
  | Synonym Expansion |  "quick" --> also index "fast"
  +-------------------+
           |
           v
  Inverted Index Entry
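
The chain above can be sketched as a sequence of list transformations. The stop-word list and suffix-stripping stemmer here are toy stand-ins (real systems use a proper analyzer and a stemmer such as Porter):

```python
STOP_WORDS = {"the", "a", "is", "and"}

def naive_stem(term):
    # Toy suffix stripping; production systems use a real stemmer (e.g. Porter)
    for suffix in ("ning", "ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def analyze(text):
    tokens = text.replace("!", "").replace(".", "").split()  # tokenize
    tokens = [t.lower() for t in tokens]                     # lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]      # stop words
    return [naive_stem(t) for t in tokens]                   # stem

analyze("The Quick Brown Fox!")  # ["quick", "brown", "fox"]
```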

Tokenization Strategies

| Strategy          | Input             | Output                       | Use Case            |
|-------------------|-------------------|------------------------------|---------------------|
| Whitespace        | "New York"        | ["New", "York"]              | Simple, fast        |
| Standard          | "user@email.com"  | ["user", "email.com"]        | General text        |
| N-gram            | "search"          | ["sea", "ear", "arc", "rch"] | Fuzzy/partial match |
| Edge n-gram       | "search"          | ["s", "se", "sea", "sear"]   | Autocomplete        |
| Language-specific | "laufen" (German) | ["lauf"]                     | Multilingual        |

Stemming vs Lemmatization

| Approach          | "running" | "better" | "mice"  | Quality     | Speed  |
|-------------------|-----------|----------|---------|-------------|--------|
| Stemming (Porter) | "run"     | "better" | "mice"  | Approximate | Fast   |
| Lemmatization     | "run"     | "good"   | "mouse" | Accurate    | Slower |

Stemming is usually sufficient for search. Lemmatization is better for NLP tasks.


Search Pipeline

Every search system follows this general pipeline:

User Query
    |
    v
+------------------+
|  1. ANALYZE      |  Tokenize, normalize, stem, remove stop words
+------------------+
    |
    v
+------------------+
|  2. QUERY PARSE  |  Boolean logic, phrase detection, filters
+------------------+
    |
    v
+------------------+
|  3. RETRIEVE     |  Look up inverted index, get candidate docs
+------------------+
    |
    v
+------------------+
|  4. RANK/SCORE   |  TF-IDF, BM25, or learned ranking
+------------------+
    |
    v
+------------------+
|  5. POST-PROCESS |  Highlight snippets, facets, pagination
+------------------+
    |
    v
  Results

Query Parsing Example

Query: "distributed systems" AND author:kleppmann NOT draft

Parsed:
  MUST:     term("distributed"), term("systems"), phrase match
  FILTER:   author = "kleppmann"
  MUST_NOT: term("draft")
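
Once parsed, the clauses map naturally onto set operations over posting lists. A toy sketch with made-up document IDs, ignoring positional phrase checks and scoring:

```python
# Hypothetical sets of matching doc IDs per term / per filter value
postings = {
    "distributed": {1, 2, 5},
    "systems": {1, 5, 7},
    "draft": {5},
}
author_index = {"kleppmann": {1, 5, 9}}

must = postings["distributed"] & postings["systems"]  # AND (intersection)
filtered = must & author_index["kleppmann"]           # FILTER (no score impact)
result = filtered - postings["draft"]                 # MUST_NOT (difference)
# result == {1}
```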

Scoring: TF-IDF and BM25

TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF(t, d) = TF(t, d) * IDF(t)

TF(t, d) = count of term t in document d / total terms in d
IDF(t)   = log(N / df(t))
  where N = total documents, df(t) = documents containing term t

Example:
  Query: "distributed"
  Doc A (100 words): "distributed" appears 5 times
  Doc B (500 words): "distributed" appears 5 times
  Total docs: 10,000. Docs with "distributed": 100

  TF(A) = 5/100 = 0.05
  TF(B) = 5/500 = 0.01
  IDF    = log10(10000/100) = 2.0   (base-10 log used in this example)

  TF-IDF(A) = 0.05 * 2.0 = 0.10
  TF-IDF(B) = 0.01 * 2.0 = 0.02
  --> Doc A ranks higher (higher term density)
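
The arithmetic above, checked in code (base-10 log, as in the worked example):

```python
import math

def tf_idf(term_count, doc_len, total_docs, docs_with_term):
    tf = term_count / doc_len
    idf = math.log10(total_docs / docs_with_term)
    return tf * idf

score_a = tf_idf(5, 100, 10_000, 100)  # 0.05 * 2.0 = 0.10
score_b = tf_idf(5, 500, 10_000, 100)  # 0.01 * 2.0 = 0.02
# Doc A wins on term density despite identical raw counts
```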

Intuition:

  • High TF: The term appears frequently in this document (relevant)
  • High IDF: The term is rare across all documents (discriminating)
  • "the" has high TF but very low IDF (appears everywhere, contributes nothing)

BM25 (Best Matching 25)

The industry standard. Elasticsearch uses BM25 by default since v5.0.

BM25(t, d) = IDF(t) * (TF(t,d) * (k1 + 1)) / (TF(t,d) + k1 * (1 - b + b * |d|/avgdl))

Where:
  k1 = 1.2  (term frequency saturation parameter)
  b  = 0.75 (document length normalization)
  |d| = document length
  avgdl = average document length
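
A direct transcription of the formula, single term against a single document. The IDF here is the plain log(N/df) from the TF-IDF section; Lucene's BM25 uses a smoothed variant, so treat this as a sketch of the shape, not Elasticsearch's exact scores:

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 contribution of one term to one document's score."""
    idf = math.log(n_docs / df)
    length_norm = 1 - b + b * (doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Saturation: 10x the occurrences yields far less than 10x the score
low  = bm25_term(tf=1,  df=100, n_docs=10_000, doc_len=200, avgdl=200)
high = bm25_term(tf=10, df=100, n_docs=10_000, doc_len=200, avgdl=200)
# high < 10 * low --> diminishing returns on repeated terms
```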

Key improvements over TF-IDF:

  • TF saturation: After a term appears many times, additional occurrences contribute diminishing returns
  • Length normalization: Longer documents do not get unfairly boosted just because they have more term occurrences

TF saturation visualization:

Score
  ^
  |          BM25 (saturates)
  |        .-------
  |      /
  |    /      TF-IDF (linear, keeps growing)
  |   /     /
  |  /    /
  | /   /
  |/  /
  +/--------------------> Term Frequency

Scoring Comparison

| Feature              | TF-IDF             | BM25                             |
|----------------------|--------------------|----------------------------------|
| TF behavior          | Linear (unbounded) | Saturates (logarithmic)          |
| Length normalization | Basic              | Configurable (b parameter)       |
| Practical use        | Academic baseline  | Industry standard                |
| Tuning knobs         | None               | k1 (saturation), b (length norm) |

Elasticsearch Architecture

Elasticsearch is the most widely used search engine in production. Understanding its architecture is critical for interviews.

Cluster Topology

+----------------------------------------------------+
|                Elasticsearch Cluster               |
|                                                    |
|  +----------+  +----------+  +----------+          |
|  |  Node 1  |  |  Node 2  |  |  Node 3  |          |
|  | (Master) |  | (Data)   |  | (Data)   |          |
|  |          |  |          |  |          |          |
|  | Shard 0P |  | Shard 0R |  | Shard 1R |          |
|  | Shard 1P |  | Shard 2P |  | Shard 2R |          |
|  +----------+  +----------+  +----------+          |
+----------------------------------------------------+

P = Primary shard    R = Replica shard

Core Concepts

| Concept  | Description                                      | Analogy          |
|----------|--------------------------------------------------|------------------|
| Index    | Collection of documents with similar structure   | Database table   |
| Document | A JSON object stored and indexed                 | Table row        |
| Shard    | A partition of an index (a Lucene index)         | Table partition  |
| Replica  | Copy of a primary shard for HA + read throughput | Read replica     |
| Segment  | Immutable file within a shard (Lucene segment)   | SSTable          |
| Node     | A single Elasticsearch instance                  | Server           |
| Cluster  | Group of nodes working together                  | Database cluster |

Near-Real-Time Search

Elasticsearch is near-real-time, not real-time:

Write --> In-memory buffer + translog (WAL)
              |
         Refresh (every 1s default)
              |
         New immutable Lucene segment (now searchable)
              |
         Background segment merge (compaction)
              |
         Larger, fewer segments

Why segments are immutable:

  • No locks needed for reads (concurrent search is fast)
  • Efficient caching (segments never change)
  • Deletes are tombstone markers, cleaned up during merge

Search Relevance Tuning

Boosting Strategies

```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":       { "query": "running shoes", "boost": 3 } } },
        { "match": { "description": { "query": "running shoes", "boost": 1 } } },
        { "match": { "brand":       { "query": "running shoes", "boost": 2 } } }
      ]
    }
  }
}
```

Function Score (Combining Relevance + Business Signals)

Final score = text_relevance * popularity_boost * recency_decay * personalization

Example factors:
  - Sales velocity (popular items rank higher)
  - Recency (newer content ranks higher)
  - User affinity (items matching user preferences)
  - Availability (in-stock items boosted)
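
A sketch of such a blend in Python. The weights, half-life, and dampening choices are illustrative assumptions; production systems tune these offline against click and conversion data:

```python
import math

def final_score(text_relevance, sales_velocity, age_days, user_affinity,
                in_stock, half_life_days=30.0):
    """Multiplicative blend of text relevance and business signals."""
    popularity = 1 + math.log1p(sales_velocity)   # dampen raw sales counts
    recency = 0.5 ** (age_days / half_life_days)  # exponential time decay
    availability = 1.0 if in_stock else 0.1       # demote out-of-stock items
    return text_relevance * popularity * recency * user_affinity * availability

fresh = final_score(10.0, sales_velocity=50, age_days=1,  user_affinity=1.0, in_stock=True)
stale = final_score(10.0, sales_velocity=50, age_days=90, user_affinity=1.0, in_stock=True)
# fresh > stale: identical text relevance, but recency decay demotes the old item
```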

Common Relevance Problems

| Problem                     | Symptom                                      | Fix                                    |
|-----------------------------|----------------------------------------------|----------------------------------------|
| Short queries underperform  | "shoes" returns irrelevant results           | Add field boosting, use synonyms       |
| Long documents dominate     | Wikipedia articles always rank first         | Adjust BM25 b parameter                |
| Exact matches buried        | "Nike Air Max" ranks partial matches higher  | Add exact-match clause with high boost |
| Typos return nothing        | "runnign shoes" returns 0 results            | Enable fuzzy matching (edit distance)  |
| Stale content ranks high    | Old articles outrank fresh ones              | Add recency decay function             |

Autocomplete / Typeahead

Implementation Approaches

| Approach             | Latency | Complexity | Best For                      |
|----------------------|---------|------------|-------------------------------|
| Prefix query         | Medium  | Low        | Simple prefix matching        |
| Edge n-gram          | Fast    | Medium     | Prefix matching at index time |
| Completion suggester | Fastest | Medium     | Dedicated autocomplete        |
| Trie (in-memory)     | Fastest | High       | Custom implementations        |

Edge N-Gram Tokenizer

Index-time tokenization that creates prefixes:

Input: "elasticsearch"

Edge n-grams (min=2, max=10):
  "el", "ela", "elas", "elast", "elasti", "elastic", "elastics",
  "elasticse", "elasticsea"

Query for "elas" matches because "elas" is an indexed token.
No wildcard needed at query time --> fast lookup.
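
The token generation is trivial to reproduce; a sketch matching the example above:

```python
def edge_ngrams(term, min_gram=2, max_gram=10):
    """Generate the prefixes an edge n-gram tokenizer would index."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

tokens = edge_ngrams("elasticsearch")
# ["el", "ela", ..., "elasticsea"] -- "elas" is among them, so the
# prefix query "elas" becomes an exact token lookup at query time
```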

Completion Suggester (Elasticsearch)

Uses an in-memory FST (Finite State Transducer) for extremely fast prefix lookups.

```json
// Index mapping
{
  "mappings": {
    "properties": {
      "suggest": { "type": "completion" }
    }
  }
}

// Query
{
  "suggest": {
    "product-suggest": {
      "prefix": "elast",
      "completion": { "field": "suggest", "size": 5 }
    }
  }
}
```

Autocomplete Architecture

  User keystroke ("run")
       |
  [Debounce: 100-300ms]
       |
  [Autocomplete Service]
       |
  +----+----+
  |         |
  [Trie/   [Analytics]
   Cache]   (what do users click
             after this prefix?)
       |
  Return top 5-10 suggestions
  ranked by: popularity, recency, personalization

Design considerations:

  • Debounce requests (do not query on every keystroke)
  • Cache popular prefixes aggressively
  • Personalize suggestions based on user history
  • Update suggestion rankings from search + click analytics logs

Faceted Search

Faceted search lets users filter and drill down into results by categories.

Search: "laptop"

Results: 15,234 laptops

Facets (aggregations):
  Brand:    Apple (3,421)  Dell (2,890)  Lenovo (2,100)  ...
  Price:    $0-500 (4,200) $500-1000 (6,300) $1000+ (4,734)
  RAM:      8GB (5,100)    16GB (7,200)   32GB (2,934)
  Rating:   4+ stars (10,200)  3+ stars (13,800)

Elasticsearch implementation:

```json
{
  "query": { "match": { "name": "laptop" } },
  "aggs": {
    "brands": {
      "terms": { "field": "brand.keyword", "size": 10 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 500 },
          { "from": 500, "to": 1000 },
          { "from": 1000 }
        ]
      }
    }
  }
}
```

Performance note: Aggregations are expensive. Use doc_values (columnar storage) for facet fields. Pre-compute facets for high-traffic queries.
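
Conceptually, a terms or range aggregation is just bucket counting over the candidate set. A toy illustration with made-up documents and price buckets:

```python
from collections import Counter

# Hypothetical candidate documents after the "laptop" text query
hits = [
    {"brand": "Apple",  "price": 1999},
    {"brand": "Dell",   "price": 799},
    {"brand": "Apple",  "price": 1299},
    {"brand": "Lenovo", "price": 450},
]

brand_facet = Counter(doc["brand"] for doc in hits)  # terms aggregation

def price_bucket(price):
    if price < 500:
        return "$0-500"
    if price < 1000:
        return "$500-1000"
    return "$1000+"

price_facet = Counter(price_bucket(doc["price"]) for doc in hits)  # range agg
# brand_facet: Apple 2, Dell 1, Lenovo 1
```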


Scaling Search

Index Sharding

By hash (most common):

shard_number = hash(doc_id) % num_shards

Pros: Even distribution
Cons: Cannot add shards without reindexing (shard count fixed at index creation)
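
The reindexing constraint is easy to demonstrate. Here `hash()` stands in for a stable hash such as murmur3 (which Elasticsearch uses for routing); small integers hash to themselves in CPython, keeping the demo deterministic:

```python
def shard_for(doc_key, num_shards):
    # hash() as a stand-in for a stable document-ID hash
    return hash(doc_key) % num_shards

before = {doc: shard_for(doc, 4) for doc in range(8)}
after  = {doc: shard_for(doc, 5) for doc in range(8)}
moved = sum(before[doc] != after[doc] for doc in before)
# moved == 4: half the documents land on a different shard after changing
# the shard count, which is why adding shards requires a full reindex
```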

By time (for logs/events):

Index per time period:
  logs-2026-04-01
  logs-2026-04-02
  logs-2026-04-03

Query: search across all indices matching "logs-2026-04-*"
Delete old data: just drop the old index (instant, no tombstones)

Scatter-Gather Query Pattern

Coordinator Node
  |
  +-- query --> Shard 0  --> top 10 results
  +-- query --> Shard 1  --> top 10 results
  +-- query --> Shard 2  --> top 10 results
  |
  <-- merge + re-rank top 30 --> return top 10

Two-phase process:

  1. Query phase: Each shard returns doc IDs + scores
  2. Fetch phase: Coordinator fetches full documents from relevant shards

Deep pagination problem: Page 1000 requires each shard to return 10,000 results. Use search_after or scroll API instead of from + size.
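
The coordinator's merge step in the query phase can be sketched with a heap (per-shard results are hypothetical; the fetch phase that retrieves full documents is omitted):

```python
import heapq

# Per-shard query-phase results: (score, doc_id), sorted by descending score
shard_results = [
    [(9.1, "a1"), (7.4, "a2"), (3.3, "a3")],  # shard 0
    [(8.8, "b1"), (8.2, "b2"), (1.0, "b3")],  # shard 1
    [(9.5, "c1"), (2.2, "c2")],               # shard 2
]

def merge_top_k(shard_results, k):
    """Coordinator-side merge: re-rank the union of per-shard top-k lists."""
    return heapq.nlargest(k, (hit for shard in shard_results for hit in shard))

top3 = merge_top_k(shard_results, 3)
# [(9.5, "c1"), (9.1, "a1"), (8.8, "b1")]
```

Note that for page p with page size k, each shard must return its top p*k candidates before the merge, which is exactly the deep-pagination cost described above.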

Routing

Skip the scatter step by routing to a specific shard:

# Index with routing
PUT /users/_doc/123?routing=user_456

# Query with routing (hits only one shard)
GET /users/_search?routing=user_456

Useful when data naturally partitions by tenant, user, or region.

Shard Sizing Guidelines

| Guideline         | Recommendation                             |
|-------------------|--------------------------------------------|
| Shard size        | 10-50 GB per shard                         |
| Shards per node   | Fewer than 20 per GB of JVM heap           |
| Oversharding risk | Each shard carries ~10 MB of heap overhead |
| Too few shards    | Cannot scale reads; creates hot spots      |

Real-Time vs Batch Indexing

| Approach         | Latency          | Throughput | Use Case                   |
|------------------|------------------|------------|----------------------------|
| Real-time (sync) | Immediate        | Lower      | E-commerce product updates |
| Near real-time   | 1-5 seconds      | Good       | Social media posts, news   |
| Batch            | Minutes to hours | Highest    | Analytics, full reindex    |

Near Real-Time Pipeline

  Data Source --> Kafka --> Indexing Consumer --> Elasticsearch
                   |
              [Buffer + batch to ES bulk API]
              [Typical: batch of 1000 docs every 5s]
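
The buffering consumer can be sketched as a class that flushes either on size or on elapsed time. The batch size and interval mirror the illustrative numbers above; the flush callback stands in for one bulk-index request:

```python
import time

class BulkBuffer:
    """Buffer documents and flush in batches, by count or by elapsed time."""
    def __init__(self, flush_fn, max_docs=1000, max_seconds=5.0):
        self.flush_fn = flush_fn
        self.max_docs = max_docs
        self.max_seconds = max_seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, doc):
        self.buffer.append(doc)
        if (len(self.buffer) >= self.max_docs
                or time.monotonic() - self.last_flush >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # e.g. one bulk API request
            self.buffer = []
        self.last_flush = time.monotonic()

batches = []
buf = BulkBuffer(batches.append, max_docs=3, max_seconds=60.0)
for i in range(7):
    buf.add({"id": i})
buf.flush()  # drain the tail on shutdown
# batches now holds three bulk requests of 3, 3, and 1 documents
```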

Hybrid approach (common in practice):

  • Batch index the full dataset nightly
  • Real-time index incremental updates during the day
  • Periodic full reindex to correct drift

Interview Tips

  1. Start with the inverted index. If asked "how does search work," begin here. It shows you understand fundamentals, not just Elasticsearch APIs.
  2. Know BM25. Explain TF saturation and length normalization in one sentence: "BM25 prevents long documents and keyword-stuffed content from dominating results."
  3. Scatter-gather is the distributed search pattern. Draw the two-phase query + fetch process.
  4. Autocomplete is its own sub-system. Do not bolt it onto full-text search; explain why it needs a separate data structure (trie, completion suggester, edge n-gram).
  5. Discuss the write path. Immutable segments, refresh interval, and the trade-off between indexing speed and search freshness.
  6. Relevance is hard. Pure text relevance is rarely enough; business signals (popularity, recency, personalization) always matter in production.

Resources

  • DDIA (Kleppmann) -- Chapter 3: Storage and Retrieval (inverted indexes, full-text search)
  • System Design Interview Vol. 2 (Alex Xu) -- Chapter on search autocomplete
  • Elasticsearch: The Definitive Guide (Clinton Gormley, Zachary Tong)
  • Lucene in Action (McCandless, Hatcher, Gospodnetic)
  • "Introduction to Information Retrieval" (Manning, Raghavan, Schutze) -- free online textbook
  • Elasticsearch Documentation: BM25 similarity, analysis, aggregations
  • Robertson & Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond" (2009)
  • Doug Turnbull, "Relevant Search" (Manning Publications)

Previous: 20 - Object Storage & File Systems | Next: 22 - Time-Series & Analytics Databases