21 - Search Systems

Previous: 20 - Object Storage & File Systems | Next: 22 - Time-Series & Analytics Databases


Why Search Matters in System Design

Search is one of the most common requirements in system design interviews. Almost every product needs it -- e-commerce product search, social media post search, log search, autocomplete. The underlying principles (inverted indexes, relevance scoring, distributed search) appear repeatedly.


The Inverted Index

The core data structure behind full-text search. Instead of mapping documents to words, it maps words to documents.

Forward Index vs Inverted Index

Forward Index (traditional):
  Doc1 --> ["the", "quick", "brown", "fox"]
  Doc2 --> ["the", "lazy", "brown", "dog"]
  Doc3 --> ["quick", "fox", "jumps"]

Inverted Index:
  "the"    --> [Doc1:pos0, Doc2:pos0]
  "quick"  --> [Doc1:pos1, Doc3:pos0]
  "brown"  --> [Doc1:pos2, Doc2:pos2]
  "fox"    --> [Doc1:pos3, Doc3:pos1]
  "lazy"   --> [Doc2:pos1]
  "dog"    --> [Doc2:pos3]
  "jumps"  --> [Doc3:pos2]

Each entry in the inverted index is called a posting list. Each posting typically stores:

  • Document ID
  • Term frequency (how many times the term appears in this doc)
  • Position(s) within the document (for phrase queries)
  • Field information (title vs body)
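
The posting structure above can be sketched in a few lines of Python — a minimal positional inverted index using plain whitespace tokenization (the full analysis chain is covered next; field information and term frequencies are omitted for brevity):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

docs = {
    "Doc1": "the quick brown fox",
    "Doc2": "the lazy brown dog",
    "Doc3": "quick fox jumps",
}
index = build_inverted_index(docs)
# index["quick"] --> [("Doc1", 1), ("Doc3", 0)], matching the diagram above
```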

Building an Inverted Index

Raw Document
     |
     v
  +-------------------+
  |     Tokenizer     |  "The Quick Brown Fox!" --> ["The", "Quick", "Brown", "Fox"]
  +-------------------+
           |
           v
  +-------------------+
  |    Lowercasing    |  "The" --> "the"
  +-------------------+
           |
           v
  +-------------------+
  |     Stop Words    |  Remove "the", "a", "is", "and", etc.
  +-------------------+
           |
           v
  +-------------------+
  |      Stemming     |  "running" --> "run", "foxes" --> "fox"
  +-------------------+
           |
           v
  +-------------------+
  | Synonym Expansion |  "quick" --> also index "fast"
  +-------------------+
           |
           v
  Inverted Index Entry
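
The chain above can be sketched as a sequence of list transformations. The stop-word list and suffix-stripping stemmer here are toy stand-ins (real systems use a proper analyzer and a stemmer such as Porter):

```python
STOP_WORDS = {"the", "a", "is", "and"}

def naive_stem(term):
    # Toy suffix stripping; production systems use a real stemmer (e.g. Porter)
    for suffix in ("ning", "ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def analyze(text):
    tokens = text.replace("!", "").replace(".", "").split()  # tokenize
    tokens = [t.lower() for t in tokens]                     # lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]      # stop words
    return [naive_stem(t) for t in tokens]                   # stem

analyze("The Quick Brown Fox!")  # ["quick", "brown", "fox"]
```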

Tokenization Strategies

| Strategy          | Input             | Output                       | Use Case            |
|-------------------|-------------------|------------------------------|---------------------|
| Whitespace        | "New York"        | ["New", "York"]              | Simple, fast        |
| Standard          | "user@email.com"  | ["user", "email.com"]        | General text        |
| N-gram            | "search"          | ["sea", "ear", "arc", "rch"] | Fuzzy/partial match |
| Edge n-gram       | "search"          | ["s", "se", "sea", "sear"]   | Autocomplete        |
| Language-specific | "laufen" (German) | ["lauf"]                     | Multilingual        |

Stemming vs Lemmatization

| Approach          | "running" | "better" | "mice"  | Quality     | Speed  |
|-------------------|-----------|----------|---------|-------------|--------|
| Stemming (Porter) | "run"     | "better" | "mice"  | Approximate | Fast   |
| Lemmatization     | "run"     | "good"   | "mouse" | Accurate    | Slower |

Stemming is usually sufficient for search. Lemmatization is better for NLP tasks.


Search Pipeline

Every search system follows this general pipeline:

User Query
    |
    v
+------------------+
|  1. ANALYZE      |  Tokenize, normalize, stem, remove stop words
+------------------+
    |
    v
+------------------+
|  2. QUERY PARSE  |  Boolean logic, phrase detection, filters
+------------------+
    |
    v
+------------------+
|  3. RETRIEVE     |  Look up inverted index, get candidate docs
+------------------+
    |
    v
+------------------+
|  4. RANK/SCORE   |  TF-IDF, BM25, or learned ranking
+------------------+
    |
    v
+------------------+
|  5. POST-PROCESS |  Highlight snippets, facets, pagination
+------------------+
    |
    v
  Results

Query Parsing Example

Query: "distributed systems" AND author:kleppmann NOT draft

Parsed:
  MUST:     term("distributed"), term("systems"), phrase match
  FILTER:   author = "kleppmann"
  MUST_NOT: term("draft")
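
Once parsed, the clauses map naturally onto set operations over posting lists. A toy sketch with made-up document IDs, ignoring positional phrase checks and scoring:

```python
# Hypothetical sets of matching doc IDs per term / per filter value
postings = {
    "distributed": {1, 2, 5},
    "systems": {1, 5, 7},
    "draft": {5},
}
author_index = {"kleppmann": {1, 5, 9}}

must = postings["distributed"] & postings["systems"]  # AND (intersection)
filtered = must & author_index["kleppmann"]           # FILTER (no score impact)
result = filtered - postings["draft"]                 # MUST_NOT (difference)
# result == {1}
```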

Scoring: TF-IDF and BM25

TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF(t, d) = TF(t, d) * IDF(t)

TF(t, d) = count of term t in document d / total terms in d
IDF(t)   = log(N / df(t))
  where N = total documents, df(t) = documents containing term t

Example:
  Query: "distributed"
  Doc A (100 words): "distributed" appears 5 times
  Doc B (500 words): "distributed" appears 5 times
  Total docs: 10,000. Docs with "distributed": 100

  TF(A) = 5/100 = 0.05
  TF(B) = 5/500 = 0.01
  IDF    = log10(10000/100) = 2.0   (base-10 log used in this example)

  TF-IDF(A) = 0.05 * 2.0 = 0.10
  TF-IDF(B) = 0.01 * 2.0 = 0.02
  --> Doc A ranks higher (higher term density)
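
The arithmetic above, checked in code (base-10 log, as in the worked example):

```python
import math

def tf_idf(term_count, doc_len, total_docs, docs_with_term):
    tf = term_count / doc_len
    idf = math.log10(total_docs / docs_with_term)
    return tf * idf

score_a = tf_idf(5, 100, 10_000, 100)  # 0.05 * 2.0 = 0.10
score_b = tf_idf(5, 500, 10_000, 100)  # 0.01 * 2.0 = 0.02
# Doc A wins on term density despite identical raw counts
```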

Intuition:

  • High TF: The term appears frequently in this document (relevant)
  • High IDF: The term is rare across all documents (discriminating)
  • "the" has high TF but very low IDF (appears everywhere, contributes nothing)

BM25 (Best Matching 25)

The industry standard. Elasticsearch uses BM25 by default since v5.0.

BM25(t, d) = IDF(t) * (TF(t,d) * (k1 + 1)) / (TF(t,d) + k1 * (1 - b + b * |d|/avgdl))

Where:
  k1 = 1.2  (term frequency saturation parameter)
  b  = 0.75 (document length normalization)
  |d| = document length
  avgdl = average document length
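
A direct transcription of the formula, single term against a single document. The IDF here is the plain log(N/df) from the TF-IDF section; Lucene's BM25 uses a smoothed variant, so treat this as a sketch of the shape, not Elasticsearch's exact scores:

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 contribution of one term to one document's score."""
    idf = math.log(n_docs / df)
    length_norm = 1 - b + b * (doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Saturation: 10x the occurrences yields far less than 10x the score
low  = bm25_term(tf=1,  df=100, n_docs=10_000, doc_len=200, avgdl=200)
high = bm25_term(tf=10, df=100, n_docs=10_000, doc_len=200, avgdl=200)
# high < 10 * low --> diminishing returns on repeated terms
```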

Key improvements over TF-IDF:

  • TF saturation: After a term appears many times, additional occurrences contribute diminishing returns
  • Length normalization: Longer documents do not get unfairly boosted just because they have more term occurrences

TF saturation visualization:

Score
  ^
  |          BM25 (saturates)
  |        .-------
  |      /
  |    /      TF-IDF (linear, keeps growing)
  |   /     /
  |  /    /
  | /   /
  |/  /
  +/--------------------> Term Frequency

Scoring Comparison

| Feature              | TF-IDF             | BM25                             |
|----------------------|--------------------|----------------------------------|
| TF behavior          | Linear (unbounded) | Saturates (logarithmic)          |
| Length normalization | Basic              | Configurable (b parameter)       |
| Practical use        | Academic baseline  | Industry standard                |
| Tuning knobs         | None               | k1 (saturation), b (length norm) |

Elasticsearch Architecture

Elasticsearch is the most widely used search engine in production. Understanding its architecture is critical for interviews.

Cluster Topology

+----------------------------------------------------+
|                Elasticsearch Cluster               |
|                                                    |
|  +----------+  +----------+  +----------+          |
|  |  Node 1  |  |  Node 2  |  |  Node 3  |          |
|  | (Master) |  | (Data)   |  | (Data)   |          |
|  |          |  |          |  |          |          |
|  | Shard 0P |  | Shard 0R |  | Shard 1R |          |
|  | Shard 1P |  | Shard 2P |  | Shard 2R |          |
|  +----------+  +----------+  +----------+          |
+----------------------------------------------------+

P = Primary shard    R = Replica shard

Core Concepts

| Concept  | Description                                      | Analogy          |
|----------|--------------------------------------------------|------------------|
| Index    | Collection of documents with similar structure   | Database table   |
| Document | A JSON object stored and indexed                 | Table row        |
| Shard    | A partition of an index (a Lucene index)         | Table partition  |
| Replica  | Copy of a primary shard for HA + read throughput | Read replica     |
| Segment  | Immutable file within a shard (Lucene segment)   | SSTable          |
| Node     | A single Elasticsearch instance                  | Server           |
| Cluster  | Group of nodes working together                  | Database cluster |

Near-Real-Time Search

Elasticsearch is near-real-time, not real-time:

Write --> In-memory buffer + translog (WAL)
              |
         Refresh (every 1s default)
              |
         New immutable Lucene segment (now searchable)
              |
         Background segment merge (compaction)
              |
         Larger, fewer segments

Why segments are immutable:

  • No locks needed for reads (concurrent search is fast)
  • Efficient caching (segments never change)
  • Deletes are tombstone markers, cleaned up during merge

Search Relevance Tuning

Boosting Strategies

```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":       { "query": "running shoes", "boost": 3 } } },
        { "match": { "description": { "query": "running shoes", "boost": 1 } } },
        { "match": { "brand":       { "query": "running shoes", "boost": 2 } } }
      ]
    }
  }
}
```

Function Score (Combining Relevance + Business Signals)

Final score = text_relevance * popularity_boost * recency_decay * personalization

Example factors:
  - Sales velocity (popular items rank higher)
  - Recency (newer content ranks higher)
  - User affinity (items matching user preferences)
  - Availability (in-stock items boosted)
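
A sketch of such a blend in Python. The weights, half-life, and dampening choices are illustrative assumptions; production systems tune these offline against click and conversion data:

```python
import math

def final_score(text_relevance, sales_velocity, age_days, user_affinity,
                in_stock, half_life_days=30.0):
    """Multiplicative blend of text relevance and business signals."""
    popularity = 1 + math.log1p(sales_velocity)   # dampen raw sales counts
    recency = 0.5 ** (age_days / half_life_days)  # exponential time decay
    availability = 1.0 if in_stock else 0.1       # demote out-of-stock items
    return text_relevance * popularity * recency * user_affinity * availability

fresh = final_score(10.0, sales_velocity=50, age_days=1,  user_affinity=1.0, in_stock=True)
stale = final_score(10.0, sales_velocity=50, age_days=90, user_affinity=1.0, in_stock=True)
# fresh > stale: identical text relevance, but recency decay demotes the old item
```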

Common Relevance Problems

| Problem                     | Symptom                                      | Fix                                    |
|-----------------------------|----------------------------------------------|----------------------------------------|
| Short queries underperform  | "shoes" returns irrelevant results           | Add field boosting, use synonyms       |
| Long documents dominate     | Wikipedia articles always rank first         | Adjust BM25 b parameter                |
| Exact matches buried        | "Nike Air Max" ranks partial matches higher  | Add exact-match clause with high boost |
| Typos return nothing        | "runnign shoes" returns 0 results            | Enable fuzzy matching (edit distance)  |
| Stale content ranks high    | Old articles outrank fresh ones              | Add recency decay function             |

Autocomplete / Typeahead

Implementation Approaches

| Approach             | Latency | Complexity | Best For                      |
|----------------------|---------|------------|-------------------------------|
| Prefix query         | Medium  | Low        | Simple prefix matching        |
| Edge n-gram          | Fast    | Medium     | Prefix matching at index time |
| Completion suggester | Fastest | Medium     | Dedicated autocomplete        |
| Trie (in-memory)     | Fastest | High       | Custom implementations        |

Edge N-Gram Tokenizer

Index-time tokenization that creates prefixes:

Input: "elasticsearch"

Edge n-grams (min=2, max=10):
  "el", "ela", "elas", "elast", "elasti", "elastic", "elastics",
  "elasticse", "elasticsea"

Query for "elas" matches because "elas" is an indexed token.
No wildcard needed at query time --> fast lookup.
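
The token generation is trivial to reproduce; a sketch matching the example above:

```python
def edge_ngrams(term, min_gram=2, max_gram=10):
    """Generate the prefixes an edge n-gram tokenizer would index."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

tokens = edge_ngrams("elasticsearch")
# ["el", "ela", ..., "elasticsea"] -- "elas" is among them, so the
# prefix query "elas" becomes an exact token lookup at query time
```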

Completion Suggester (Elasticsearch)

Uses an in-memory FST (Finite State Transducer) for extremely fast prefix lookups.

```json
// Index mapping
{
  "mappings": {
    "properties": {
      "suggest": { "type": "completion" }
    }
  }
}

// Query
{
  "suggest": {
    "product-suggest": {
      "prefix": "elast",
      "completion": { "field": "suggest", "size": 5 }
    }
  }
}
```

Autocomplete Architecture

  User keystroke ("run")
       |
  [Debounce: 100-300ms]
       |
  [Autocomplete Service]
       |
  +----+----+
  |         |
  [Trie/   [Analytics]
   Cache]   (what do users click
             after this prefix?)
       |
  Return top 5-10 suggestions
  ranked by: popularity, recency, personalization

Design considerations:

  • Debounce requests (do not query on every keystroke)
  • Cache popular prefixes aggressively
  • Personalize suggestions based on user history
  • Update suggestion rankings from search + click analytics logs

Faceted Search

Faceted search lets users filter and drill down into results by categories.

Search: "laptop"

Results: 15,234 laptops

Facets (aggregations):
  Brand:    Apple (3,421)  Dell (2,890)  Lenovo (2,100)  ...
  Price:    $0-500 (4,200) $500-1000 (6,300) $1000+ (4,734)
  RAM:      8GB (5,100)    16GB (7,200)   32GB (2,934)
  Rating:   4+ stars (10,200)  3+ stars (13,800)

Elasticsearch implementation:

```json
{
  "query": { "match": { "name": "laptop" } },
  "aggs": {
    "brands": {
      "terms": { "field": "brand.keyword", "size": 10 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 500 },
          { "from": 500, "to": 1000 },
          { "from": 1000 }
        ]
      }
    }
  }
}
```

Performance note: Aggregations are expensive. Use doc_values (columnar storage) for facet fields. Pre-compute facets for high-traffic queries.
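
Conceptually, a terms or range aggregation is just bucket counting over the candidate set. A toy illustration with made-up documents and price buckets:

```python
from collections import Counter

# Hypothetical candidate documents after the "laptop" text query
hits = [
    {"brand": "Apple",  "price": 1999},
    {"brand": "Dell",   "price": 799},
    {"brand": "Apple",  "price": 1299},
    {"brand": "Lenovo", "price": 450},
]

brand_facet = Counter(doc["brand"] for doc in hits)  # terms aggregation

def price_bucket(price):
    if price < 500:
        return "$0-500"
    if price < 1000:
        return "$500-1000"
    return "$1000+"

price_facet = Counter(price_bucket(doc["price"]) for doc in hits)  # range agg
# brand_facet: Apple 2, Dell 1, Lenovo 1
```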


Scaling Search

Index Sharding

By hash (most common):

shard_number = hash(doc_id) % num_shards

Pros: Even distribution
Cons: Cannot add shards without reindexing (shard count fixed at index creation)
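
The reindexing constraint is easy to demonstrate. Here `hash()` stands in for a stable hash such as murmur3 (which Elasticsearch uses for routing); small integers hash to themselves in CPython, keeping the demo deterministic:

```python
def shard_for(doc_key, num_shards):
    # hash() as a stand-in for a stable document-ID hash
    return hash(doc_key) % num_shards

before = {doc: shard_for(doc, 4) for doc in range(8)}
after  = {doc: shard_for(doc, 5) for doc in range(8)}
moved = sum(before[doc] != after[doc] for doc in before)
# moved == 4: half the documents land on a different shard after changing
# the shard count, which is why adding shards requires a full reindex
```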

By time (for logs/events):

Index per time period:
  logs-2026-04-01
  logs-2026-04-02
  logs-2026-04-03

Query: search across all indices matching "logs-2026-04-*"
Delete old data: just drop the old index (instant, no tombstones)

Scatter-Gather Query Pattern

Coordinator Node
  |
  +-- query --> Shard 0  --> top 10 results
  +-- query --> Shard 1  --> top 10 results
  +-- query --> Shard 2  --> top 10 results
  |
  <-- merge + re-rank top 30 --> return top 10

Two-phase process:

  1. Query phase: Each shard returns doc IDs + scores
  2. Fetch phase: Coordinator fetches full documents from relevant shards

Deep pagination problem: Page 1000 requires each shard to return 10,000 results. Use search_after or scroll API instead of from + size.
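
The coordinator's merge step in the query phase can be sketched with a heap (per-shard results are hypothetical; the fetch phase that retrieves full documents is omitted):

```python
import heapq

# Per-shard query-phase results: (score, doc_id), sorted by descending score
shard_results = [
    [(9.1, "a1"), (7.4, "a2"), (3.3, "a3")],  # shard 0
    [(8.8, "b1"), (8.2, "b2"), (1.0, "b3")],  # shard 1
    [(9.5, "c1"), (2.2, "c2")],               # shard 2
]

def merge_top_k(shard_results, k):
    """Coordinator-side merge: re-rank the union of per-shard top-k lists."""
    return heapq.nlargest(k, (hit for shard in shard_results for hit in shard))

top3 = merge_top_k(shard_results, 3)
# [(9.5, "c1"), (9.1, "a1"), (8.8, "b1")]
```

Note that for page p with page size k, each shard must return its top p*k candidates before the merge, which is exactly the deep-pagination cost described above.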

Routing

Skip the scatter step by routing to a specific shard:

# Index with routing
PUT /users/_doc/123?routing=user_456

# Query with routing (hits only one shard)
GET /users/_search?routing=user_456

Useful when data naturally partitions by tenant, user, or region.

Shard Sizing Guidelines

| Guideline         | Recommendation                             |
|-------------------|--------------------------------------------|
| Shard size        | 10-50 GB per shard                         |
| Shards per node   | Fewer than 20 per GB of JVM heap           |
| Oversharding risk | Each shard carries ~10 MB of heap overhead |
| Too few shards    | Cannot scale reads; creates hot spots      |

Real-Time vs Batch Indexing

| Approach         | Latency          | Throughput | Use Case                   |
|------------------|------------------|------------|----------------------------|
| Real-time (sync) | Immediate        | Lower      | E-commerce product updates |
| Near real-time   | 1-5 seconds      | Good       | Social media posts, news   |
| Batch            | Minutes to hours | Highest    | Analytics, full reindex    |

Near Real-Time Pipeline

  Data Source --> Kafka --> Indexing Consumer --> Elasticsearch
                   |
              [Buffer + batch to ES bulk API]
              [Typical: batch of 1000 docs every 5s]
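
The buffering consumer can be sketched as a class that flushes either on size or on elapsed time. The batch size and interval mirror the illustrative numbers above; the flush callback stands in for one bulk-index request:

```python
import time

class BulkBuffer:
    """Buffer documents and flush in batches, by count or by elapsed time."""
    def __init__(self, flush_fn, max_docs=1000, max_seconds=5.0):
        self.flush_fn = flush_fn
        self.max_docs = max_docs
        self.max_seconds = max_seconds
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, doc):
        self.buffer.append(doc)
        if (len(self.buffer) >= self.max_docs
                or time.monotonic() - self.last_flush >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)  # e.g. one bulk API request
            self.buffer = []
        self.last_flush = time.monotonic()

batches = []
buf = BulkBuffer(batches.append, max_docs=3, max_seconds=60.0)
for i in range(7):
    buf.add({"id": i})
buf.flush()  # drain the tail on shutdown
# batches now holds three bulk requests of 3, 3, and 1 documents
```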

Hybrid approach (common in practice):

  • Batch index the full dataset nightly
  • Real-time index incremental updates during the day
  • Periodic full reindex to correct drift

Interview Tips

  1. Start with the inverted index. If asked "how does search work," begin here. It shows you understand fundamentals, not just Elasticsearch APIs.
  2. Know BM25. Explain TF saturation and length normalization in one sentence: "BM25 prevents long documents and keyword-stuffed content from dominating results."
  3. Scatter-gather is the distributed search pattern. Draw the two-phase query + fetch process.
  4. Autocomplete is its own sub-system. Do not bolt it onto full-text search; explain why it needs a separate data structure (trie, completion suggester, edge n-gram).
  5. Discuss the write path. Immutable segments, refresh interval, and the trade-off between indexing speed and search freshness.
  6. Relevance is hard. Pure text relevance is rarely enough; business signals (popularity, recency, personalization) always matter in production.

Resources

  • DDIA (Kleppmann) -- Chapter 3: Storage and Retrieval (inverted indexes, full-text search)
  • System Design Interview Vol. 2 (Alex Xu) -- Chapter on search autocomplete
  • Elasticsearch: The Definitive Guide (Clinton Gormley, Zachary Tong)
  • Lucene in Action (McCandless, Hatcher, Gospodnetic)
  • "Introduction to Information Retrieval" (Manning, Raghavan, Schutze) -- free online textbook
  • Elasticsearch Documentation: BM25 similarity, analysis, aggregations
  • Robertson & Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond" (2009)
  • Doug Turnbull, "Relevant Search" (Manning Publications)

Previous: 20 - Object Storage & File Systems | Next: 22 - Time-Series & Analytics Databases