19 - Bloom Filters & Probabilistic Data Structures

Previous: 18 - Unique ID Generation | Next: 20 - Object Storage & File Systems


Why Probabilistic Data Structures Matter at Scale

At FAANG scale, exact answers are sometimes too expensive. When you have billions of items, even a hash set requires enormous memory. Probabilistic data structures trade a small, bounded error rate for massive savings in memory and speed.

Core principle: Accept a small probability of error to gain orders-of-magnitude improvements in space and time.

Problem                                 | Exact Cost        | Probabilistic Cost   | Structure
----------------------------------------|-------------------|----------------------|-----------------
"Have I seen this URL?" (1B items)      | ~16 GB (hash set) | ~1.2 GB (1% FP)      | Bloom filter
"How many distinct users?" (1B events)  | ~16 GB            | ~12 KB (0.81% error) | HyperLogLog
"How often does this word appear?"      | O(n) counters     | O(1) per query       | Count-Min Sketch

Bloom Filters

How They Work

A Bloom filter is a bit array of size m with k independent hash functions.

Insert("apple"):
  h1("apple") = 3    h2("apple") = 7    h3("apple") = 11

  Bit array (m = 16):
  [0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0]
         ^       ^       ^
         3       7       11

Query("apple"):
  Check bits 3, 7, 11 --> all 1 --> "PROBABLY IN SET"

Query("banana"):
  h1("banana") = 3    h2("banana") = 9    h3("banana") = 14
  Check bits 3, 9, 14 --> bit 9 is 0 --> "DEFINITELY NOT IN SET"

Key properties:

  • No false negatives: If the filter says "not present," it is definitely not present
  • Possible false positives: If the filter says "present," it might not be (bits set by other items)
  • No deletions: You cannot remove items (clearing bits may affect other items)
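
The insert/query mechanics above can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the class name is arbitrary, and the k indexes are derived from one SHA-256 digest via double hashing (real implementations typically use fast non-cryptographic hashes such as MurmurHash).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k indexes per item."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)  # m bits, packed into bytes

    def _indexes(self, item):
        # Split one SHA-256 digest into two 64-bit values and combine them
        # (double hashing) to simulate k independent hash functions.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, item):
        # All k bits set --> "probably in set"; any bit clear --> "definitely not".
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))
```

Because `add` sets exactly the bits `might_contain` checks, an inserted item can never be reported absent, which is the no-false-negative property.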

False Positive Rate

The probability of a false positive with n items inserted:

FP rate = (1 - e^(-kn/m))^k

Where:
  m = number of bits
  k = number of hash functions
  n = number of items inserted

Optimal number of hash functions:

k_optimal = (m/n) * ln(2) ~ 0.693 * (m/n)

Sizing Formula

Given a desired false positive rate p and expected number of items n:

m = -(n * ln(p)) / (ln(2))^2

Example: n = 1 billion, p = 1% (0.01)
  m = -(1e9 * ln(0.01)) / (ln(2))^2
  m = -(1e9 * (-4.605)) / 0.4805
  m ~ 9.585 billion bits ~ 1.2 GB

  k = 0.693 * (9.585e9 / 1e9) ~ 7 hash functions
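
The two sizing formulas translate directly into a small helper (a sketch; the function name is illustrative):

```python
import math

def bloom_sizing(n, p):
    """Return (bits m, hash count k) for n expected items at FP rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

m, k = bloom_sizing(1_000_000_000, 0.01)   # ~9.6 billion bits (~1.2 GB), k = 7
```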

Sizing Quick Reference

Items (n) | FP Rate (p) | Bits (m) | Memory | Hash Functions (k)
----------|-------------|----------|--------|-------------------
1M        | 1%          | 9.6M     | 1.2 MB | 7
10M       | 1%          | 96M      | 12 MB  | 7
100M      | 1%          | 960M     | 120 MB | 7
1B        | 1%          | 9.6B     | 1.2 GB | 7
1B        | 0.1%        | 14.4B    | 1.8 GB | 10

Use Cases

  1. Cache penetration prevention: Check the Bloom filter before hitting the database. If the filter says "not present," skip the DB query entirely. Prevents attackers from flooding your cache with requests for non-existent keys.
Request for key "xyz123"
  |
  v
+---------------+     miss      +-------+     miss      +----+
| Bloom Filter  | ------------> | Cache | ------------> | DB |
| "Definitely   |               +-------+               +----+
|  not in DB"   |
+---------------+
  |
  "Not in set" --> Return 404 immediately (no DB hit)

  2. Web crawler duplicate URL detection: Before crawling a URL, check whether it has already been visited. False positives mean skipping a few URLs (acceptable). False negatives would mean re-crawling and possible infinite loops, which never happen with a Bloom filter.

  3. Spell checker: Dictionary of valid words in a Bloom filter. "Probably valid" means accept; "definitely not" means flag as a misspelling.

  4. Recommendation "already seen" filter: Per-user Bloom filter of consumed content IDs. A false positive hides one unseen item (minor). No false negatives means never re-showing content.

  5. Distributed databases (Cassandra, HBase, RocksDB): Each SSTable has a Bloom filter. Before reading a file from disk, check the filter. This avoids unnecessary disk I/O for keys that are not in that SSTable.

Bloom Filter Variants

Variant           | Feature                                | Use Case
------------------|----------------------------------------|----------------------------------
Counting Bloom    | Replace bits with 4-bit counters       | Supports deletion
Scalable Bloom    | Chain of growing filters               | Unknown cardinality
Partitioned Bloom | Bit array divided among hash functions | Better cache behavior
Cuckoo Filter     | Fingerprints in buckets                | Deletion + better space at low FP

Count-Min Sketch

Estimates the frequency of items in a stream. Think "approximate counter for everything."

How It Works

d hash functions, each mapping to a row of w counters:

         w columns
     +---+---+---+---+---+---+---+
h1:  | 0 | 2 | 0 | 1 | 0 | 3 | 0 |
     +---+---+---+---+---+---+---+
h2:  | 1 | 0 | 0 | 0 | 2 | 0 | 0 |
     +---+---+---+---+---+---+---+
h3:  | 0 | 0 | 3 | 0 | 0 | 1 | 0 |
     +---+---+---+---+---+---+---+

Insert("cat"):
  h1("cat") = 3 --> row 1, col 3: increment
  h2("cat") = 4 --> row 2, col 4: increment
  h3("cat") = 2 --> row 3, col 2: increment

Query("cat"):
  min(row1[3], row2[4], row3[2]) = min(1, 2, 3) = 1

  Estimate >= true count (never underestimates)
  Take minimum across rows to minimize overcount
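
The grid above maps to a small class (a minimal sketch with illustrative names; production versions use fast non-cryptographic hashes and sometimes conservative-update variants):

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; estimate = min over rows (one-sided overestimate)."""

    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        # One column per row, derived from a per-row salted hash.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Each row may overcount due to collisions; the minimum is the
        # tightest bound, and it never undercounts.
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(item)))
```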

Error Bounds

With probability >= 1 - delta:
  estimate <= true_count + epsilon * N

Where:
  N = total number of events
  w = ceil(e / epsilon)      -- width
  d = ceil(ln(1 / delta))    -- depth (number of hash functions)

Example: epsilon = 0.001, delta = 0.01
  w = ceil(2.718 / 0.001) = 2,719 columns
  d = ceil(ln(100)) = ceil(4.6) = 5 rows
  Total cells = 2,719 * 5 = 13,595
  Memory (4 bytes/counter) ~ 53 KB
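
The width/depth formulas as a helper (illustrative sketch, reproducing the worked example above):

```python
import math

def cms_dims(epsilon, delta):
    """Width and depth for additive error epsilon*N with probability 1 - delta."""
    w = math.ceil(math.e / epsilon)        # width
    d = math.ceil(math.log(1 / delta))     # depth (number of hash functions)
    return w, d
```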

Use Cases

Use Case                   | Why CMS
---------------------------|----------------------------------------------------
Heavy hitter detection     | Find top-K most frequent items in a stream
Network traffic monitoring | Count packets per source IP without storing all IPs
Rate limiting by key       | Approximate request counts per API key
Click fraud detection      | Detect abnormally high click counts per ad
NLP word frequency         | Approximate term frequency in large corpora

HyperLogLog (Cardinality Estimation)

Answers: "How many distinct elements are in this set?" using remarkably little memory.

The Intuition

Hash each element. Look at the binary representation. The more leading zeros you see, the more unique items you have likely processed.

If you've hashed many items and the longest run of leading zeros is k,
then you've likely seen about 2^k distinct items.

Example:
  hash("alice") = 0001...  --> 3 leading zeros
  hash("bob")   = 0000001... --> 6 leading zeros

  After many items, max leading zeros = 14
  Estimate ~ 2^14 = 16,384 distinct items

How It Actually Works

  1. Hash each element to a uniform bit string
  2. Use the first p bits to assign the element to one of 2^p registers (buckets)
  3. Count leading zeros in the remaining bits + 1; store the maximum per register
  4. Combine all registers using a harmonic mean with bias correction
Registers (p = 4, so 16 registers):

Register: [  3  5  2  7  4  6  3  5  8  2  4  6  3  5  7  4 ]
Index:      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Estimate = alpha * m^2 / sum(2^(-register[i]))

Where alpha is a bias correction constant and m = 2^p
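
The four steps can be sketched as follows. This is a minimal illustration: function names are arbitrary, hashing uses SHA-256, and the alpha formula is the standard bias-correction constant for m >= 128; production implementations (e.g., Redis) add small- and large-range corrections and a sparse representation.

```python
import hashlib

def hll_add(registers, p, item):
    """Update one of the 2**p registers for this item."""
    h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
    idx = h >> (64 - p)                      # first p bits pick the register
    rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
    # rho = position of the leftmost 1-bit (leading zeros + 1).
    rho = (64 - p) - rest.bit_length() + 1
    registers[idx] = max(registers[idx], rho)

def hll_estimate(registers):
    """Harmonic-mean combination with bias correction."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)         # valid for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```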

Memory and Accuracy

Standard error = 1.04 / sqrt(m)

Where m = number of registers = 2^p

p = 14 --> 2^14 = 16,384 registers
Each register: 6 bits (stores values up to 63, enough for 64-bit hashes)
Total memory: 16,384 * 6 / 8 = 12 KB
Standard error: 1.04 / sqrt(16384) = 0.81%

12 KB to count billions of distinct items with <1% error. This is why HyperLogLog is everywhere.

Redis HyperLogLog

PFADD visitors "user_123"
PFADD visitors "user_456"
PFADD visitors "user_123"   -- duplicate, no effect

PFCOUNT visitors             -- returns 2

-- Merge multiple HLLs (e.g., from different servers or time windows)
PFMERGE all_visitors visitors_server1 visitors_server2
PFCOUNT all_visitors

Redis uses 12 KB per HyperLogLog key with p = 14.

Mergeability is a killer feature: you can count distinct users across time windows or servers by merging HLLs without accessing the original data.
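
Because each register stores a maximum, the merge itself is just a register-wise max (sketch; assumes both HLLs were built with the same p):

```python
def hll_merge(a, b):
    """Lossless merge of two same-sized HLL register arrays."""
    assert len(a) == len(b), "HLLs must use the same precision p"
    return [max(x, y) for x, y in zip(a, b)]
```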

Use Cases

Use Case                     | Exact Alternative     | HLL Advantage
-----------------------------|-----------------------|------------------------------------
Unique visitors per page     | SET of user IDs (GB+) | 12 KB per page
Distinct search queries/day  | Hash set (GB+)        | 12 KB per day
Unique IPs hitting a service | Hash set              | 12 KB per endpoint
Cardinality of DB columns    | Full scan + dedup     | Fast approximate for query planning

Cuckoo Filters

An alternative to Bloom filters that supports deletion and can be more space-efficient at low false positive rates.

How It Works

Two possible bucket positions per item:
  b1 = hash(x)
  b2 = b1 XOR hash(fingerprint(x))

Insert: store fingerprint in b1 or b2
If both full: evict existing item to its alternate location (cuckoo hashing)

  Bucket 0: [fp_a, fp_c]
  Bucket 1: [fp_b, ____]
  Bucket 2: [fp_d, fp_e]
  Bucket 3: [____, ____]

Delete: remove the fingerprint from whichever bucket holds it
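
A toy sketch of partial-key cuckoo hashing (all names, the 16-bucket table, and the 1-byte fingerprint are illustrative; real implementations pack fingerprints into fixed-size bucket arrays):

```python
import hashlib
import random

BUCKETS, SLOTS, MAX_KICKS = 16, 4, 500     # BUCKETS must be a power of two

def _h(s):
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def _fingerprint(x):
    return (_h("fp:" + x) % 255) + 1       # nonzero 1-byte fingerprint

def _alt(i, fp):
    # The alternate bucket depends only on the current bucket and the
    # fingerprint, so it stays computable after the original key is gone.
    return i ^ (_h(str(fp)) % BUCKETS)

def insert(table, x):
    fp = _fingerprint(x)
    i1 = _h(x) % BUCKETS
    i2 = _alt(i1, fp)
    for i in (i1, i2):
        if len(table[i]) < SLOTS:
            table[i].append(fp)
            return True
    i = random.choice((i1, i2))            # both full: start evicting
    for _ in range(MAX_KICKS):
        j = random.randrange(len(table[i]))
        fp, table[i][j] = table[i][j], fp  # kick a resident fingerprint out
        i = _alt(i, fp)                    # send it to its alternate bucket
        if len(table[i]) < SLOTS:
            table[i].append(fp)
            return True
    return False                           # filter is considered full

def contains(table, x):
    fp = _fingerprint(x)
    i1 = _h(x) % BUCKETS
    return fp in table[i1] or fp in table[_alt(i1, fp)]

def delete(table, x):
    fp = _fingerprint(x)
    i1 = _h(x) % BUCKETS
    for i in (i1, _alt(i1, fp)):
        if fp in table[i]:
            table[i].remove(fp)
            return True
    return False
```

Deletion works precisely because both candidate buckets are recoverable from the fingerprint alone; removing a fingerprint cannot disturb any other item's slots, unlike clearing bits in a Bloom filter.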

Bloom vs Cuckoo

Feature          | Bloom Filter          | Cuckoo Filter
-----------------|-----------------------|----------------------------
Deletion         | No (unless counting)  | Yes
Space at <3% FP  | Larger                | Smaller (Cuckoo wins)
Space at >3% FP  | Smaller (Bloom wins)  | Larger
Lookup speed     | k hash computations   | 2 bucket lookups (constant)
Max load factor  | N/A                   | ~95%
Implementation   | Simple                | Moderate

Rule of thumb: Use Cuckoo when you need deletion or when the target FP rate is below 3%.


Practical Interview Applications

Scenario 1: URL Shortener Deduplication

Write path:
  1. Check Bloom filter for long_url
  2. If "not present" --> definitely new, create short URL
  3. If "maybe present" --> query database to confirm
     - Truly exists --> return existing short URL
     - False positive --> create new, add to filter

Eliminates ~99% of database lookups for new URLs.

Scenario 2: Web Crawler Seen-URL Tracking

Crawl loop:
  for url in frontier_queue:
    if bloom_filter.might_contain(url):
      skip  -- already crawled (or false positive, acceptable)
    else:
      crawl(url)
      bloom_filter.add(url)

1 billion URLs at 1% FP --> ~1.2 GB
vs. hash set --> ~16+ GB

Scenario 3: Real-Time Analytics Dashboard (HLL)

Each web server:
  PFADD page:/home user_id

At query time:
  PFMERGE daily_uniques server1:/home server2:/home ...
  PFCOUNT daily_uniques  --> approximate unique visitors

Supports arbitrary time-range queries by merging hourly/daily HLLs.

Scenario 4: Rate Limiter (Count-Min Sketch)

For each request:
  cms.increment(api_key)

On query:
  if cms.estimate(api_key) > rate_limit:
    reject(429)

Overcounting = occasionally throttle slightly early (safe direction).

Comparison Matrix

Structure        | Query       | Space                    | Error Type      | Deletions         | Mergeable
-----------------|-------------|--------------------------|-----------------|-------------------|------------------
Bloom Filter     | Membership  | O(n) bits                | False positives | No                | OR of bit arrays
Cuckoo Filter    | Membership  | O(n) bits                | False positives | Yes               | No (generally)
Count-Min Sketch | Frequency   | O((1/eps) * ln(1/delta)) | Overestimate    | Decrement (lossy) | Sum of counters
HyperLogLog      | Cardinality | O(1), ~12 KB             | +/- ~0.81%      | No                | Max of registers

Interview Tips

  1. Know when to reach for probabilistic DS. "Billions of items" or "memory is limited" are your signals.
  2. Explain the error trade-off upfront. "We accept a 1% false positive rate to reduce memory from 16 GB to 1.2 GB."
  3. Bloom filters are the most commonly asked. Know the sizing formula, false positive rate, and at least two real-world uses.
  4. HyperLogLog comes up in analytics design. "Count distinct users" at scale is a classic prompt.
  5. Don't over-engineer. If the dataset fits in memory, use a hash set. Probabilistic structures shine at scale.
  6. Mention the no-false-negative property. This is what makes Bloom filters safe for cache lookups and dedup.

Resources

  • DDIA (Kleppmann) -- Chapter 3: Storage and Retrieval (Bloom filters in LSM-trees)
  • System Design Interview Vol. 2 (Alex Xu) -- Bloom filter in web crawler chapter
  • Original Paper: "Space/Time Trade-offs in Hash Coding with Allowable Errors" (Bloom, 1970)
  • "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm" (Flajolet et al., 2007)
  • "An Improved Data Stream Summary: The Count-Min Sketch" (Cormode & Muthukrishnan, 2005)
  • "Cuckoo Filter: Practically Better Than Bloom" (Fan et al., 2014)
  • Redis Documentation: HyperLogLog commands (PFADD, PFCOUNT, PFMERGE)
  • RedisBloom module documentation (BF.ADD, BF.EXISTS)
  • Google Bigtable Paper: Bloom filter usage in SSTable lookups
