19 - Bloom Filters & Probabilistic Data Structures

Previous: 18 - Unique ID Generation | Next: 20 - Object Storage & File Systems


Why Probabilistic Data Structures Matter at Scale

At FAANG scale, exact answers are sometimes too expensive. When you have billions of items, even a hash set requires enormous memory. Probabilistic data structures trade a small, bounded error rate for massive savings in memory and speed.

Core principle: Accept a small probability of error to gain orders-of-magnitude improvements in space and time.

Problem                                 | Exact Cost        | Probabilistic Cost   | Structure
----------------------------------------|-------------------|----------------------|-----------------
"Have I seen this URL?" (1B items)      | ~16 GB (hash set) | ~1.2 GB (1% FP)      | Bloom filter
"How many distinct users?" (1B events)  | ~16 GB            | ~12 KB (0.81% error) | HyperLogLog
"How often does this word appear?"      | O(n) counters     | O(1) per query       | Count-Min Sketch

Bloom Filters

How They Work

A Bloom filter is a bit array of size m with k independent hash functions.

Insert("apple"):
  h1("apple") = 3    h2("apple") = 7    h3("apple") = 11

  Bit array (m = 16):
  [0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0]
         ^       ^       ^
         3       7       11

Query("apple"):
  Check bits 3, 7, 11 --> all 1 --> "PROBABLY IN SET"

Query("banana"):
  h1("banana") = 3    h2("banana") = 9    h3("banana") = 14
  Check bits 3, 9, 14 --> bit 9 is 0 --> "DEFINITELY NOT IN SET"

Key properties:

  • No false negatives: If the filter says "not present," it is definitely not present
  • Possible false positives: If the filter says "present," it might not be (bits set by other items)
  • No deletions: You cannot remove items (clearing bits may affect other items)
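
The insert/query mechanics above can be sketched in a few lines of Python. This is a minimal illustration, not a production filter: the class name is arbitrary, and the k indexes are derived from one SHA-256 digest via double hashing (real implementations typically use fast non-cryptographic hashes such as MurmurHash).

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k indexes per item."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)  # m bits, packed into bytes

    def _indexes(self, item):
        # Split one SHA-256 digest into two 64-bit values and combine them
        # (double hashing) to simulate k independent hash functions.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def might_contain(self, item):
        # All k bits set --> "probably in set"; any bit clear --> "definitely not".
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))
```

Because `add` sets exactly the bits `might_contain` checks, an inserted item can never be reported absent, which is the no-false-negative property.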

False Positive Rate

The probability of a false positive with n items inserted:

FP rate = (1 - e^(-kn/m))^k

Where:
  m = number of bits
  k = number of hash functions
  n = number of items inserted

Optimal number of hash functions:

k_optimal = (m/n) * ln(2) ~ 0.693 * (m/n)

Sizing Formula

Given a desired false positive rate p and expected number of items n:

m = -(n * ln(p)) / (ln(2))^2

Example: n = 1 billion, p = 1% (0.01)
  m = -(1e9 * ln(0.01)) / (ln(2))^2
  m = -(1e9 * (-4.605)) / 0.4805
  m ~ 9.585 billion bits ~ 1.2 GB

  k = 0.693 * (9.585e9 / 1e9) ~ 7 hash functions
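
The two sizing formulas translate directly into a small helper (a sketch; the function name is illustrative):

```python
import math

def bloom_sizing(n, p):
    """Return (bits m, hash count k) for n expected items at FP rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

m, k = bloom_sizing(1_000_000_000, 0.01)   # ~9.6 billion bits (~1.2 GB), k = 7
```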

Sizing Quick Reference

Items (n) | FP Rate (p) | Bits (m) | Memory | Hash Functions (k)
----------|-------------|----------|--------|-------------------
1M        | 1%          | 9.6M     | 1.2 MB | 7
10M       | 1%          | 96M      | 12 MB  | 7
100M      | 1%          | 960M     | 120 MB | 7
1B        | 1%          | 9.6B     | 1.2 GB | 7
1B        | 0.1%        | 14.4B    | 1.8 GB | 10

Use Cases

  1. Cache penetration prevention: Check the Bloom filter before hitting the database. If the filter says "not present," skip the DB query entirely. Prevents attackers from flooding your cache with requests for non-existent keys.
Request for key "xyz123"
  |
  v
+---------------+     miss      +-------+     miss      +----+
| Bloom Filter  | ------------> | Cache | ------------> | DB |
| "Definitely   |               +-------+               +----+
|  not in DB"   |
+---------------+
  |
  "Not in set" --> Return 404 immediately (no DB hit)

  2. Web crawler duplicate URL detection: Before crawling a URL, check whether it has already been visited. False positives mean skipping a few URLs (acceptable). False negatives would mean re-crawling and possible infinite loops, which never happen with a Bloom filter.

  3. Spell checker: Dictionary of valid words in a Bloom filter. "Probably valid" means accept; "definitely not" means flag as a misspelling.

  4. Recommendation "already seen" filter: Per-user Bloom filter of consumed content IDs. A false positive hides one unseen item (minor). No false negatives means never re-showing content.

  5. Distributed databases (Cassandra, HBase, RocksDB): Each SSTable has a Bloom filter. Before reading a file from disk, check the filter. This avoids unnecessary disk I/O for keys that are not in that SSTable.

Bloom Filter Variants

Variant           | Feature                                | Use Case
------------------|----------------------------------------|----------------------------------
Counting Bloom    | Replace bits with 4-bit counters       | Supports deletion
Scalable Bloom    | Chain of growing filters               | Unknown cardinality
Partitioned Bloom | Bit array divided among hash functions | Better cache behavior
Cuckoo Filter     | Fingerprints in buckets                | Deletion + better space at low FP

Count-Min Sketch

Estimates the frequency of items in a stream. Think "approximate counter for everything."

How It Works

d hash functions, each mapping to a row of w counters:

         w columns
     +---+---+---+---+---+---+---+
h1:  | 0 | 2 | 0 | 1 | 0 | 3 | 0 |
     +---+---+---+---+---+---+---+
h2:  | 1 | 0 | 0 | 0 | 2 | 0 | 0 |
     +---+---+---+---+---+---+---+
h3:  | 0 | 0 | 3 | 0 | 0 | 1 | 0 |
     +---+---+---+---+---+---+---+

Insert("cat"):
  h1("cat") = 3 --> row 1, col 3: increment
  h2("cat") = 4 --> row 2, col 4: increment
  h3("cat") = 2 --> row 3, col 2: increment

Query("cat"):
  min(row1[3], row2[4], row3[2]) = min(1, 2, 3) = 1

  Estimate >= true count (never underestimates)
  Take minimum across rows to minimize overcount
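
The grid above maps to a small class (a minimal sketch with illustrative names; production versions use fast non-cryptographic hashes and sometimes conservative-update variants):

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; estimate = min over rows (one-sided overestimate)."""

    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        # One column per row, derived from a per-row salted hash.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Each row may overcount due to collisions; the minimum is the
        # tightest bound, and it never undercounts.
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(item)))
```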

Error Bounds

With probability >= 1 - delta:
  estimate <= true_count + epsilon * N

Where:
  N = total number of events
  w = ceil(e / epsilon)      -- width
  d = ceil(ln(1 / delta))    -- depth (number of hash functions)

Example: epsilon = 0.001, delta = 0.01
  w = ceil(2.718 / 0.001) = 2,719 columns
  d = ceil(ln(100)) = ceil(4.6) = 5 rows
  Total cells = 2,719 * 5 = 13,595
  Memory (4 bytes/counter) ~ 53 KB
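
The width/depth formulas as a helper (illustrative sketch, reproducing the worked example above):

```python
import math

def cms_dims(epsilon, delta):
    """Width and depth for additive error epsilon*N with probability 1 - delta."""
    w = math.ceil(math.e / epsilon)        # width
    d = math.ceil(math.log(1 / delta))     # depth (number of hash functions)
    return w, d
```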

Use Cases

Use Case                   | Why CMS
---------------------------|----------------------------------------------------
Heavy hitter detection     | Find top-K most frequent items in a stream
Network traffic monitoring | Count packets per source IP without storing all IPs
Rate limiting by key       | Approximate request counts per API key
Click fraud detection      | Detect abnormally high click counts per ad
NLP word frequency         | Approximate term frequency in large corpora

HyperLogLog (Cardinality Estimation)

Answers: "How many distinct elements are in this set?" using remarkably little memory.

The Intuition

Hash each element. Look at the binary representation. The more leading zeros you see, the more unique items you have likely processed.

If you've hashed many items and the longest run of leading zeros is k,
then you've likely seen about 2^k distinct items.

Example:
  hash("alice") = 0001...  --> 3 leading zeros
  hash("bob")   = 0000001... --> 6 leading zeros

  After many items, max leading zeros = 14
  Estimate ~ 2^14 = 16,384 distinct items

How It Actually Works

  1. Hash each element to a uniform bit string
  2. Use the first p bits to assign the element to one of 2^p registers (buckets)
  3. Count leading zeros in the remaining bits + 1; store the maximum per register
  4. Combine all registers using a harmonic mean with bias correction
Registers (p = 4, so 16 registers):

Register: [  3  5  2  7  4  6  3  5  8  2  4  6  3  5  7  4 ]
Index:      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Estimate = alpha * m^2 / sum(2^(-register[i]))

Where alpha is a bias correction constant and m = 2^p
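
The four steps can be sketched as follows. This is a minimal illustration: function names are arbitrary, hashing uses SHA-256, and the alpha formula is the standard bias-correction constant for m >= 128; production implementations (e.g., Redis) add small- and large-range corrections and a sparse representation.

```python
import hashlib

def hll_add(registers, p, item):
    """Update one of the 2**p registers for this item."""
    h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
    idx = h >> (64 - p)                      # first p bits pick the register
    rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
    # rho = position of the leftmost 1-bit (leading zeros + 1).
    rho = (64 - p) - rest.bit_length() + 1
    registers[idx] = max(registers[idx], rho)

def hll_estimate(registers):
    """Harmonic-mean combination with bias correction."""
    m = len(registers)
    alpha = 0.7213 / (1 + 1.079 / m)         # valid for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```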

Memory and Accuracy

Standard error = 1.04 / sqrt(m)

Where m = number of registers = 2^p

p = 14 --> 2^14 = 16,384 registers
Each register: 6 bits (stores values up to 63, enough for 64-bit hashes)
Total memory: 16,384 * 6 / 8 = 12 KB
Standard error: 1.04 / sqrt(16384) = 0.81%

12 KB to count billions of distinct items with <1% error. This is why HyperLogLog is everywhere.

Redis HyperLogLog

PFADD visitors "user_123"
PFADD visitors "user_456"
PFADD visitors "user_123"   -- duplicate, no effect

PFCOUNT visitors             -- returns 2

-- Merge multiple HLLs (e.g., from different servers or time windows)
PFMERGE all_visitors visitors_server1 visitors_server2
PFCOUNT all_visitors

Redis uses 12 KB per HyperLogLog key with p = 14.

Mergeability is a killer feature: you can count distinct users across time windows or servers by merging HLLs without accessing the original data.
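
Because each register stores a maximum, the merge itself is just a register-wise max (sketch; assumes both HLLs were built with the same p):

```python
def hll_merge(a, b):
    """Lossless merge of two same-sized HLL register arrays."""
    assert len(a) == len(b), "HLLs must use the same precision p"
    return [max(x, y) for x, y in zip(a, b)]
```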

Use Cases

Use Case                     | Exact Alternative     | HLL Advantage
-----------------------------|-----------------------|------------------------------------
Unique visitors per page     | SET of user IDs (GB+) | 12 KB per page
Distinct search queries/day  | Hash set (GB+)        | 12 KB per day
Unique IPs hitting a service | Hash set              | 12 KB per endpoint
Cardinality of DB columns    | Full scan + dedup     | Fast approximate for query planning

Cuckoo Filters

An alternative to Bloom filters that supports deletion and can be more space-efficient at low false positive rates.

How It Works

Two possible bucket positions per item:
  b1 = hash(x)
  b2 = b1 XOR hash(fingerprint(x))

Insert: store fingerprint in b1 or b2
If both full: evict existing item to its alternate location (cuckoo hashing)

  Bucket 0: [fp_a, fp_c]
  Bucket 1: [fp_b, ____]
  Bucket 2: [fp_d, fp_e]
  Bucket 3: [____, ____]

Delete: remove the fingerprint from whichever bucket holds it
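
A toy sketch of partial-key cuckoo hashing (all names, the 16-bucket table, and the 1-byte fingerprint are illustrative; real implementations pack fingerprints into fixed-size bucket arrays):

```python
import hashlib
import random

BUCKETS, SLOTS, MAX_KICKS = 16, 4, 500     # BUCKETS must be a power of two

def _h(s):
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def _fingerprint(x):
    return (_h("fp:" + x) % 255) + 1       # nonzero 1-byte fingerprint

def _alt(i, fp):
    # The alternate bucket depends only on the current bucket and the
    # fingerprint, so it stays computable after the original key is gone.
    return i ^ (_h(str(fp)) % BUCKETS)

def insert(table, x):
    fp = _fingerprint(x)
    i1 = _h(x) % BUCKETS
    i2 = _alt(i1, fp)
    for i in (i1, i2):
        if len(table[i]) < SLOTS:
            table[i].append(fp)
            return True
    i = random.choice((i1, i2))            # both full: start evicting
    for _ in range(MAX_KICKS):
        j = random.randrange(len(table[i]))
        fp, table[i][j] = table[i][j], fp  # kick a resident fingerprint out
        i = _alt(i, fp)                    # send it to its alternate bucket
        if len(table[i]) < SLOTS:
            table[i].append(fp)
            return True
    return False                           # filter is considered full

def contains(table, x):
    fp = _fingerprint(x)
    i1 = _h(x) % BUCKETS
    return fp in table[i1] or fp in table[_alt(i1, fp)]

def delete(table, x):
    fp = _fingerprint(x)
    i1 = _h(x) % BUCKETS
    for i in (i1, _alt(i1, fp)):
        if fp in table[i]:
            table[i].remove(fp)
            return True
    return False
```

Deletion works precisely because both candidate buckets are recoverable from the fingerprint alone; removing a fingerprint cannot disturb any other item's slots, unlike clearing bits in a Bloom filter.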

Bloom vs Cuckoo

Feature          | Bloom Filter          | Cuckoo Filter
-----------------|-----------------------|----------------------------
Deletion         | No (unless counting)  | Yes
Space at <3% FP  | Larger                | Smaller (Cuckoo wins)
Space at >3% FP  | Smaller (Bloom wins)  | Larger
Lookup speed     | k hash computations   | 2 bucket lookups (constant)
Max load factor  | N/A                   | ~95%
Implementation   | Simple                | Moderate

Rule of thumb: Use Cuckoo when you need deletion or when the target FP rate is below 3%.


Practical Interview Applications

Scenario 1: URL Shortener Deduplication

Write path:
  1. Check Bloom filter for long_url
  2. If "not present" --> definitely new, create short URL
  3. If "maybe present" --> query database to confirm
     - Truly exists --> return existing short URL
     - False positive --> create new, add to filter

Eliminates ~99% of database lookups for new URLs.

Scenario 2: Web Crawler Seen-URL Tracking

Crawl loop:
  for url in frontier_queue:
    if bloom_filter.might_contain(url):
      skip  -- already crawled (or false positive, acceptable)
    else:
      crawl(url)
      bloom_filter.add(url)

1 billion URLs at 1% FP --> ~1.2 GB
vs. hash set --> ~16+ GB

Scenario 3: Real-Time Analytics Dashboard (HLL)

Each web server:
  PFADD page:/home user_id

At query time:
  PFMERGE daily_uniques server1:/home server2:/home ...
  PFCOUNT daily_uniques  --> approximate unique visitors

Supports arbitrary time-range queries by merging hourly/daily HLLs.

Scenario 4: Rate Limiter (Count-Min Sketch)

For each request:
  cms.increment(api_key)

On query:
  if cms.estimate(api_key) > rate_limit:
    reject(429)

Overcounting = occasionally throttle slightly early (safe direction).

Comparison Matrix

Structure        | Query       | Space                    | Error Type      | Deletions         | Mergeable
-----------------|-------------|--------------------------|-----------------|-------------------|------------------
Bloom Filter     | Membership  | O(n) bits                | False positives | No                | OR of bit arrays
Cuckoo Filter    | Membership  | O(n) bits                | False positives | Yes               | No (generally)
Count-Min Sketch | Frequency   | O((1/eps) * ln(1/delta)) | Overestimate    | Decrement (lossy) | Sum of counters
HyperLogLog      | Cardinality | O(1), ~12 KB             | +/- ~0.81%      | No                | Max of registers

Interview Tips

  1. Know when to reach for probabilistic DS. "Billions of items" or "memory is limited" are your signals.
  2. Explain the error trade-off upfront. "We accept a 1% false positive rate to reduce memory from 16 GB to 1.2 GB."
  3. Bloom filters are the most commonly asked. Know the sizing formula, false positive rate, and at least two real-world uses.
  4. HyperLogLog comes up in analytics design. "Count distinct users" at scale is a classic prompt.
  5. Don't over-engineer. If the dataset fits in memory, use a hash set. Probabilistic structures shine at scale.
  6. Mention the no-false-negative property. This is what makes Bloom filters safe for cache lookups and dedup.

Resources

  • DDIA (Kleppmann) -- Chapter 3: Storage and Retrieval (Bloom filters in LSM-trees)
  • System Design Interview Vol. 2 (Alex Xu) -- Bloom filter in web crawler chapter
  • Original Paper: "Space/Time Trade-offs in Hash Coding with Allowable Errors" (Bloom, 1970)
  • "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm" (Flajolet et al., 2007)
  • "An Improved Data Stream Summary: The Count-Min Sketch" (Cormode & Muthukrishnan, 2005)
  • "Cuckoo Filter: Practically Better Than Bloom" (Fan et al., 2014)
  • Redis Documentation: HyperLogLog commands (PFADD, PFCOUNT, PFMERGE)
  • RedisBloom module documentation (BF.ADD, BF.EXISTS)
  • Google Bigtable Paper: Bloom filter usage in SSTable lookups
