19 - Bloom Filters & Probabilistic Data Structures
Previous: 18 - Unique ID Generation | Next: 20 - Object Storage & File Systems
Why Probabilistic Data Structures Matter at Scale
At FAANG scale, exact answers are sometimes too expensive. When you have billions of items, even a hash set requires enormous memory. Probabilistic data structures trade a small, bounded error rate for massive savings in memory and speed.
Core principle: Accept a small probability of error to gain orders-of-magnitude improvements in space and time.
| Problem | Exact Cost | Probabilistic Cost | Structure |
|---|---|---|---|
| "Have I seen this URL?" (1B items) | ~16 GB (hash set) | ~1.2 GB (1% FP) | Bloom filter |
| "How many distinct users?" (1B events) | ~16 GB | ~12 KB (0.81% error) | HyperLogLog |
| "How often does this word appear?" | O(n) counters | O(1) per query | Count-Min Sketch |
Bloom Filters
How They Work
A Bloom filter is a bit array of size m with k independent hash functions.
Insert("apple"):
h1("apple") = 3 h2("apple") = 7 h3("apple") = 11
Bit array (m = 16):
[0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0]
       ^       ^       ^
       3       7       11
Query("apple"):
Check bits 3, 7, 11 --> all 1 --> "PROBABLY IN SET"
Query("banana"):
h1("banana") = 3 h2("banana") = 9 h3("banana") = 14
Check bits 3, 9, 14 --> bit 9 is 0 --> "DEFINITELY NOT IN SET"
Key properties:
- No false negatives: If the filter says "not present," it is definitely not present
- Possible false positives: If the filter says "present," it might not be (bits set by other items)
- No deletions: You cannot remove items (clearing bits may affect other items)
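The insert/query mechanics above can be sketched in a few lines. This is an illustrative toy, not a production implementation; it derives the k hash functions from two base hashes (a standard double-hashing trick), and the class name and method names are my own.

```python
# Minimal Bloom filter sketch: a bit array of m bits and k hash
# functions derived from one SHA-256 digest via double hashing.
import hashlib


class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m                          # number of bits
        self.k = k                          # number of hash functions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        # k indexes as (h1 + i*h2) mod m -- double hashing
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # all k bits set -> "probably in set"; any bit clear -> definitely not
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(m=16, k=3)
bf.add("apple")
print(bf.might_contain("apple"))   # True -- no false negatives
```

Note that `add` only ever sets bits, which is exactly why deletion is unsupported: clearing a bit could erase evidence of other items.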
False Positive Rate
The probability of a false positive with n items inserted:
FP rate = (1 - e^(-kn/m))^k
Where:
m = number of bits
k = number of hash functions
n = number of items inserted
Optimal number of hash functions:
k_optimal = (m/n) * ln(2) ~ 0.693 * (m/n)
Sizing Formula
Given a desired false positive rate p and expected number of items n:
m = -(n * ln(p)) / (ln(2))^2
Example: n = 1 billion, p = 1% (0.01)
m = -(1e9 * ln(0.01)) / (ln(2))^2
m = -(1e9 * (-4.605)) / 0.4805
m ~ 9.585 billion bits ~ 1.2 GB
k = 0.693 * (9.585e9 / 1e9) ~ 7 hash functions
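The sizing arithmetic above is easy to wrap in a helper. A small sketch (function name is my own) that applies m = -n ln(p) / (ln 2)^2 and k = (m/n) ln 2:

```python
# Bloom filter sizing from expected items n and target FP rate p.
import math


def bloom_size(n: int, p: float):
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits
    k = round((m / n) * math.log(2))                      # hash functions
    return m, k


m, k = bloom_size(1_000_000_000, 0.01)
print(f"{m / 8 / 1e9:.2f} GB, {k} hash functions")  # ~1.20 GB, 7 hash functions
```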
Sizing Quick Reference
| Items (n) | FP Rate (p) | Bits (m) | Memory | Hash Functions (k) |
|---|---|---|---|---|
| 1M | 1% | 9.6M | 1.2 MB | 7 |
| 10M | 1% | 96M | 12 MB | 7 |
| 100M | 1% | 960M | 120 MB | 7 |
| 1B | 1% | 9.6B | 1.2 GB | 7 |
| 1B | 0.1% | 14.4B | 1.8 GB | 10 |
Use Cases
- Cache penetration prevention: Check the Bloom filter before hitting the database. If the filter says "not present," skip the DB query entirely. Prevents attackers from flooding your cache with requests for non-existent keys.
Request for key "xyz123"
        |
        v
+--------------+  "maybe"   +-------+  miss  +----+
| Bloom Filter | ---------> | Cache | -----> | DB |
+--------------+            +-------+        +----+
        |
 "definitely not in set"
        |
        v
Return 404 immediately (no cache or DB hit)
- Web crawler duplicate URL detection: Before crawling a URL, check if already visited. False positives mean skipping a few URLs (acceptable). False negatives would mean infinite loops (never happens with Bloom).
- Spell checker: Dictionary of valid words in a Bloom filter. "Probably valid" means accept; "definitely not" means flag as misspelling.
- Recommendation "already seen" filter: Per-user Bloom filter of consumed content IDs. A false positive hides one unseen item (minor). No false negatives means never re-showing content.
- Distributed databases (Cassandra, HBase, RocksDB): Each SSTable has a Bloom filter. Before reading a file from disk, check the filter. Avoids unnecessary disk I/O for keys that are not in that SSTable.
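The cache-penetration read path can be sketched as below. The `FakeBloom`, dict-backed cache, and dict-backed DB are hypothetical stand-ins for illustration, not any specific library's API:

```python
# Read path with a Bloom filter gate: filter -> cache -> DB.
class FakeBloom:
    """Set-backed stand-in; a real filter may also false-positive."""
    def __init__(self, keys):
        self.keys = set(keys)

    def might_contain(self, key):
        return key in self.keys


def get(key, bloom, cache, db):
    if not bloom.might_contain(key):
        return None                # "definitely not in DB": skip cache and DB
    value = cache.get(key)
    if value is not None:
        return value               # cache hit
    value = db.get(key)            # a Bloom false positive misses here
    if value is not None:
        cache[key] = value         # populate cache for next time
    return value


bloom = FakeBloom(["user:1"])
cache, db = {}, {"user:1": "Ada"}
print(get("user:1", bloom, cache, db))   # Ada (and now cached)
print(get("ghost", bloom, cache, db))    # None, without touching the DB
```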
Bloom Filter Variants
| Variant | Feature | Use Case |
|---|---|---|
| Counting Bloom | Replace bits with counters (4-bit) | Supports deletion |
| Scalable Bloom | Chain of growing filters | Unknown cardinality |
| Partitioned Bloom | Divide bit array among hash functions | Better cache behavior |
| Cuckoo Filter | Fingerprints in buckets | Deletion + better space at low FP |
Count-Min Sketch
Estimates the frequency of items in a stream. Think "approximate counter for everything."
How It Works
d hash functions, each mapping to a row of w counters:
w columns
+---+---+---+---+---+---+---+
h1: | 0 | 2 | 0 | 1 | 0 | 3 | 0 |
+---+---+---+---+---+---+---+
h2: | 1 | 0 | 0 | 0 | 2 | 0 | 0 |
+---+---+---+---+---+---+---+
h3: | 0 | 0 | 3 | 0 | 0 | 1 | 0 |
+---+---+---+---+---+---+---+
Insert("cat"):
h1("cat") = 3 --> row 1, col 3: increment
h2("cat") = 4 --> row 2, col 4: increment
h3("cat") = 2 --> row 3, col 2: increment
Query("cat"):
min(row1[3], row2[4], row3[2]) = min(1, 2, 3) = 1
Estimate >= true count (never underestimates)
Take minimum across rows to minimize overcount
Error Bounds
With probability >= 1 - delta:
estimate <= true_count + epsilon * N
Where:
N = total number of events
w = ceil(e / epsilon) -- width
d = ceil(ln(1 / delta)) -- depth (number of hash functions)
Example: epsilon = 0.001, delta = 0.01
w = ceil(2.718 / 0.001) = 2,719 columns
d = ceil(ln(100)) = ceil(4.6) = 5 rows
Total cells = 2,719 * 5 = 13,595
Memory (4 bytes/counter) ~ 53 KB
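Putting the sizing formulas and min-across-rows query together, a minimal sketch (class and method names are my own, not a library API):

```python
# Count-Min Sketch: d rows of w counters, sized from (eps, delta).
import hashlib
import math


class CountMinSketch:
    def __init__(self, eps: float, delta: float):
        self.w = math.ceil(math.e / eps)          # width
        self.d = math.ceil(math.log(1 / delta))   # depth (rows / hash functions)
        self.table = [[0] * self.w for _ in range(self.d)]

    def _cols(self, item: str):
        # one independent column index per row, via a salted hash
        for row in range(self.d):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.w

    def add(self, item: str, count: int = 1) -> None:
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # min across rows: never below the true count, may be above
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(item)))


cms = CountMinSketch(eps=0.001, delta=0.01)   # w = 2719, d = 5 as above
for _ in range(42):
    cms.add("cat")
print(cms.estimate("cat"))   # 42 -- exact here, since no other keys collide
```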
Use Cases
| Use Case | Why CMS |
|---|---|
| Heavy hitter detection | Find top-K most frequent items in a stream |
| Network traffic monitoring | Count packets per source IP without storing all IPs |
| Rate limiting by key | Approximate request counts per API key |
| Click fraud detection | Detect abnormally high click counts per ad |
| NLP word frequency | Approximate term frequency in large corpora |
HyperLogLog (Cardinality Estimation)
Answers: "How many distinct elements are in this set?" using remarkably little memory.
The Intuition
Hash each element. Look at the binary representation. The more leading zeros you see, the more unique items you have likely processed.
If you've hashed many items and the longest run of leading zeros is k,
then you've likely seen about 2^k distinct items.
Example:
hash("alice") = 0001... --> 3 leading zeros
hash("bob") = 0000001... --> 6 leading zeros
After many items, max leading zeros = 14
Estimate ~ 2^14 = 16,384 distinct items
How It Actually Works
- Hash each element to a uniform bit string
- Use the first p bits to assign the element to one of 2^p registers (buckets)
- Count leading zeros in the remaining bits + 1; store the maximum per register
- Combine all registers using a harmonic mean with bias correction
Registers (p = 4, so 16 registers):
Register: [ 3 5 2 7 4 6 3 5 8 2 4 6 3 5 7 4 ]
Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Estimate = alpha * m^2 / sum(2^(-register[i]))
Where alpha is a bias correction constant and m = 2^p
Memory and Accuracy
Standard error = 1.04 / sqrt(m)
Where m = number of registers = 2^p
p = 14 --> 2^14 = 16,384 registers
Each register: 6 bits (stores ranks up to 63, enough for 64-bit hashes)
Total memory: 16,384 * 6 / 8 = 12 KB
Standard error: 1.04 / sqrt(16384) = 0.81%
12 KB to count billions of distinct items with <1% error. This is why HyperLogLog is everywhere.
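The four steps above can be sketched as a simplified HyperLogLog. This is illustrative only: it uses the standard alpha bias constant for m >= 128 but omits the small-range and large-range corrections that production implementations (e.g. Redis) apply, so estimates for small n will be off.

```python
# Simplified HyperLogLog: p-bit register index, leading-zero rank,
# harmonic mean with bias correction. No range corrections.
import hashlib


class HyperLogLog:
    def __init__(self, p: int = 14):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # valid for m >= 128

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # mergeability: element-wise max of registers (what PFMERGE does)
        self.registers = [max(a, b)
                          for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        return self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)


hll = HyperLogLog(p=14)
for i in range(100_000):
    hll.add(f"user_{i}")
print(round(hll.count()))   # close to 100000 (standard error ~0.81%)
```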
Redis HyperLogLog
PFADD visitors "user_123"
PFADD visitors "user_456"
PFADD visitors "user_123" -- duplicate, no effect
PFCOUNT visitors -- returns 2
-- Merge multiple HLLs (e.g., from different servers or time windows)
PFMERGE all_visitors visitors_server1 visitors_server2
PFCOUNT all_visitors
Redis uses 12 KB per HyperLogLog key with p = 14.
Mergeability is a killer feature: you can count distinct users across time windows or servers by merging HLLs without accessing the original data.
Use Cases
| Use Case | Exact Alternative | HLL Advantage |
|---|---|---|
| Unique visitors per page | SET of user IDs (GB+) | 12 KB per page |
| Distinct search queries/day | Hash set (GB+) | 12 KB per day |
| Unique IPs hitting a service | Hash set | 12 KB per endpoint |
| Cardinality of DB columns | Full scan + dedup | Fast approximate for query planning |
Cuckoo Filters
An alternative to Bloom filters that supports deletion and can be more space-efficient at low false positive rates.
How It Works
Two possible bucket positions per item:
b1 = hash(x)
b2 = b1 XOR hash(fingerprint(x))
Insert: store fingerprint in b1 or b2
If both full: evict existing item to its alternate location (cuckoo hashing)
Bucket 0: [fp_a, fp_c]
Bucket 1: [fp_b, ____]
Bucket 2: [fp_d, fp_e]
Bucket 3: [____, ____]
Delete: remove the fingerprint from whichever bucket holds it
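The scheme above can be sketched as follows. Class and parameter names are my own; this toy keeps fingerprints in Python lists rather than packed bits, and it requires a power-of-two bucket count so the XOR mapping is its own inverse (b1 = alt(b2, fp)).

```python
# Minimal cuckoo filter: 8-bit fingerprints, two candidate buckets
# via b2 = b1 XOR hash(fingerprint), eviction on insert.
import hashlib
import random


class CuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        assert num_buckets & (num_buckets - 1) == 0  # power of two for XOR symmetry
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _hash(self, data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        return self._hash(b"fp:" + item.encode()) % 255 + 1  # 8-bit, nonzero

    def _alt(self, bucket: int, fp: int) -> int:
        return (bucket ^ self._hash(str(fp).encode())) % self.num_buckets

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        b1 = self._hash(item.encode()) % self.num_buckets
        for b in (b1, self._alt(b1, fp)):
            if len(self.buckets[b]) < self.bucket_size:
                self.buckets[b].append(fp)
                return True
        # both buckets full: evict residents to their alternate buckets
        b = random.choice((b1, self._alt(b1, fp)))
        for _ in range(self.max_kicks):
            victim = random.randrange(len(self.buckets[b]))
            fp, self.buckets[b][victim] = self.buckets[b][victim], fp
            b = self._alt(b, fp)
            if len(self.buckets[b]) < self.bucket_size:
                self.buckets[b].append(fp)
                return True
        return False  # too many kicks: filter is effectively full

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        b1 = self._hash(item.encode()) % self.num_buckets
        return fp in self.buckets[b1] or fp in self.buckets[self._alt(b1, fp)]

    def delete(self, item: str) -> bool:
        fp = self._fingerprint(item)
        b1 = self._hash(item.encode()) % self.num_buckets
        for b in (b1, self._alt(b1, fp)):
            if fp in self.buckets[b]:
                self.buckets[b].remove(fp)
                return True
        return False


cf = CuckooFilter()
cf.insert("apple")
print(cf.contains("apple"))   # True
cf.delete("apple")
print(cf.contains("apple"))   # False -- deletion works, unlike Bloom
```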
Bloom vs Cuckoo
| Feature | Bloom Filter | Cuckoo Filter |
|---|---|---|
| Deletion | No (unless counting) | Yes |
| Space at <3% FP target | Worse | Better |
| Space at >3% FP target | Better | Worse |
| Lookup speed | k hash computations | 2 lookups (constant) |
| Max load factor | N/A | ~95% |
| Implementation | Simple | Moderate |
Rule of thumb: Use Cuckoo when you need deletion or when the target FP rate is below 3%.
Practical Interview Applications
Scenario 1: URL Shortener Deduplication
Write path:
1. Check Bloom filter for long_url
2. If "not present" --> definitely new, create short URL
3. If "maybe present" --> query database to confirm
- Truly exists --> return existing short URL
- False positive --> create new, add to filter
Eliminates ~99% of database lookups for new URLs.
Scenario 2: Web Crawler Seen-URL Tracking
Crawl loop:
for url in frontier_queue:
if bloom_filter.might_contain(url):
skip -- already crawled (or false positive, acceptable)
else:
crawl(url)
bloom_filter.add(url)
1 billion URLs at 1% FP --> ~1.2 GB
vs. hash set --> ~16+ GB
Scenario 3: Real-Time Analytics Dashboard (HLL)
Each web server:
PFADD page:/home user_id
At query time:
PFMERGE daily_uniques server1:/home server2:/home ...
PFCOUNT daily_uniques --> approximate unique visitors
Supports arbitrary time-range queries by merging hourly/daily HLLs.
Scenario 4: Rate Limiter (Count-Min Sketch)
For each request:
cms.increment(api_key)
On query:
if cms.estimate(api_key) > rate_limit:
reject(429)
Overcounting = occasionally throttle slightly early (safe direction).
Comparison Matrix
| Structure | Query | Space | Error Type | Deletions | Mergeable |
|---|---|---|---|---|---|
| Bloom Filter | Membership | O(n) bits | False positives | No | OR of bit arrays |
| Cuckoo Filter | Membership | O(n) bits | False positives | Yes | No (generally) |
| Count-Min Sketch | Frequency | O(1/eps * ln(1/delta)) | Overestimate | Decrement (lossy) | Sum of counters |
| HyperLogLog | Cardinality | Fixed ~12 KB (p = 14) | +/- ~0.81% | No | Max of registers |
Interview Tips
- Know when to reach for probabilistic DS. "Billions of items" or "memory is limited" are your signals.
- Explain the error trade-off upfront. "We accept a 1% false positive rate to reduce memory from 16 GB to 1.2 GB."
- Bloom filters are the most commonly asked. Know the sizing formula, false positive rate, and at least two real-world uses.
- HyperLogLog comes up in analytics design. "Count distinct users" at scale is a classic prompt.
- Don't over-engineer. If the dataset fits in memory, use a hash set. Probabilistic structures shine at scale.
- Mention the no-false-negative property. This is what makes Bloom filters safe for cache lookups and dedup.
Resources
- DDIA (Kleppmann) -- Chapter 3: Storage and Retrieval (Bloom filters in LSM-trees)
- System Design Interview Vol. 2 (Alex Xu) -- Bloom filter in web crawler chapter
- Original Paper: "Space/Time Trade-offs in Hash Coding with Allowable Errors" (Bloom, 1970)
- "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm" (Flajolet et al., 2007)
- "An Improved Data Stream Summary: The Count-Min Sketch" (Cormode & Muthukrishnan, 2005)
- "Cuckoo Filter: Practically Better Than Bloom" (Fan et al., 2014)
- Redis Documentation: HyperLogLog commands (PFADD, PFCOUNT, PFMERGE)
- RedisBloom module documentation (BF.ADD, BF.EXISTS)
- Google Bigtable Paper: Bloom filter usage in SSTable lookups