18 - Unique ID Generation

Previous: 17 - Rate Limiting & Throttling | Next: 19 - Bloom Filters & Probabilistic Data Structures


Why Distributed ID Generation Is Hard

In a single database, AUTO_INCREMENT gives you unique, ordered IDs for free. In distributed systems there is no single authority to hand out the next number. You need IDs that are:

  • Globally unique across all nodes, data centers, and time
  • Generated without coordination (no single point of failure)
  • Roughly sortable (feeds, pagination, event ordering)
  • Compact enough to serve as efficient primary keys and index entries

The choice of ID scheme directly impacts database performance (B-tree insert patterns), API design (URL length), debugging (can you decode when an event happened?), and even security (do sequential IDs leak business metrics?).


Approaches at a Glance

Approach           Size       Sortable    Coordination       Uniqueness        Clock Dep.
UUID v4            128 bit    No          None               Statistical       No
Auto-increment     32/64 bit  Yes         Single DB          Absolute          No
DB Ticket Server   64 bit     Yes         Central server(s)  Absolute          No
Twitter Snowflake  64 bit     Yes (time)  None (config)      Structural        Yes
ULID               128 bit    Yes (time)  None               Statistical+time  Yes
MongoDB ObjectID   96 bit     Yes (time)  None               Structural        Yes
Instagram ID       64 bit     Yes (time)  DB-based           Structural        Yes

Approach Deep Dives

1. UUID v4 (Random)

550e8400-e29b-41d4-a716-446655440000
|--------|----|----|----|------------|
  random  rand  v4   var    random

128 bits, of which 122 are random. The collision probability for N generated UUIDs:

P(collision) ~ N^2 / (2 * 2^122)

For 1 billion UUIDs:
  P ~ (10^9)^2 / (2 * 5.3 * 10^36) ~ 10^-19   (negligible)
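The estimate above is easy to check numerically; a quick sketch (function name is illustrative):

```python
# Birthday-bound approximation: P(collision) ~ N^2 / (2 * 2^122)
def uuid4_collision_probability(n: int) -> float:
    return n * n / (2 * 2**122)

uuid4_collision_probability(10**9)  # ~9.4e-20 -- negligible
```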

Pros:

  • Zero coordination -- any node generates independently
  • No clock dependency
  • Available in every language's standard library

Cons:

  • Not sortable -- random inserts cause B-tree page splits and write amplification
  • 36 characters as string -- large in URLs and storage
  • No embedded metadata (no timestamp, no origin info)

When to use: Idempotency keys, correlation IDs, internal references where sort order is irrelevant.

2. Database Auto-Increment (The Problem)

```sql
CREATE TABLE users (
    id BIGINT AUTO_INCREMENT PRIMARY KEY
);
```

Works perfectly on one database. Breaks in distributed systems:

  • Single point of failure -- one DB owns the sequence
  • Bottleneck -- every insert must coordinate with the authority
  • Replication conflicts -- two replicas may generate the same ID
  • Information leakage -- sequential IDs reveal business metrics (order volume, user count)

Workaround (odd/even):

DB1: 1, 3, 5, 7, 9, ...
DB2: 2, 4, 6, 8, 10, ...

Fragile: adding a third DB requires renumbering the entire scheme.

3. Database Ticket Server (Flickr Approach)

Two dedicated MySQL servers whose only job is generating IDs.

Ticket Server A:  auto_increment_increment = 2, offset = 1
                  --> 1, 3, 5, 7, ...

Ticket Server B:  auto_increment_increment = 2, offset = 2
                  --> 2, 4, 6, 8, ...

  +--------+      +------------------+
  | App    | ---> | Ticket Server A  |  (odd IDs)
  | Server |      +------------------+
  |        |
  |        |      +------------------+
  |        | ---> | Ticket Server B  |  (even IDs)
  +--------+      +------------------+
        Round-robin between A and B
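The scheme can be simulated in-process; a minimal sketch (the real Flickr setup uses MySQL's auto_increment_increment and auto_increment_offset settings, and the class here just mimics them):

```python
import itertools

class TicketServer:
    """Mimics a MySQL server configured with a given
    auto_increment_offset and auto_increment_increment."""
    def __init__(self, offset: int, increment: int):
        self.next_value = offset
        self.increment = increment

    def ticket(self) -> int:
        t = self.next_value
        self.next_value += self.increment
        return t

# Server A hands out odd IDs, server B even IDs; the app round-robins.
servers = itertools.cycle([TicketServer(1, 2), TicketServer(2, 2)])

def next_id() -> int:
    return next(servers).ticket()
```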

Pros:

  • 64-bit numeric IDs (compact, sortable)
  • Simple to understand and operate
  • Two servers provide basic redundancy

Cons:

  • Still a centralized dependency
  • Network hop for every ID generation
  • Scaling to more than 2 servers requires changing the increment/offset scheme

Used by: Flickr (historically).

4. Twitter Snowflake (The Gold Standard)

64-bit ID packed with structured information. No runtime coordination.

Bit layout (64 bits total):

 1 bit   41 bits            5 bits    5 bits      12 bits
+-----+--------------------+--------+---------+------------+
|  0  |   Timestamp (ms)   | DC ID  | Machine | Sequence   |
+-----+--------------------+--------+---------+------------+
  ^          ^                 ^        ^           ^
  |          |                 |        |           |
  sign     ms since          0-31     0-31       0-4095
  bit      custom epoch                          per ms

Capacity math:

  • 41 bits timestamp: 2^41 ms = ~69.7 years from custom epoch
  • 5 bits datacenter: 32 datacenters
  • 5 bits machine: 32 machines per DC = 1,024 machines total
  • 12 bits sequence: 4,096 IDs per millisecond per machine
  • Peak: ~4 million IDs/sec per machine

Generation algorithm:

  1. Get current timestamp in milliseconds
  2. If same millisecond as last ID, increment sequence
  3. If sequence overflows (>4095), spin-wait for next millisecond
  4. Combine: (timestamp << 22) | (dc_id << 17) | (machine_id << 12) | sequence
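The four steps above can be sketched as follows (the epoch constant is illustrative; real deployments pick their own and never change it):

```python
import threading
import time

# Illustrative custom epoch: 2020-01-01 UTC, in milliseconds.
EPOCH_MS = 1_577_836_800_000

class SnowflakeGenerator:
    """Sketch of the algorithm: 41-bit timestamp | 5-bit DC |
    5-bit machine | 12-bit sequence."""

    def __init__(self, dc_id: int, machine_id: int):
        assert 0 <= dc_id < 32 and 0 <= machine_id < 32
        self.dc_id = dc_id
        self.machine_id = machine_id
        self.last_ts = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def _now_ms(self) -> int:
        return int(time.time() * 1000) - EPOCH_MS

    def next_id(self) -> int:
        with self.lock:
            ts = self._now_ms()
            if ts < self.last_ts:
                # Clock rollback: refuse rather than risk duplicates.
                raise RuntimeError("clock moved backwards")
            if ts == self.last_ts:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # Sequence overflow: spin-wait for the next millisecond.
                    while ts <= self.last_ts:
                        ts = self._now_ms()
            else:
                self.sequence = 0
            self.last_ts = ts
            return (ts << 22) | (self.dc_id << 17) | \
                   (self.machine_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, IDs from one generator are strictly increasing; the lock makes a single instance safe to share across threads.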

Pros:

  • 64 bits -- fits in BIGINT, excellent B-tree indexing
  • Time-sortable -- IDs created later are always numerically larger
  • No runtime coordination -- each machine generates independently
  • Extractable metadata -- decode timestamp, DC, machine from any ID

Cons:

  • Clock dependency -- clock rollback can cause duplicates
  • Requires machine/DC ID assignment (one-time setup via ZooKeeper, etcd, or static config)
  • 69-year limit from epoch (choose your epoch wisely)

Clock skew mitigation:

  • Use NTP with continuous monitoring
  • If clock goes backward, refuse to generate IDs until it catches up
  • Alert on clock skew events

5. ULID (Universally Unique Lexicographically Sortable Identifier)

01ARZ3NDEKTSV4RRFFQ69G5FAV
|---------|----------------|
 10 chars      16 chars
 timestamp     randomness
 (48 bits)     (80 bits)

  • 128 bits total: 48-bit timestamp (ms precision) + 80-bit random
  • Crockford's Base32 encoding (26 characters, case-insensitive, no ambiguous chars)
  • Lexicographically sortable as raw strings
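A minimal generator sketch, following the 48 + 80 bit layout above (no monotonic same-millisecond handling, which the spec leaves optional):

```python
import os
import time

# Crockford Base32 alphabet: 32 chars, no I/L/O/U, ASCII-ascending.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid() -> str:
    ts = int(time.time() * 1000) & ((1 << 48) - 1)  # 48-bit ms timestamp
    rand = int.from_bytes(os.urandom(10), "big")    # 80 random bits
    value = (ts << 80) | rand                       # 128 bits total
    # 26 chars x 5 bits = 130 bits; the top 2 bits are always zero.
    return "".join(CROCKFORD[(value >> shift) & 0x1F]
                   for shift in range(125, -1, -5))
```

Since the timestamp sits in the high bits and the alphabet is ASCII-ascending, plain string comparison orders ULIDs by creation time.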

Pros:

  • String-sortable (unlike UUID v4)
  • Zero coordination
  • High entropy (80 random bits vs Snowflake's 12 sequence bits)
  • Drop-in for UUID storage columns (same 128-bit size)

Cons:

  • 128 bits (twice Snowflake)
  • Clock dependency for ordering correctness
  • Monotonicity within same millisecond not guaranteed by spec (some implementations add monotonic random increment)

When to use: Event logs, audit trails, any system needing UUID-like simplicity plus sortability.

6. MongoDB ObjectID

|---4 bytes---|---5 bytes----|---3 bytes----|
   timestamp      random         counter
   (seconds)      (per-process)  (incrementing)

  • 96 bits (12 bytes), displayed as 24-character hex string
  • Timestamp is in seconds (not ms) -- coarser ordering
  • 5 bytes of process-scoped randomness
  • 3-byte incrementing counter for same-second generation
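Because the first 4 bytes are a plain big-endian Unix timestamp, the creation time can be read straight out of the hex string; a small sketch (function name and example value are illustrative):

```python
from datetime import datetime, timezone

def objectid_datetime(oid_hex: str) -> datetime:
    """The first 4 bytes (8 hex chars) of an ObjectID are
    seconds since the Unix epoch."""
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

objectid_datetime("507f1f77bcf86cd799439011")  # 2012-10-17 UTC
```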

Pros:

  • Built into MongoDB
  • Roughly time-sortable
  • No coordination required

Cons:

  • 96 bits -- an awkward size for non-MongoDB systems
  • Second-level timestamp precision only
  • MongoDB-specific format

7. Instagram ID (PostgreSQL Sharding)

Built on PostgreSQL with logical shards:

  41 bits timestamp | 13 bits shard ID | 10 bits auto-increment

Each PostgreSQL shard has a PL/pgSQL function:

```sql
CREATE OR REPLACE FUNCTION next_id(OUT result BIGINT) AS $$
DECLARE
    our_epoch BIGINT := 1314220021721;
    seq_id    BIGINT;
    now_ms    BIGINT;
    shard_id  INT := 5;  -- unique per shard
BEGIN
    SELECT nextval('table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_ms;
    result := (now_ms - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE plpgsql;
```

Pros:

  • 64-bit, time-sortable IDs
  • No external dependency beyond PostgreSQL

Cons:

  • Tied to the DB sharding strategy
  • Requires careful shard ID management


Trade-Off Matrix

Factor             UUID v4    Snowflake    ULID       ObjectID   Ticket Server
Size (bits)        128        64           128        96         64
Time sortable      No         Yes          Yes        Roughly    Yes
Coordination       None       Config only  None       None       Runtime
Throughput         Unlimited  4K/ms/node   Unlimited  ~16M/s     DB-bound
Clock sensitive    No         Yes          Yes        Yes        No
Index friendly     Poor       Excellent    Good       Good       Excellent
Supports deletion  N/A        N/A          N/A        N/A        N/A
Info leakage       None       Time+DC      Time       Time       Ordering

Decision Flowchart

Need 64-bit numeric IDs?
  |
  +-- Yes --> Need time-sortability?
  |             |
  |             +-- Yes --> Snowflake / Instagram ID
  |             |
  |             +-- No  --> Ticket Server
  |
  +-- No  --> Need string sortability?
                |
                +-- Yes --> ULID
                |
                +-- No  --> UUID v4

Decision factors:

  1. Database engine: B-tree indexes strongly favor sequential/sorted keys
  2. Scale: Snowflake handles millions/sec; ticket servers hit limits quickly
  3. Multi-region: Snowflake excels (DC bits); ticket servers struggle across regions
  4. Simplicity: UUID v4 if you only need uniqueness and nothing else
  5. Security: Random IDs (UUID v4, ULID) do not leak generation patterns

Interview Tips

  1. Don't default to UUID. Show you understand the B-tree write amplification problem with random keys.
  2. Know Snowflake cold. Draw the bit layout, explain clock skew handling, calculate throughput.
  3. Discuss B-tree implications. Random UUIDs cause page splits; sorted IDs append sequentially.
  4. Mention epoch choice. A custom epoch (not Unix 1970) maximizes the usable 69-year range.
  5. Address failure modes. Clock rollback, machine ID reuse, sequence overflow.
  6. Multi-tenancy angle. Embed tenant or shard information in ID bits for routing.

Common Follow-Up Questions

  • "How do you assign machine IDs?" -- ZooKeeper, etcd, or static configuration
  • "What if two machines get the same machine ID?" -- DC + machine bits reduce risk; monitor with alerts
  • "How do you handle clock rollback?" -- Wait, refuse generation, or use a logical clock hybrid
  • "Can you extract the creation time from an ID?" -- Yes for Snowflake/ULID/ObjectID; show the bit-shift math
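For the last question, the extraction is a few masks and shifts; a sketch assuming the 41/5/5/12 Snowflake layout described above (function name is illustrative):

```python
def decode_snowflake(sf_id: int, epoch_ms: int = 0):
    """Unpack a Snowflake ID under the 41/5/5/12 bit layout."""
    sequence = sf_id & 0xFFF              # low 12 bits
    machine  = (sf_id >> 12) & 0x1F       # next 5 bits
    dc       = (sf_id >> 17) & 0x1F       # next 5 bits
    ts_ms    = (sf_id >> 22) + epoch_ms   # 41 bits + chosen epoch
    return ts_ms, dc, machine, sequence

# Round-trip check against a hand-packed ID:
packed = (123_456_789 << 22) | (3 << 17) | (7 << 12) | 42
decode_snowflake(packed)  # (123456789, 3, 7, 42)
```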

Resources

  • DDIA (Kleppmann) -- Chapter 6: Partitioning (unique ID generation for distributed keys)
  • System Design Interview Vol. 1 (Alex Xu) -- Chapter 7: Design a Unique ID Generator
  • Twitter Engineering Blog: "Announcing Snowflake" (2010)
  • Instagram Engineering: "Sharding & IDs at Instagram"
  • ULID Specification: https://github.com/ulid/spec
  • MongoDB ObjectID documentation
  • Flickr Engineering: "Ticket Servers: Distributed Unique Primary Keys on the Cheap"

Previous: 17 - Rate Limiting & Throttling | Next: 19 - Bloom Filters & Probabilistic Data Structures