18 - Unique ID Generation

Previous: 17 - Rate Limiting & Throttling | Next: 19 - Bloom Filters & Probabilistic Data Structures


Why Distributed ID Generation Is Hard

In a single database, AUTO_INCREMENT gives you unique, ordered IDs for free. In distributed systems there is no single authority to hand out the next number. You need IDs that are:

  • Globally unique across all nodes, data centers, and time
  • Generated without coordination (no single point of failure)
  • Roughly sortable (feeds, pagination, event ordering)
  • Compact enough to serve as efficient primary keys and index entries

The choice of ID scheme directly impacts database performance (B-tree insert patterns), API design (URL length), debugging (can you decode when an event happened?), and even security (do sequential IDs leak business metrics?).


Approaches at a Glance

Approach           Size       Sortable    Coordination       Uniqueness        Clock Dep.
UUID v4            128 bit    No          None               Statistical       No
Auto-increment     32/64 bit  Yes         Single DB          Absolute          No
DB Ticket Server   64 bit     Yes         Central server(s)  Absolute          No
Twitter Snowflake  64 bit     Yes (time)  None (config)      Structural        Yes
ULID               128 bit    Yes (time)  None               Statistical+time  Yes
MongoDB ObjectID   96 bit     Yes (time)  None               Structural        Yes
Instagram ID       64 bit     Yes (time)  DB-based           Structural        Yes

Approach Deep Dives

1. UUID v4 (Random)

550e8400-e29b-41d4-a716-446655440000
|--------|----|----|----|------------|
  random  rand  v4   var    random

128 bits, of which 122 are random. The collision probability for N generated UUIDs:

P(collision) ~ N^2 / (2 * 2^122)

For 1 billion UUIDs:
  P ~ (10^9)^2 / (2 * 5.3 * 10^36) ~ 10^-19   (negligible)
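The estimate above is easy to check numerically; a quick sketch (function name is illustrative):

```python
# Birthday-bound approximation: P(collision) ~ N^2 / (2 * 2^122)
def uuid4_collision_probability(n: int) -> float:
    return n * n / (2 * 2**122)

uuid4_collision_probability(10**9)  # ~9.4e-20 -- negligible
```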

Pros:

  • Zero coordination -- any node generates independently
  • No clock dependency
  • Available in every language's standard library

Cons:

  • Not sortable -- random inserts cause B-tree page splits and write amplification
  • 36 characters as string -- large in URLs and storage
  • No embedded metadata (no timestamp, no origin info)

When to use: Idempotency keys, correlation IDs, internal references where sort order is irrelevant.

2. Database Auto-Increment (The Problem)

```sql
CREATE TABLE users (
    id BIGINT AUTO_INCREMENT PRIMARY KEY
);
```

Works perfectly on one database. Breaks in distributed systems:

  • Single point of failure -- one DB owns the sequence
  • Bottleneck -- every insert must coordinate with the authority
  • Replication conflicts -- two replicas may generate the same ID
  • Information leakage -- sequential IDs reveal business metrics (order volume, user count)

Workaround (odd/even):

DB1: 1, 3, 5, 7, 9, ...
DB2: 2, 4, 6, 8, 10, ...

Fragile: adding a third DB requires renumbering the entire scheme.

3. Database Ticket Server (Flickr Approach)

Two dedicated MySQL servers whose only job is generating IDs.

Ticket Server A:  auto_increment_increment = 2, offset = 1
                  --> 1, 3, 5, 7, ...

Ticket Server B:  auto_increment_increment = 2, offset = 2
                  --> 2, 4, 6, 8, ...

  +--------+      +------------------+
  | App    | ---> | Ticket Server A  |  (odd IDs)
  | Server |      +------------------+
  |        |
  |        |      +------------------+
  |        | ---> | Ticket Server B  |  (even IDs)
  +--------+      +------------------+
        Round-robin between A and B
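The scheme can be simulated in-process; a minimal sketch (the real Flickr setup uses MySQL's auto_increment_increment and auto_increment_offset settings, and the class here just mimics them):

```python
import itertools

class TicketServer:
    """Mimics a MySQL server configured with a given
    auto_increment_offset and auto_increment_increment."""
    def __init__(self, offset: int, increment: int):
        self.next_value = offset
        self.increment = increment

    def ticket(self) -> int:
        t = self.next_value
        self.next_value += self.increment
        return t

# Server A hands out odd IDs, server B even IDs; the app round-robins.
servers = itertools.cycle([TicketServer(1, 2), TicketServer(2, 2)])

def next_id() -> int:
    return next(servers).ticket()
```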

Pros:

  • 64-bit numeric IDs (compact, sortable)
  • Simple to understand and operate
  • Two servers provide basic redundancy

Cons:

  • Still a centralized dependency
  • Network hop for every ID generation
  • Scaling to more than 2 servers requires changing the increment/offset scheme

Used by: Flickr (historically).

4. Twitter Snowflake (The Gold Standard)

64-bit ID packed with structured information. No runtime coordination.

Bit layout (64 bits total):

 1 bit   41 bits            5 bits    5 bits      12 bits
+-----+--------------------+--------+---------+------------+
|  0  |   Timestamp (ms)   | DC ID  | Machine | Sequence   |
+-----+--------------------+--------+---------+------------+
  ^          ^                 ^        ^           ^
  |          |                 |        |           |
  sign     ms since          0-31     0-31       0-4095
  bit      custom epoch                          per ms

Capacity math:

  • 41 bits timestamp: 2^41 ms = ~69.7 years from custom epoch
  • 5 bits datacenter: 32 datacenters
  • 5 bits machine: 32 machines per DC = 1,024 machines total
  • 12 bits sequence: 4,096 IDs per millisecond per machine
  • Peak: ~4 million IDs/sec per machine

Generation algorithm:

  1. Get current timestamp in milliseconds
  2. If same millisecond as last ID, increment sequence
  3. If sequence overflows (>4095), spin-wait for next millisecond
  4. Combine: (timestamp << 22) | (dc_id << 17) | (machine_id << 12) | sequence
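The four steps above can be sketched as follows (the epoch constant is illustrative; real deployments pick their own and never change it):

```python
import threading
import time

# Illustrative custom epoch: 2020-01-01 UTC, in milliseconds.
EPOCH_MS = 1_577_836_800_000

class SnowflakeGenerator:
    """Sketch of the algorithm: 41-bit timestamp | 5-bit DC |
    5-bit machine | 12-bit sequence."""

    def __init__(self, dc_id: int, machine_id: int):
        assert 0 <= dc_id < 32 and 0 <= machine_id < 32
        self.dc_id = dc_id
        self.machine_id = machine_id
        self.last_ts = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def _now_ms(self) -> int:
        return int(time.time() * 1000) - EPOCH_MS

    def next_id(self) -> int:
        with self.lock:
            ts = self._now_ms()
            if ts < self.last_ts:
                # Clock rollback: refuse rather than risk duplicates.
                raise RuntimeError("clock moved backwards")
            if ts == self.last_ts:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # Sequence overflow: spin-wait for the next millisecond.
                    while ts <= self.last_ts:
                        ts = self._now_ms()
            else:
                self.sequence = 0
            self.last_ts = ts
            return (ts << 22) | (self.dc_id << 17) | \
                   (self.machine_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, IDs from one generator are strictly increasing; the lock makes a single instance safe to share across threads.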

Pros:

  • 64 bits -- fits in BIGINT, excellent B-tree indexing
  • Time-sortable -- IDs created later are always numerically larger
  • No runtime coordination -- each machine generates independently
  • Extractable metadata -- decode timestamp, DC, machine from any ID

Cons:

  • Clock dependency -- clock rollback can cause duplicates
  • Requires machine/DC ID assignment (one-time setup via ZooKeeper, etcd, or static config)
  • 69-year limit from epoch (choose your epoch wisely)

Clock skew mitigation:

  • Use NTP with continuous monitoring
  • If clock goes backward, refuse to generate IDs until it catches up
  • Alert on clock skew events

5. ULID (Universally Unique Lexicographically Sortable Identifier)

01ARZ3NDEKTSV4RRFFQ69G5FAV
|---------|----------------|
 10 chars      16 chars
 timestamp     randomness
 (48 bits)     (80 bits)

  • 128 bits total: 48-bit timestamp (ms precision) + 80-bit random
  • Crockford's Base32 encoding (26 characters, case-insensitive, no ambiguous chars)
  • Lexicographically sortable as raw strings
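A minimal generator sketch, following the 48 + 80 bit layout above (no monotonic same-millisecond handling, which the spec leaves optional):

```python
import os
import time

# Crockford Base32 alphabet: 32 chars, no I/L/O/U, ASCII-ascending.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def ulid() -> str:
    ts = int(time.time() * 1000) & ((1 << 48) - 1)  # 48-bit ms timestamp
    rand = int.from_bytes(os.urandom(10), "big")    # 80 random bits
    value = (ts << 80) | rand                       # 128 bits total
    # 26 chars x 5 bits = 130 bits; the top 2 bits are always zero.
    return "".join(CROCKFORD[(value >> shift) & 0x1F]
                   for shift in range(125, -1, -5))
```

Since the timestamp sits in the high bits and the alphabet is ASCII-ascending, plain string comparison orders ULIDs by creation time.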

Pros:

  • String-sortable (unlike UUID v4)
  • Zero coordination
  • High entropy (80 random bits vs Snowflake's 12 sequence bits)
  • Drop-in for UUID storage columns (same 128-bit size)

Cons:

  • 128 bits (twice Snowflake)
  • Clock dependency for ordering correctness
  • Monotonicity within same millisecond not guaranteed by spec (some implementations add monotonic random increment)

When to use: Event logs, audit trails, any system needing UUID-like simplicity plus sortability.

6. MongoDB ObjectID

|---4 bytes---|---5 bytes----|---3 bytes----|
   timestamp      random         counter
   (seconds)      (per-process)  (incrementing)

  • 96 bits (12 bytes), displayed as 24-character hex string
  • Timestamp is in seconds (not ms) -- coarser ordering
  • 5 bytes of process-scoped randomness
  • 3-byte incrementing counter for same-second generation
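Because the first 4 bytes are a plain big-endian Unix timestamp, the creation time can be read straight out of the hex string; a small sketch (function name and example value are illustrative):

```python
from datetime import datetime, timezone

def objectid_datetime(oid_hex: str) -> datetime:
    """The first 4 bytes (8 hex chars) of an ObjectID are
    seconds since the Unix epoch."""
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

objectid_datetime("507f1f77bcf86cd799439011")  # 2012-10-17 UTC
```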

Pros:

  • Built into MongoDB
  • Roughly time-sortable
  • No coordination required

Cons:

  • 96 bits -- an awkward size for non-MongoDB systems
  • Second-level timestamp precision only
  • MongoDB-specific format

7. Instagram ID (PostgreSQL Sharding)

Built on PostgreSQL with logical shards:

  41 bits timestamp | 13 bits shard ID | 10 bits auto-increment

Each PostgreSQL shard has a PL/pgSQL function:

```sql
CREATE OR REPLACE FUNCTION next_id(OUT result BIGINT) AS $$
DECLARE
    our_epoch BIGINT := 1314220021721;
    seq_id    BIGINT;
    now_ms    BIGINT;
    shard_id  INT := 5;  -- unique per shard
BEGIN
    SELECT nextval('table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_ms;
    result := (now_ms - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE plpgsql;
```

Pros:

  • 64-bit, time-sortable IDs
  • No external dependency beyond PostgreSQL

Cons:

  • Tied to the DB sharding strategy
  • Requires careful shard ID management


Trade-Off Matrix

Factor             UUID v4    Snowflake    ULID       ObjectID   Ticket Server
Size (bits)        128        64           128        96         64
Time sortable      No         Yes          Yes        Roughly    Yes
Coordination       None       Config only  None       None       Runtime
Throughput         Unlimited  4K/ms/node   Unlimited  ~16M/s     DB-bound
Clock sensitive    No         Yes          Yes        Yes        No
Index friendly     Poor       Excellent    Good       Good       Excellent
Supports deletion  N/A        N/A          N/A        N/A        N/A
Info leakage       None       Time+DC      Time       Time       Ordering

Decision Flowchart

Need 64-bit numeric IDs?
  |
  +-- Yes --> Need time-sortability?
  |             |
  |             +-- Yes --> Snowflake / Instagram ID
  |             |
  |             +-- No  --> Ticket Server
  |
  +-- No  --> Need string sortability?
                |
                +-- Yes --> ULID
                |
                +-- No  --> UUID v4

Decision factors:

  1. Database engine: B-tree indexes strongly favor sequential/sorted keys
  2. Scale: Snowflake handles millions/sec; ticket servers hit limits quickly
  3. Multi-region: Snowflake excels (DC bits); ticket servers struggle across regions
  4. Simplicity: UUID v4 if you only need uniqueness and nothing else
  5. Security: Random IDs (UUID v4, ULID) do not leak generation patterns

Interview Tips

  1. Don't default to UUID. Show you understand the B-tree write amplification problem with random keys.
  2. Know Snowflake cold. Draw the bit layout, explain clock skew handling, calculate throughput.
  3. Discuss B-tree implications. Random UUIDs cause page splits; sorted IDs append sequentially.
  4. Mention epoch choice. A custom epoch (not Unix 1970) maximizes the usable 69-year range.
  5. Address failure modes. Clock rollback, machine ID reuse, sequence overflow.
  6. Multi-tenancy angle. Embed tenant or shard information in ID bits for routing.

Common Follow-Up Questions

  • "How do you assign machine IDs?" -- ZooKeeper, etcd, or static configuration
  • "What if two machines get the same machine ID?" -- DC + machine bits reduce risk; monitor with alerts
  • "How do you handle clock rollback?" -- Wait, refuse generation, or use a logical clock hybrid
  • "Can you extract the creation time from an ID?" -- Yes for Snowflake/ULID/ObjectID; show the bit-shift math
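For the last question, the extraction is a few masks and shifts; a sketch assuming the 41/5/5/12 Snowflake layout described above (function name is illustrative):

```python
def decode_snowflake(sf_id: int, epoch_ms: int = 0):
    """Unpack a Snowflake ID under the 41/5/5/12 bit layout."""
    sequence = sf_id & 0xFFF              # low 12 bits
    machine  = (sf_id >> 12) & 0x1F       # next 5 bits
    dc       = (sf_id >> 17) & 0x1F       # next 5 bits
    ts_ms    = (sf_id >> 22) + epoch_ms   # 41 bits + chosen epoch
    return ts_ms, dc, machine, sequence

# Round-trip check against a hand-packed ID:
packed = (123_456_789 << 22) | (3 << 17) | (7 << 12) | 42
decode_snowflake(packed)  # (123456789, 3, 7, 42)
```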

Resources

  • DDIA (Kleppmann) -- Chapter 6: Partitioning (unique ID generation for distributed keys)
  • System Design Interview Vol. 1 (Alex Xu) -- Chapter 7: Design a Unique ID Generator
  • Twitter Engineering Blog: "Announcing Snowflake" (2010)
  • Instagram Engineering: "Sharding & IDs at Instagram"
  • ULID Specification: https://github.com/ulid/spec
  • MongoDB ObjectID documentation
  • Flickr Engineering: "Ticket Servers: Distributed Unique Primary Keys on the Cheap"

Previous: 17 - Rate Limiting & Throttling | Next: 19 - Bloom Filters & Probabilistic Data Structures