Data Storage Fundamentals

Why This Matters

"Which database should we use?" comes up in every system design interview. The answer is always "it depends" — and you need to explain what it depends on.

SQL (Relational Databases)

Core Concepts

Tables with rows and columns, fixed schema
Relations via foreign keys (1:1, 1:N, M:N)
SQL — declarative query language
ACID transactions

ACID Properties

Property	Meaning	Example
Atomicity	All or nothing — transaction fully commits or fully rolls back	Bank transfer: debit AND credit both happen or neither
Consistency	Data satisfies all constraints after transaction	Balance never goes negative if constraint exists
Isolation	Concurrent transactions don't interfere	Two withdrawals don't see stale balance
Durability	Committed data survives crashes	Write-ahead log ensures data persists

Isolation Levels

Level	Dirty Read	Non-Repeatable Read	Phantom Read	Performance
Read Uncommitted	✅ possible	✅ possible	✅ possible	Fastest
Read Committed	❌ prevented	✅ possible	✅ possible	Fast
Repeatable Read	❌ prevented	❌ prevented	✅ possible	Medium
Serializable	❌ prevented	❌ prevented	❌ prevented	Slowest

Most production systems use Read Committed (PostgreSQL default) or Repeatable Read (MySQL InnoDB default).

When to Use SQL

Complex queries with JOINs
Transactions are critical (financial data, inventory)
Data has clear relationships
Need strong consistency
Schema is well-defined and stable

Popular SQL Databases

Database	Known For	Used By
PostgreSQL	Feature-rich, extensible, JSONB	Apple, Instagram, Uber
MySQL	Battle-tested, fast reads	Meta, Airbnb, GitHub
Amazon Aurora	MySQL/PG compatible, cloud-native	Many AWS users
Google Spanner	Globally distributed SQL	Google
CockroachDB	Distributed SQL, PostgreSQL wire-compatible

NoSQL Databases

BASE Properties (Contrast with ACID)

Property	Meaning
Basically Available	System guarantees availability (may return stale data)
Soft state	State may change over time even without new input
Eventually consistent	System will converge to consistent state eventually

Types of NoSQL

1. Key-Value Stores

Key → Value (opaque blob)
"user:123" → { serialized user data }

Use cases: Session storage, caching, simple lookups
Examples: Redis, Memcached, DynamoDB, etcd
Pros: Extremely fast, simple, horizontally scalable
Cons: No complex queries, no relationships

2. Document Stores

json
{
  "_id": "user_123",
  "name": "Zineddine",
  "orders": [
    { "id": "ord_1", "total": 59.99 },
    { "id": "ord_2", "total": 129.00 }
  ]
}

Use cases: Content management, user profiles, catalogs
Examples: MongoDB, CouchDB, Amazon DocumentDB
Pros: Flexible schema, natural JSON mapping, nested data
Cons: No JOINs (denormalize), transaction support varies

3. Wide-Column Stores

Row Key → { Column Family: { column: value, ... } }
"user:123" → { "profile": { "name": "Zineddine", "age": "28" },
                "activity": { "last_login": "2024-01-15", ... } }

Use cases: Time-series, IoT, analytics, write-heavy workloads
Examples: Cassandra, HBase, Google Bigtable
Pros: Massive write throughput, linear scalability
Cons: Limited query flexibility, eventual consistency

4. Graph Databases

(User:Zineddine)-[:FOLLOWS]->(User:John)
(User:John)-[:LIKES]->(Post:42)

Use cases: Social networks, recommendations, fraud detection, knowledge graphs
Examples: Neo4j, Amazon Neptune, JanusGraph
Pros: Efficient traversal of relationships
Cons: Not great for non-graph queries, scaling is harder

SQL vs NoSQL Decision Framework

Factor	Choose SQL	Choose NoSQL
Data model	Well-structured, relational	Flexible, denormalized
Query patterns	Complex queries, ad-hoc	Simple lookups, known patterns
Consistency	Strong consistency needed	Eventual consistency acceptable
Scale	Vertical + read replicas	Horizontal (built-in sharding)
Schema	Stable, well-defined	Evolving, schema-less
Transactions	Multi-row ACID needed	Single-document atomic ops OK
Examples	Banking, ERP, e-commerce inventory	Social feeds, IoT, caching, logs

Interview tip: Don't say "NoSQL is better for scale." Many SQL databases scale well. Be specific about what NoSQL gives you for your use case.

Data Modeling Concepts

Normalization (SQL)

1NF: Atomic values, no repeating groups
2NF: 1NF + no partial dependencies
3NF: 2NF + no transitive dependencies

Trade-off: Normalized data avoids redundancy but requires JOINs. Denormalized data is faster to read but harder to maintain.

Denormalization (NoSQL / Read-Heavy Systems)

Embed related data in the same document/row
Trade write complexity for read performance
Accept data duplication for speed

Example — Order with embedded items:

json
{
  "order_id": "123",
  "user_name": "Zineddine",  // duplicated from users
  "items": [
    { "product_name": "Keyboard", "price": 89.99 },
    { "product_name": "Mouse", "price": 49.99 }
  ],
  "total": 139.98
}

Data Access Patterns

Always start from how you'll query, then design the schema:

What are the primary queries?
What are the read/write ratios?
What data is accessed together?
How will data grow over time?

Storage Engines (Interview Bonus)

B-Tree (Traditional)

Used by PostgreSQL, MySQL InnoDB
Good for read-heavy workloads
O(log n) reads and writes
In-place updates

LSM-Tree (Log-Structured Merge-Tree)

Used by Cassandra, RocksDB, LevelDB, HBase
Good for write-heavy workloads
Writes go to memtable → flushed to sorted SSTable files
Background compaction merges SSTables
Reads may need to check multiple levels

Comparison

	B-Tree	LSM-Tree
Write speed	Moderate	Fast (sequential writes)
Read speed	Fast (one lookup)	Moderate (multiple levels)
Space amplification	Lower	Higher (before compaction)
Write amplification	Higher	Lower
Best for	Read-heavy, OLTP	Write-heavy, time-series

Resources

📖 DDIA Chapter 2: Data Models and Query Languages
📖 DDIA Chapter 3: Storage and Retrieval
📖 DDIA Chapter 7: Transactions
📖 "Database Internals" by Alex Petrov
🎥 CMU Database Group — Intro to Database Systems
🔗 Use The Index, Luke — SQL Indexing Tutorial

Previous: 03 - APIs & Communication Patterns | Next: 05 - Load Balancing & Reverse Proxies