Data Storage Fundamentals

Why This Matters

"Which database should we use?" comes up in every system design interview. The answer is always "it depends" — and you need to explain what it depends on.


SQL (Relational Databases)

Core Concepts

  • Tables with rows and columns, fixed schema
  • Relations via foreign keys (1:1, 1:N, M:N)
  • SQL — declarative query language
  • ACID transactions

ACID Properties

PropertyMeaningExample
AtomicityAll or nothing — transaction fully commits or fully rolls backBank transfer: debit AND credit both happen or neither
ConsistencyData satisfies all constraints after transactionBalance never goes negative if constraint exists
IsolationConcurrent transactions don't interfereTwo withdrawals don't see stale balance
DurabilityCommitted data survives crashesWrite-ahead log ensures data persists

Isolation Levels

LevelDirty ReadNon-Repeatable ReadPhantom ReadPerformance
Read Uncommitted✅ possible✅ possible✅ possibleFastest
Read Committed❌ prevented✅ possible✅ possibleFast
Repeatable Read❌ prevented❌ prevented✅ possibleMedium
Serializable❌ prevented❌ prevented❌ preventedSlowest

Most production systems use Read Committed (PostgreSQL default) or Repeatable Read (MySQL InnoDB default).

When to Use SQL

  • Complex queries with JOINs
  • Transactions are critical (financial data, inventory)
  • Data has clear relationships
  • Need strong consistency
  • Schema is well-defined and stable

Popular SQL Databases

DatabaseKnown ForUsed By
PostgreSQLFeature-rich, extensible, JSONBApple, Instagram, Uber
MySQLBattle-tested, fast readsMeta, Airbnb, GitHub
Amazon AuroraMySQL/PG compatible, cloud-nativeMany AWS users
Google SpannerGlobally distributed SQLGoogle
CockroachDBDistributed SQL, PostgreSQL wire-compatible

NoSQL Databases

BASE Properties (Contrast with ACID)

PropertyMeaning
Basically AvailableSystem guarantees availability (may return stale data)
Soft stateState may change over time even without new input
Eventually consistentSystem will converge to consistent state eventually

Types of NoSQL

1. Key-Value Stores

Key → Value (opaque blob)
"user:123" → { serialized user data }
  • Use cases: Session storage, caching, simple lookups
  • Examples: Redis, Memcached, DynamoDB, etcd
  • Pros: Extremely fast, simple, horizontally scalable
  • Cons: No complex queries, no relationships

2. Document Stores

json
{ "_id": "user_123", "name": "Zineddine", "orders": [ { "id": "ord_1", "total": 59.99 }, { "id": "ord_2", "total": 129.00 } ] }
  • Use cases: Content management, user profiles, catalogs
  • Examples: MongoDB, CouchDB, Amazon DocumentDB
  • Pros: Flexible schema, natural JSON mapping, nested data
  • Cons: No JOINs (denormalize), transaction support varies

3. Wide-Column Stores

Row Key → { Column Family: { column: value, ... } }
"user:123" → { "profile": { "name": "Zineddine", "age": "28" },
                "activity": { "last_login": "2024-01-15", ... } }
  • Use cases: Time-series, IoT, analytics, write-heavy workloads
  • Examples: Cassandra, HBase, Google Bigtable
  • Pros: Massive write throughput, linear scalability
  • Cons: Limited query flexibility, eventual consistency

4. Graph Databases

(User:Zineddine)-[:FOLLOWS]->(User:John)
(User:John)-[:LIKES]->(Post:42)
  • Use cases: Social networks, recommendations, fraud detection, knowledge graphs
  • Examples: Neo4j, Amazon Neptune, JanusGraph
  • Pros: Efficient traversal of relationships
  • Cons: Not great for non-graph queries, scaling is harder

SQL vs NoSQL Decision Framework

FactorChoose SQLChoose NoSQL
Data modelWell-structured, relationalFlexible, denormalized
Query patternsComplex queries, ad-hocSimple lookups, known patterns
ConsistencyStrong consistency neededEventual consistency acceptable
ScaleVertical + read replicasHorizontal (built-in sharding)
SchemaStable, well-definedEvolving, schema-less
TransactionsMulti-row ACID neededSingle-document atomic ops OK
ExamplesBanking, ERP, e-commerce inventorySocial feeds, IoT, caching, logs

Interview tip: Don't say "NoSQL is better for scale." Many SQL databases scale well. Be specific about what NoSQL gives you for your use case.


Data Modeling Concepts

Normalization (SQL)

  • 1NF: Atomic values, no repeating groups
  • 2NF: 1NF + no partial dependencies
  • 3NF: 2NF + no transitive dependencies

Trade-off: Normalized data avoids redundancy but requires JOINs. Denormalized data is faster to read but harder to maintain.

Denormalization (NoSQL / Read-Heavy Systems)

  • Embed related data in the same document/row
  • Trade write complexity for read performance
  • Accept data duplication for speed

Example — Order with embedded items:

json
{ "order_id": "123", "user_name": "Zineddine", // duplicated from users "items": [ { "product_name": "Keyboard", "price": 89.99 }, { "product_name": "Mouse", "price": 49.99 } ], "total": 139.98 }

Data Access Patterns

Always start from how you'll query, then design the schema:

  1. What are the primary queries?
  2. What are the read/write ratios?
  3. What data is accessed together?
  4. How will data grow over time?

Storage Engines (Interview Bonus)

B-Tree (Traditional)

  • Used by PostgreSQL, MySQL InnoDB
  • Good for read-heavy workloads
  • O(log n) reads and writes
  • In-place updates

LSM-Tree (Log-Structured Merge-Tree)

  • Used by Cassandra, RocksDB, LevelDB, HBase
  • Good for write-heavy workloads
  • Writes go to memtable → flushed to sorted SSTable files
  • Background compaction merges SSTables
  • Reads may need to check multiple levels

Comparison

B-TreeLSM-Tree
Write speedModerateFast (sequential writes)
Read speedFast (one lookup)Moderate (multiple levels)
Space amplificationLowerHigher (before compaction)
Write amplificationHigherLower
Best forRead-heavy, OLTPWrite-heavy, time-series

Resources


Previous: 03 - APIs & Communication Patterns | Next: 05 - Load Balancing & Reverse Proxies