Data Storage Fundamentals
Why This Matters
"Which database should we use?" comes up in every system design interview. The answer is always "it depends" — and you need to explain what it depends on.
SQL (Relational Databases)
Core Concepts
- Tables with rows and columns, fixed schema
- Relations via foreign keys (1:1, 1:N, M:N)
- SQL — declarative query language
- ACID transactions
ACID Properties
| Property | Meaning | Example |
|---|---|---|
| Atomicity | All or nothing — transaction fully commits or fully rolls back | Bank transfer: debit AND credit both happen or neither |
| Consistency | Data satisfies all constraints after transaction | Balance never goes negative if constraint exists |
| Isolation | Concurrent transactions don't interfere | Two withdrawals don't see stale balance |
| Durability | Committed data survives crashes | Write-ahead log ensures data persists |
Isolation Levels
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Performance |
|---|---|---|---|---|
| Read Uncommitted | ✅ possible | ✅ possible | ✅ possible | Fastest |
| Read Committed | ❌ prevented | ✅ possible | ✅ possible | Fast |
| Repeatable Read | ❌ prevented | ❌ prevented | ✅ possible | Medium |
| Serializable | ❌ prevented | ❌ prevented | ❌ prevented | Slowest |
Most production systems use Read Committed (PostgreSQL default) or Repeatable Read (MySQL InnoDB default).
When to Use SQL
- Complex queries with JOINs
- Transactions are critical (financial data, inventory)
- Data has clear relationships
- Need strong consistency
- Schema is well-defined and stable
Popular SQL Databases
| Database | Known For | Used By |
|---|---|---|
| PostgreSQL | Feature-rich, extensible, JSONB | Apple, Instagram, Uber |
| MySQL | Battle-tested, fast reads | Meta, Airbnb, GitHub |
| Amazon Aurora | MySQL/PG compatible, cloud-native | Many AWS users |
| Google Spanner | Globally distributed SQL | |
| CockroachDB | Distributed SQL, PostgreSQL wire-compatible |
NoSQL Databases
BASE Properties (Contrast with ACID)
| Property | Meaning |
|---|---|
| Basically Available | System guarantees availability (may return stale data) |
| Soft state | State may change over time even without new input |
| Eventually consistent | System will converge to consistent state eventually |
Types of NoSQL
1. Key-Value Stores
Key → Value (opaque blob)
"user:123" → { serialized user data }
- Use cases: Session storage, caching, simple lookups
- Examples: Redis, Memcached, DynamoDB, etcd
- Pros: Extremely fast, simple, horizontally scalable
- Cons: No complex queries, no relationships
2. Document Stores
json{ "_id": "user_123", "name": "Zineddine", "orders": [ { "id": "ord_1", "total": 59.99 }, { "id": "ord_2", "total": 129.00 } ] }
- Use cases: Content management, user profiles, catalogs
- Examples: MongoDB, CouchDB, Amazon DocumentDB
- Pros: Flexible schema, natural JSON mapping, nested data
- Cons: No JOINs (denormalize), transaction support varies
3. Wide-Column Stores
Row Key → { Column Family: { column: value, ... } }
"user:123" → { "profile": { "name": "Zineddine", "age": "28" },
"activity": { "last_login": "2024-01-15", ... } }
- Use cases: Time-series, IoT, analytics, write-heavy workloads
- Examples: Cassandra, HBase, Google Bigtable
- Pros: Massive write throughput, linear scalability
- Cons: Limited query flexibility, eventual consistency
4. Graph Databases
(User:Zineddine)-[:FOLLOWS]->(User:John)
(User:John)-[:LIKES]->(Post:42)
- Use cases: Social networks, recommendations, fraud detection, knowledge graphs
- Examples: Neo4j, Amazon Neptune, JanusGraph
- Pros: Efficient traversal of relationships
- Cons: Not great for non-graph queries, scaling is harder
SQL vs NoSQL Decision Framework
| Factor | Choose SQL | Choose NoSQL |
|---|---|---|
| Data model | Well-structured, relational | Flexible, denormalized |
| Query patterns | Complex queries, ad-hoc | Simple lookups, known patterns |
| Consistency | Strong consistency needed | Eventual consistency acceptable |
| Scale | Vertical + read replicas | Horizontal (built-in sharding) |
| Schema | Stable, well-defined | Evolving, schema-less |
| Transactions | Multi-row ACID needed | Single-document atomic ops OK |
| Examples | Banking, ERP, e-commerce inventory | Social feeds, IoT, caching, logs |
Interview tip: Don't say "NoSQL is better for scale." Many SQL databases scale well. Be specific about what NoSQL gives you for your use case.
Data Modeling Concepts
Normalization (SQL)
- 1NF: Atomic values, no repeating groups
- 2NF: 1NF + no partial dependencies
- 3NF: 2NF + no transitive dependencies
Trade-off: Normalized data avoids redundancy but requires JOINs. Denormalized data is faster to read but harder to maintain.
Denormalization (NoSQL / Read-Heavy Systems)
- Embed related data in the same document/row
- Trade write complexity for read performance
- Accept data duplication for speed
Example — Order with embedded items:
json{ "order_id": "123", "user_name": "Zineddine", // duplicated from users "items": [ { "product_name": "Keyboard", "price": 89.99 }, { "product_name": "Mouse", "price": 49.99 } ], "total": 139.98 }
Data Access Patterns
Always start from how you'll query, then design the schema:
- What are the primary queries?
- What are the read/write ratios?
- What data is accessed together?
- How will data grow over time?
Storage Engines (Interview Bonus)
B-Tree (Traditional)
- Used by PostgreSQL, MySQL InnoDB
- Good for read-heavy workloads
- O(log n) reads and writes
- In-place updates
LSM-Tree (Log-Structured Merge-Tree)
- Used by Cassandra, RocksDB, LevelDB, HBase
- Good for write-heavy workloads
- Writes go to memtable → flushed to sorted SSTable files
- Background compaction merges SSTables
- Reads may need to check multiple levels
Comparison
| B-Tree | LSM-Tree | |
|---|---|---|
| Write speed | Moderate | Fast (sequential writes) |
| Read speed | Fast (one lookup) | Moderate (multiple levels) |
| Space amplification | Lower | Higher (before compaction) |
| Write amplification | Higher | Lower |
| Best for | Read-heavy, OLTP | Write-heavy, time-series |
Resources
- 📖 DDIA Chapter 2: Data Models and Query Languages
- 📖 DDIA Chapter 3: Storage and Retrieval
- 📖 DDIA Chapter 7: Transactions
- 📖 "Database Internals" by Alex Petrov
- 🎥 CMU Database Group — Intro to Database Systems
- 🔗 Use The Index, Luke — SQL Indexing Tutorial
Previous: 03 - APIs & Communication Patterns | Next: 05 - Load Balancing & Reverse Proxies