20 - Object Storage & File Systems
Previous: 19 - Bloom Filters & Probabilistic Data Structures | Next: 21 - Search Systems
Storage Types: Block vs File vs Object
Understanding the three fundamental storage paradigms is essential for system design.
+------------------+-------------------+-------------------+
| Block Storage | File Storage | Object Storage |
+------------------+-------------------+-------------------+
| Raw disk blocks | Hierarchical | Flat namespace |
| No metadata | Directories + | Key + metadata + |
| OS manages FS | file paths | blob |
| Lowest latency | POSIX semantics | HTTP API (REST) |
+------------------+-------------------+-------------------+
| EBS, SAN, iSCSI | NFS, EFS, SMB | S3, GCS, Azure |
| | | Blob Storage |
+------------------+-------------------+-------------------+
| Property | Block | File | Object |
|---|---|---|---|
| Access pattern | Random R/W at byte level | Hierarchical path | Key-value via HTTP |
| Latency | Sub-ms | Low ms | 10-100+ ms |
| Scalability | Single server | Limited (NFS bottleneck) | Virtually unlimited |
| Metadata | None (FS adds it) | File attributes | Custom, rich metadata |
| Mutability | In-place updates | In-place updates | Immutable (replace whole object) |
| Use case | Databases, VMs | Shared files, home dirs | Media, backups, data lakes |
Amazon S3 (The De Facto Standard)
Architecture (Simplified)
Client --> S3 API (REST/HTTP)
|
+---------+---------+
| Index Layer | (maps keys to physical locations)
| (metadata DB) |
+---------+---------+
|
+---------+---------+
| Storage Layer | (distributed blob store)
| (replicated |
| across AZs) |
+-------------------+
- Objects stored as immutable blobs
- Metadata index maps bucket/key to physical storage location
- Data replicated across at least 3 Availability Zones
- 11 nines (99.999999999%) durability
Consistency Model
S3 provides strong read-after-write consistency (since December 2020):
- After a successful PUT, any subsequent GET returns the new object
- After a successful DELETE, any subsequent GET returns 404
- LIST operations reflect the latest state
Before 2020, S3 had eventual consistency for overwrites and deletes. This evolution is a good interview talking point about how real systems adapt their consistency guarantees.
Storage Classes
| Class | Availability | Min Duration | Use Case | Cost (relative) |
|---|---|---|---|---|
| Standard | 99.99% | None | Hot data, frequent access | 1x |
| Intelligent-Tiering | 99.9% | None | Unknown access patterns | ~1x + monitoring fee |
| Standard-IA | 99.9% | 30 days | Infrequent access | 0.5x storage, higher retrieval |
| One Zone-IA | 99.5% | 30 days | Reproducible infrequent data | 0.4x |
| Glacier Instant | 99.9% | 90 days | Archive with instant retrieval | 0.25x |
| Glacier Flexible | 99.99% | 90 days | Archive, minutes-to-hours retrieval | 0.1x |
| Glacier Deep Archive | 99.99% | 180 days | Long-term archive, 12-48h retrieval | 0.03x |
Lifecycle policies automate transitions:
Standard (day 0-30) --> Standard-IA (day 30-90) --> Glacier (day 90+)
Approximate Costs (per GB/month)
S3 Standard: $0.023
S3 Standard-IA: $0.0125
S3 Glacier Instant: $0.004
S3 Glacier Deep: $0.00099
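As a back-of-envelope check, here is a sketch in Python using the approximate per-GB prices above (storage cost only; it ignores request, retrieval, and transfer fees, which can dominate for cold tiers):

```python
# Rough monthly storage cost using the approximate per-GB prices above.
# Storage only -- request/retrieval/transfer fees are not modeled.
PRICES_PER_GB = {
    "standard": 0.023,
    "standard_ia": 0.0125,
    "glacier_instant": 0.004,
    "glacier_deep": 0.00099,
}

def monthly_cost(gb: float, storage_class: str) -> float:
    """USD per month to store `gb` gigabytes in the given class."""
    return gb * PRICES_PER_GB[storage_class]

hot = monthly_cost(10_000, "standard")       # 10 TB hot: ~$230/mo
cold = monthly_cost(10_000, "glacier_deep")  # 10 TB deep archive: ~$10/mo
print(f"Standard: ${hot:.2f}/mo, Deep Archive: ${cold:.2f}/mo")
```

The ~23x gap between Standard and Deep Archive is why lifecycle policies matter at scale.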
GFS / HDFS Architecture
Google File System (GFS) and the Hadoop Distributed File System (HDFS, an open-source implementation of the GFS design) share the same master-worker architecture, built to store very large files across commodity hardware.
Architecture
+------------------+
| NameNode |
| (Master/Leader) |
| - File metadata |
| - Block mapping |
| - Namespace ops |
+--------+---------+
|
+--------------+--------------+
| | |
+--------+-----+ +-----+-------+ +----+--------+
| DataNode 1 | | DataNode 2 | | DataNode 3 |
| [blk1] [blk3] | | [blk1] [blk2]| | [blk2][blk3]|
+---------------+ +--------------+ +-------------+
NameNode (Master):
- Stores all file system metadata in memory
- Maps files to blocks (typically 64 MB in GFS, 128 MB in HDFS)
- Tracks which DataNodes hold each block replica
- Single point of failure (mitigated by standby NameNode + edit log journal)
DataNode (Worker):
- Stores actual data blocks on local disks
- Sends heartbeats and block reports to NameNode
- Serves read/write requests directly to clients
Chunk Replication and Rack-Aware Placement
Default replication factor: 3 copies of every block.
Rack-aware placement strategy:
Copy 1: DataNode on Rack A (local rack for writer)
Copy 2: DataNode on Rack B (different rack for failure isolation)
Copy 3: DataNode on Rack B (same rack as copy 2 for bandwidth)
If Rack A fails entirely, copies 2 and 3 on Rack B survive.
If a single node fails, copies on the other rack survive.
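The placement rule above can be sketched in a few lines of Python (a toy model, not HDFS's actual `BlockPlacementPolicy` code):

```python
import random

def place_replicas(racks: dict[str, list[str]], writer_rack: str) -> list[str]:
    """Toy HDFS-style rack-aware placement:
    replica 1 on the writer's rack; replicas 2 and 3 together on one
    other rack (failure isolation + cheap intra-rack replication)."""
    local = random.choice(racks[writer_rack])
    other_rack = random.choice([r for r in racks if r != writer_rack])
    remote = random.sample(racks[other_rack], 2)  # two distinct nodes
    return [local] + remote

racks = {"A": ["dn1", "dn2"], "B": ["dn3", "dn4", "dn5"]}
print(place_replicas(racks, "A"))  # e.g. ['dn1', 'dn4', 'dn3']
```

Note the trade-off baked in: only two racks hold the block (saves cross-rack bandwidth on write) at the cost of slightly less spread than three racks would give.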
Write Path (Pipeline Replication)
Client NameNode DataNodes
| | |
|-- create(file) ------->| |
|<-- block locations ----| |
| | |
|-- write(block) ------->| DN1 --> DN2 --> DN3 |
| (pipeline | (replication chain) |
| replication) | |
|<-- ack (all replicas)--| |
- Client asks NameNode for block allocation
- NameNode returns ordered list of DataNodes for replicas
- Client streams data to first DataNode, which pipelines to the next
- Ack propagates back through the chain
- Client sends "block complete" to NameNode
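The pipeline can be modeled as a chain where each node stores the block and forwards it downstream, with the write acknowledged only once every replica holds it (a simplified in-memory simulation, not real HDFS wire protocol):

```python
def pipeline_write(data: bytes, datanodes: list[dict]) -> bool:
    """Toy pipeline replication: data flows DN1 -> DN2 -> DN3; each
    node stores the block and forwards it. The ack propagates back
    only after every node in the chain has the block."""
    for node in datanodes:          # forwarding chain
        node["blocks"].append(data)
    # ack: success only if all replicas are in place
    return all(data in node["blocks"] for node in datanodes)

dns = [{"id": i, "blocks": []} for i in range(3)]
ok = pipeline_write(b"block-0001", dns)
print(ok, [len(d["blocks"]) for d in dns])  # True [1, 1, 1]
```

The key property this illustrates: the client uploads the data once; replication bandwidth is paid between DataNodes, not by the client.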
Read Path
Client NameNode DataNode
| | |
|-- open(file) --------->| |
|<-- block locations ----| |
| | |
|-- read(block) --------------------------->DN1 |
|<-- data ------------------------------------ |
Client reads directly from the closest DataNode. The NameNode is never in the data path.
GFS vs HDFS Differences
| Feature | GFS | HDFS |
|---|---|---|
| Chunk size | 64 MB | 128 MB (default) |
| Append semantics | Record append (atomic, at-least-once) | Append-only (single writer) |
| Consistency | Relaxed (defined regions, possibly inconsistent) | Strong (single writer per file) |
| Master HA | Shadow masters + replicated operation log | Standby NameNode + Quorum Journal Manager |
Designing a File Storage System
Upload Flow
Client App Server Object Store (S3)
| | |
|-- "upload 500MB" ----->| |
| |-- generate pre-signed URL
|<-- pre-signed URL -----| |
| | |
|-- PUT (multipart) -------------------------------->|
| Part 1 (64 MB) ---------------------------------|
| Part 2 (64 MB) ---------------------------------|
| ... |
| Part 8 (52 MB) ---------------------------------|
|<-- complete multipart ------------------------------|
| | |
|-- "upload done" ------>| |
| |-- save metadata to DB
|<-- 200 OK ------------| |
Chunking
Break large files into fixed-size chunks for parallel transfer and storage.
Original file (500 MB):
[Chunk 0: 64MB] [Chunk 1: 64MB] ... [Chunk 7: 52MB]
Each chunk:
- Has a unique ID (content hash or UUID)
- Stored with replication factor R
- Metadata maps file_id --> ordered list of chunk_ids
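A minimal chunking sketch in Python, using the content hash as the chunk ID so identical chunks collapse to one stored copy (a tiny 4-byte chunk size here just for demonstration):

```python
import hashlib

def chunk_file(data: bytes, chunk_size: int):
    """Split a blob into fixed-size chunks. The SHA-256 of each chunk
    doubles as its chunk_id (content-addressable storage), so the
    `chunks` store automatically dedups identical chunks."""
    chunks = {}   # chunk_id -> bytes (the dedup'd store)
    order = []    # file_id -> ordered list of chunk_ids
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        cid = hashlib.sha256(chunk).hexdigest()
        chunks[cid] = chunk
        order.append(cid)
    return order, chunks

# 10 bytes with 4-byte chunks -> 3 chunks, but only 2 unique
# (the first two chunks b"aaaa" hash identically)
order, chunks = chunk_file(b"a" * 10, chunk_size=4)
print(len(order), len(chunks))  # 3 2
```

Reassembly is just concatenating `chunks[cid]` in `order` -- the metadata DB only ever stores the ordered ID list, never the bytes.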
Chunk size trade-offs:
| Size | Pros | Cons |
|---|---|---|
| Small (1-4 MB) | Fast random access, fine-grained dedup | More metadata, more chunks to track |
| Medium (64-128 MB) | Balanced; GFS/HDFS defaults | Partial reads still fetch whole chunks |
| Large (256 MB) | Fewer chunks, less metadata | Waste for small files, slow partial reads |
Deduplication with Content-Addressable Storage
chunk_id = SHA-256(chunk_data)
File A: [abc123, def456, ghi789]
File B: [abc123, xyz999, ghi789]
^ ^
Shared chunks -- stored only once
- File-level dedup: Hash entire file; catches exact duplicates only
- Chunk-level dedup: Hash each chunk; catches partial duplicates (more savings)
- Variable-size chunking (Rabin fingerprint): Better dedup ratio for shifted content, more implementation complexity
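Chunk-level dedup in miniature: two files that differ only in their first chunk share everything else (fixed 12-byte chunks and truncated hashes purely for readability):

```python
import hashlib

def chunk_ids(data: bytes, size: int) -> list[str]:
    """Content-addressed chunk IDs for fixed-size chunks."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()[:8]
            for i in range(0, len(data), size)]

file_a = b"hello world!" + b"shared-block" * 4
file_b = b"HELLO WORLD!" + b"shared-block" * 4  # only the head differs

a, b = chunk_ids(file_a, 12), chunk_ids(file_b, 12)
shared = set(a) & set(b)
total = len(set(a) | set(b))
print(f"{len(shared)}/{total} unique chunks shared")  # 1/3 unique chunks shared
```

File-level dedup would see two completely different hashes here and store both files in full; chunk-level dedup stores the shared tail once. (Fixed-size chunking still breaks if content *shifts* by a few bytes -- that is the problem Rabin-fingerprint variable-size chunking solves.)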
Pre-Signed URLs
Allow clients to upload/download directly from object storage without exposing credentials or routing traffic through app servers.
Upload flow:
1. Client --> App Server: "I want to upload photo.jpg"
2. App Server generates pre-signed PUT URL (expires in 15 min)
URL = https://bucket.s3.amazonaws.com/photo.jpg
?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Credential=...
&X-Amz-Expires=900
&X-Amz-Signature=abc123...
3. Client uploads directly to S3 using the pre-signed URL
4. S3 validates signature and accepts/rejects the upload
Download flow:
1. Client --> App Server: "I want to download report.pdf"
2. App Server checks authorization, generates pre-signed GET URL
3. Client fetches directly from S3
Why this matters:
- App server is not a bandwidth bottleneck
- Traffic flows directly between client and storage
- Fine-grained access control (specific key, specific operation, time-limited)
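The core mechanism is just an HMAC over the operation, key, and expiry, signed with a secret only the server and storage service know. A simplified illustration (real S3 uses the much more involved AWS Signature Version 4 scheme; the host and parameter names here are made up):

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-secret"   # shared with storage; never sent to clients

def presign(key: str, method: str = "PUT", expires_in: int = 900) -> str:
    """Sign (method, key, expiry) so the URL grants exactly one
    operation on one key for a limited time."""
    expiry = int(time.time()) + expires_in
    payload = f"{method}:{key}:{expiry}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    qs = urlencode({"Expires": expiry, "Signature": sig})
    return f"https://bucket.example.com/{key}?{qs}"

def verify(key: str, method: str, expiry: int, sig: str) -> bool:
    """What the storage service does on request: recompute and compare."""
    payload = f"{method}:{key}:{expiry}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return time.time() < expiry and hmac.compare_digest(sig, expected)

url = presign("photo.jpg")
print(url)
```

Any tampering -- changing the key, the method, or the expiry -- invalidates the signature. With real S3, you would call `generate_presigned_url` on a boto3 client instead of rolling this yourself.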
Multipart Upload for Large Files
For files >100 MB, upload in parts for reliability and parallelism.
1. Initiate multipart upload --> get upload_id
2. Upload parts in parallel:
Part 1 (5-64 MB) --> ETag: "aaa"
Part 2 (5-64 MB) --> ETag: "bbb"
Part 3 (remaining) --> ETag: "ccc"
3. Complete upload (send ordered list of part numbers + ETags)
4. Storage assembles the final object
On failure: retry only the failed part (not the entire file)
Benefits:
- Resume interrupted uploads (only re-upload failed parts)
- Parallel uploads from multiple threads
- Much larger objects: a single S3 PUT caps at 5 GB, while a multipart object can reach 5 TB
Cleanup: Abort incomplete multipart uploads to avoid charges. Use lifecycle rules to auto-delete after N days.
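Step 2's part split is simple arithmetic; here is a sketch of a part planner matching the 500 MB example in the upload flow above:

```python
def plan_parts(file_size: int, part_size: int = 64 * 1024 * 1024):
    """Split an upload into numbered parts: full-size parts plus a
    smaller tail. Each part can be uploaded (and retried) independently."""
    parts, offset, part_no = [], 0, 1
    while offset < file_size:
        size = min(part_size, file_size - offset)
        parts.append({"part": part_no, "offset": offset, "size": size})
        offset += size
        part_no += 1
    return parts

MB = 1024 * 1024
parts = plan_parts(500 * MB)   # 7 full 64 MB parts + one 52 MB tail
print(len(parts), parts[-1]["size"] // MB)  # 8 52
```

On a failed part, only that `{offset, size}` slice is re-read and re-sent; the completion call then submits the ordered `(part, ETag)` pairs.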
CDN Integration for Serving
User --> CDN Edge (cache) --> Origin (S3 / storage service)
|
Cache HIT: serve directly (~5-20ms)
Cache MISS: fetch from origin, cache, serve (~100ms+)
Cache invalidation strategies:
| Strategy | Mechanism | Best For |
|---|---|---|
| TTL-based | Cache-Control: max-age=86400 | Content that changes predictably |
| Versioned URLs | /img/avatar_v3.webp | Controlled releases |
| Content-hash URLs | /img/abc123def.webp | Immutable assets (cache forever) |
| Explicit purge | CDN API call to invalidate path | Emergency updates |
Best practice: Content-hash URLs for static assets (infinite cache), short TTLs for dynamic content.
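Content-hash URLs fall out of a one-liner: hash the bytes into the filename, so a new deploy produces a new URL and the old one can be cached forever (URL scheme here is illustrative):

```python
import hashlib

def asset_url(content: bytes, name: str, ext: str) -> str:
    """Content-hash URL: the URL changes iff the bytes change, so the
    CDN can serve each version with Cache-Control: immutable."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    return f"/img/{name}.{digest}.{ext}"

v1 = asset_url(b"old avatar bytes", "avatar", "webp")
v2 = asset_url(b"new avatar bytes", "avatar", "webp")
print(v1 != v2)  # True -- each version gets its own永 independently cached URL
```

No purge API calls, no TTL guessing: stale content is impossible because the old URL never changes meaning.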
Cost Optimization Strategies
| Strategy | Savings | Complexity |
|---|---|---|
| Lifecycle policies (hot -> cold -> archive) | 60-90% on aged data | Low |
| Compression (gzip, zstd before upload) | 30-70% depending on data type | Low |
| Chunk-level deduplication | 20-50% depending on overlap | Medium |
| Right-sizing storage class | 40-80% | Low |
| Deleting unused data | 100% of waste | Low (governance is hard) |
| Tiered storage (SSD hot, HDD cold) | 50-70% on cold data | Medium |
Interview Tips
- Know the three storage types cold. Block vs file vs object is a common opening question. Explain trade-offs in terms of latency, scalability, and mutability.
- S3 consistency model changed. Mentioning the 2020 strong consistency upgrade shows up-to-date knowledge.
- Explain pre-signed URLs. This pattern appears in almost every file-upload design question.
- Chunking + dedup is the core insight for any "design Dropbox/Google Drive" question.
- Don't forget cost. Storage cost optimization is a real production concern that senior-level interviewers appreciate.
- HDFS/GFS architecture is relevant for "design a distributed file system" and big data pipeline questions. Know the NameNode bottleneck and how rack-aware placement works.
Resources
- DDIA (Kleppmann) -- Chapter 3: Storage and Retrieval
- System Design Interview Vol. 1 (Alex Xu) -- Chapter on Google Drive design
- GFS Paper: "The Google File System" (Ghemawat, Gobioff, Leung, 2003)
- HDFS Architecture Guide: Apache Hadoop documentation
- AWS S3 Documentation: Consistency model, storage classes, multipart upload, pre-signed URLs
- Dropbox Engineering Blog: "Optimizing File Storage" (chunking and dedup)
- Facebook Haystack Paper: "Finding a Needle in Haystack" (photo storage at scale)