20 - Object Storage & File Systems
Previous: 19 - Bloom Filters & Probabilistic Data Structures | Next: 21 - Search Systems
Storage Types: Block vs File vs Object
Understanding the three fundamental storage paradigms is essential for system design.
+------------------+-------------------+-------------------+
| Block Storage | File Storage | Object Storage |
+------------------+-------------------+-------------------+
| Raw disk blocks | Hierarchical | Flat namespace |
| No metadata | Directories + | Key + metadata + |
| OS manages FS | file paths | blob |
| Lowest latency | POSIX semantics | HTTP API (REST) |
+------------------+-------------------+-------------------+
| EBS, SAN, iSCSI | NFS, EFS, SMB | S3, GCS, Azure |
| | | Blob Storage |
+------------------+-------------------+-------------------+
| Property | Block | File | Object |
|---|---|---|---|
| Access pattern | Random R/W at byte level | Hierarchical path | Key-value via HTTP |
| Latency | Sub-ms | Low ms | 10-100+ ms |
| Scalability | Single server | Limited (NFS bottleneck) | Virtually unlimited |
| Metadata | None (FS adds it) | File attributes | Custom, rich metadata |
| Mutability | In-place updates | In-place updates | Immutable (replace whole object) |
| Use case | Databases, VMs | Shared files, home dirs | Media, backups, data lakes |
Amazon S3 (The De Facto Standard)
Architecture (Simplified)
Client --> S3 API (REST/HTTP)
|
+---------+---------+
| Index Layer | (maps keys to physical locations)
| (metadata DB) |
+---------+---------+
|
+---------+---------+
| Storage Layer | (distributed blob store)
| (replicated |
| across AZs) |
+-------------------+
- Objects stored as immutable blobs
- Metadata index maps bucket/key to physical storage location
- Data replicated across at least 3 Availability Zones
- 11 nines (99.999999999%) durability
Consistency Model
S3 provides strong read-after-write consistency (since December 2020):
- After a successful PUT, any subsequent GET returns the new object
- After a successful DELETE, any subsequent GET returns 404
- LIST operations reflect the latest state
Before 2020, S3 had eventual consistency for overwrites and deletes. This evolution is a good interview talking point about how real systems adapt their consistency guarantees.
Storage Classes
| Class | Availability | Min Duration | Use Case | Cost (relative) |
|---|---|---|---|---|
| Standard | 99.99% | None | Hot data, frequent access | 1x |
| Intelligent-Tiering | 99.9% | None | Unknown access patterns | ~1x + monitoring fee |
| Standard-IA | 99.9% | 30 days | Infrequent access | 0.5x storage, higher retrieval |
| One Zone-IA | 99.5% | 30 days | Reproducible infrequent data | 0.4x |
| Glacier Instant | 99.9% | 90 days | Archive with instant retrieval | 0.25x |
| Glacier Flexible | 99.99% | 90 days | Archive, minutes-to-hours retrieval | 0.1x |
| Glacier Deep Archive | 99.99% | 180 days | Long-term archive, 12-48h retrieval | 0.03x |
Lifecycle policies automate transitions:
Standard (day 0-30) --> Standard-IA (day 30-90) --> Glacier (day 90+)
Approximate Costs (per GB/month)
S3 Standard: $0.023
S3 Standard-IA: $0.0125
S3 Glacier Instant: $0.004
S3 Glacier Deep: $0.00099
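As a back-of-envelope check, here is a sketch in Python using the approximate per-GB prices above (storage cost only; it ignores request, retrieval, and transfer fees, which can dominate for cold tiers):

```python
# Rough monthly storage cost using the approximate per-GB prices above.
# Storage only -- request/retrieval/transfer fees are not modeled.
PRICES_PER_GB = {
    "standard": 0.023,
    "standard_ia": 0.0125,
    "glacier_instant": 0.004,
    "glacier_deep": 0.00099,
}

def monthly_cost(gb: float, storage_class: str) -> float:
    """USD per month to store `gb` gigabytes in the given class."""
    return gb * PRICES_PER_GB[storage_class]

hot = monthly_cost(10_000, "standard")       # 10 TB hot: ~$230/mo
cold = monthly_cost(10_000, "glacier_deep")  # 10 TB deep archive: ~$10/mo
print(f"Standard: ${hot:.2f}/mo, Deep Archive: ${cold:.2f}/mo")
```

The ~23x gap between Standard and Deep Archive is why lifecycle policies matter at scale.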
GFS / HDFS Architecture
Google File System (GFS) and the Hadoop Distributed File System (HDFS, an open-source implementation of the GFS design) share the same master-worker architecture, built to store very large files across commodity hardware.
Architecture
+------------------+
| NameNode |
| (Master/Leader) |
| - File metadata |
| - Block mapping |
| - Namespace ops |
+--------+---------+
|
+--------------+--------------+
| | |
+--------+-----+ +-----+-------+ +----+--------+
| DataNode 1 | | DataNode 2 | | DataNode 3 |
| [blk1] [blk3] | | [blk1] [blk2]| | [blk2][blk3]|
+---------------+ +--------------+ +-------------+
NameNode (Master):
- Stores all file system metadata in memory
- Maps files to blocks (typically 64 MB in GFS, 128 MB in HDFS)
- Tracks which DataNodes hold each block replica
- Single point of failure (mitigated by standby NameNode + edit log journal)
DataNode (Worker):
- Stores actual data blocks on local disks
- Sends heartbeats and block reports to NameNode
- Serves read/write requests directly to clients
Chunk Replication and Rack-Aware Placement
Default replication factor: 3 copies of every block.
Rack-aware placement strategy:
Copy 1: DataNode on Rack A (local rack for writer)
Copy 2: DataNode on Rack B (different rack for failure isolation)
Copy 3: DataNode on Rack B (same rack as copy 2 for bandwidth)
If Rack A fails entirely, copies 2 and 3 on Rack B survive.
If a single node fails, copies on the other rack survive.
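The placement rule above can be sketched in a few lines of Python (a toy model, not HDFS's actual `BlockPlacementPolicy` code):

```python
import random

def place_replicas(racks: dict[str, list[str]], writer_rack: str) -> list[str]:
    """Toy HDFS-style rack-aware placement:
    replica 1 on the writer's rack; replicas 2 and 3 together on one
    other rack (failure isolation + cheap intra-rack replication)."""
    local = random.choice(racks[writer_rack])
    other_rack = random.choice([r for r in racks if r != writer_rack])
    remote = random.sample(racks[other_rack], 2)  # two distinct nodes
    return [local] + remote

racks = {"A": ["dn1", "dn2"], "B": ["dn3", "dn4", "dn5"]}
print(place_replicas(racks, "A"))  # e.g. ['dn1', 'dn4', 'dn3']
```

Note the trade-off baked in: only two racks hold the block (saves cross-rack bandwidth on write) at the cost of slightly less spread than three racks would give.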
Write Path (Pipeline Replication)
Client NameNode DataNodes
| | |
|-- create(file) ------->| |
|<-- block locations ----| |
| | |
|-- write(block) ------->| DN1 --> DN2 --> DN3 |
| (pipeline | (replication chain) |
| replication) | |
|<-- ack (all replicas)--| |
- Client asks NameNode for block allocation
- NameNode returns ordered list of DataNodes for replicas
- Client streams data to first DataNode, which pipelines to the next
- Ack propagates back through the chain
- Client sends "block complete" to NameNode
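The pipeline can be modeled as a chain where each node stores the block and forwards it downstream, with the write acknowledged only once every replica holds it (a simplified in-memory simulation, not real HDFS wire protocol):

```python
def pipeline_write(data: bytes, datanodes: list[dict]) -> bool:
    """Toy pipeline replication: data flows DN1 -> DN2 -> DN3; each
    node stores the block and forwards it. The ack propagates back
    only after every node in the chain has the block."""
    for node in datanodes:          # forwarding chain
        node["blocks"].append(data)
    # ack: success only if all replicas are in place
    return all(data in node["blocks"] for node in datanodes)

dns = [{"id": i, "blocks": []} for i in range(3)]
ok = pipeline_write(b"block-0001", dns)
print(ok, [len(d["blocks"]) for d in dns])  # True [1, 1, 1]
```

The key property this illustrates: the client uploads the data once; replication bandwidth is paid between DataNodes, not by the client.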
Read Path
Client NameNode DataNode
| | |
|-- open(file) --------->| |
|<-- block locations ----| |
| | |
|-- read(block) --------------------------->DN1 |
|<-- data ------------------------------------ |
Client reads directly from the closest DataNode. The NameNode is never in the data path.
GFS vs HDFS Differences
| Feature | GFS | HDFS |
|---|---|---|
| Chunk size | 64 MB | 128 MB (default) |
| Append semantics | Record append (atomic, at-least-once) | Append-only (single writer) |
| Consistency | Relaxed (defined regions, possibly inconsistent) | Strong (single writer per file) |
| Master HA | Shadow masters + replicated operation log | Standby NameNode + Quorum Journal Manager |
Designing a File Storage System
Upload Flow
Client App Server Object Store (S3)
| | |
|-- "upload 500MB" ----->| |
| |-- generate pre-signed URL
|<-- pre-signed URL -----| |
| | |
|-- PUT (multipart) -------------------------------->|
| Part 1 (64 MB) ---------------------------------|
| Part 2 (64 MB) ---------------------------------|
| ... |
| Part 8 (52 MB) ---------------------------------|
|<-- complete multipart ------------------------------|
| | |
|-- "upload done" ------>| |
| |-- save metadata to DB
|<-- 200 OK ------------| |
Chunking
Break large files into fixed-size chunks for parallel transfer and storage.
Original file (500 MB):
[Chunk 0: 64MB] [Chunk 1: 64MB] ... [Chunk 7: 52MB]
Each chunk:
- Has a unique ID (content hash or UUID)
- Stored with replication factor R
- Metadata maps file_id --> ordered list of chunk_ids
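A minimal chunking sketch in Python, using the content hash as the chunk ID so identical chunks collapse to one stored copy (a tiny 4-byte chunk size here just for demonstration):

```python
import hashlib

def chunk_file(data: bytes, chunk_size: int):
    """Split a blob into fixed-size chunks. The SHA-256 of each chunk
    doubles as its chunk_id (content-addressable storage), so the
    `chunks` store automatically dedups identical chunks."""
    chunks = {}   # chunk_id -> bytes (the dedup'd store)
    order = []    # file_id -> ordered list of chunk_ids
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        cid = hashlib.sha256(chunk).hexdigest()
        chunks[cid] = chunk
        order.append(cid)
    return order, chunks

# 10 bytes with 4-byte chunks -> 3 chunks, but only 2 unique
# (the first two chunks b"aaaa" hash identically)
order, chunks = chunk_file(b"a" * 10, chunk_size=4)
print(len(order), len(chunks))  # 3 2
```

Reassembly is just concatenating `chunks[cid]` in `order` -- the metadata DB only ever stores the ordered ID list, never the bytes.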
Chunk size trade-offs:
| Size | Pros | Cons |
|---|---|---|
| Small (1-4 MB) | Fast random access, fine-grained dedup | More metadata, more chunks to track |
| Medium (64-128 MB) | Balanced; GFS/HDFS defaults | Partial reads still fetch whole chunks |
| Large (256 MB) | Fewer chunks, less metadata | Waste for small files, slow partial reads |
Deduplication with Content-Addressable Storage
chunk_id = SHA-256(chunk_data)
File A: [abc123, def456, ghi789]
File B: [abc123, xyz999, ghi789]
^ ^
Shared chunks -- stored only once
- File-level dedup: Hash entire file; catches exact duplicates only
- Chunk-level dedup: Hash each chunk; catches partial duplicates (more savings)
- Variable-size chunking (Rabin fingerprint): Better dedup ratio for shifted content, more implementation complexity
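Chunk-level dedup in miniature: two files that differ only in their first chunk share everything else (fixed 12-byte chunks and truncated hashes purely for readability):

```python
import hashlib

def chunk_ids(data: bytes, size: int) -> list[str]:
    """Content-addressed chunk IDs for fixed-size chunks."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()[:8]
            for i in range(0, len(data), size)]

file_a = b"hello world!" + b"shared-block" * 4
file_b = b"HELLO WORLD!" + b"shared-block" * 4  # only the head differs

a, b = chunk_ids(file_a, 12), chunk_ids(file_b, 12)
shared = set(a) & set(b)
total = len(set(a) | set(b))
print(f"{len(shared)}/{total} unique chunks shared")  # 1/3 unique chunks shared
```

File-level dedup would see two completely different hashes here and store both files in full; chunk-level dedup stores the shared tail once. (Fixed-size chunking still breaks if content *shifts* by a few bytes -- that is the problem Rabin-fingerprint variable-size chunking solves.)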
Pre-Signed URLs
Allow clients to upload/download directly from object storage without exposing credentials or routing traffic through app servers.
Upload flow:
1. Client --> App Server: "I want to upload photo.jpg"
2. App Server generates pre-signed PUT URL (expires in 15 min)
URL = https://bucket.s3.amazonaws.com/photo.jpg
?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Credential=...
&X-Amz-Expires=900
&X-Amz-Signature=abc123...
3. Client uploads directly to S3 using the pre-signed URL
4. S3 validates signature and accepts/rejects the upload
Download flow:
1. Client --> App Server: "I want to download report.pdf"
2. App Server checks authorization, generates pre-signed GET URL
3. Client fetches directly from S3
Why this matters:
- App server is not a bandwidth bottleneck
- Traffic flows directly between client and storage
- Fine-grained access control (specific key, specific operation, time-limited)
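The core mechanism is just an HMAC over the operation, key, and expiry, signed with a secret only the server and storage service know. A simplified illustration (real S3 uses the much more involved AWS Signature Version 4 scheme; the host and parameter names here are made up):

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-secret"   # shared with storage; never sent to clients

def presign(key: str, method: str = "PUT", expires_in: int = 900) -> str:
    """Sign (method, key, expiry) so the URL grants exactly one
    operation on one key for a limited time."""
    expiry = int(time.time()) + expires_in
    payload = f"{method}:{key}:{expiry}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    qs = urlencode({"Expires": expiry, "Signature": sig})
    return f"https://bucket.example.com/{key}?{qs}"

def verify(key: str, method: str, expiry: int, sig: str) -> bool:
    """What the storage service does on request: recompute and compare."""
    payload = f"{method}:{key}:{expiry}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return time.time() < expiry and hmac.compare_digest(sig, expected)

url = presign("photo.jpg")
print(url)
```

Any tampering -- changing the key, the method, or the expiry -- invalidates the signature. With real S3, you would call `generate_presigned_url` on a boto3 client instead of rolling this yourself.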
Multipart Upload for Large Files
For files >100 MB, upload in parts for reliability and parallelism.
1. Initiate multipart upload --> get upload_id
2. Upload parts in parallel:
Part 1 (5-64 MB) --> ETag: "aaa"
Part 2 (5-64 MB) --> ETag: "bbb"
Part 3 (remaining) --> ETag: "ccc"
3. Complete upload (send ordered list of part numbers + ETags)
4. Storage assembles the final object
On failure: retry only the failed part (not the entire file)
Benefits:
- Resume interrupted uploads (only re-upload failed parts)
- Parallel uploads from multiple threads
- Much larger objects: a single S3 PUT caps at 5 GB, while a multipart object can reach 5 TB
Cleanup: Abort incomplete multipart uploads to avoid charges. Use lifecycle rules to auto-delete after N days.
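Step 2's part split is simple arithmetic; here is a sketch of a part planner matching the 500 MB example in the upload flow above:

```python
def plan_parts(file_size: int, part_size: int = 64 * 1024 * 1024):
    """Split an upload into numbered parts: full-size parts plus a
    smaller tail. Each part can be uploaded (and retried) independently."""
    parts, offset, part_no = [], 0, 1
    while offset < file_size:
        size = min(part_size, file_size - offset)
        parts.append({"part": part_no, "offset": offset, "size": size})
        offset += size
        part_no += 1
    return parts

MB = 1024 * 1024
parts = plan_parts(500 * MB)   # 7 full 64 MB parts + one 52 MB tail
print(len(parts), parts[-1]["size"] // MB)  # 8 52
```

On a failed part, only that `{offset, size}` slice is re-read and re-sent; the completion call then submits the ordered `(part, ETag)` pairs.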
CDN Integration for Serving
User --> CDN Edge (cache) --> Origin (S3 / storage service)
|
Cache HIT: serve directly (~5-20ms)
Cache MISS: fetch from origin, cache, serve (~100ms+)
Cache invalidation strategies:
| Strategy | Mechanism | Best For |
|---|---|---|
| TTL-based | Cache-Control: max-age=86400 | Content that changes predictably |
| Versioned URLs | /img/avatar_v3.webp | Controlled releases |
| Content-hash URLs | /img/abc123def.webp | Immutable assets (cache forever) |
| Explicit purge | CDN API call to invalidate path | Emergency updates |
Best practice: Content-hash URLs for static assets (infinite cache), short TTLs for dynamic content.
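Content-hash URLs fall out of a one-liner: hash the bytes into the filename, so a new deploy produces a new URL and the old one can be cached forever (URL scheme here is illustrative):

```python
import hashlib

def asset_url(content: bytes, name: str, ext: str) -> str:
    """Content-hash URL: the URL changes iff the bytes change, so the
    CDN can serve each version with Cache-Control: immutable."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    return f"/img/{name}.{digest}.{ext}"

v1 = asset_url(b"old avatar bytes", "avatar", "webp")
v2 = asset_url(b"new avatar bytes", "avatar", "webp")
print(v1 != v2)  # True -- each version gets its own永 independently cached URL
```

No purge API calls, no TTL guessing: stale content is impossible because the old URL never changes meaning.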
Cost Optimization Strategies
| Strategy | Savings | Complexity |
|---|---|---|
| Lifecycle policies (hot -> cold -> archive) | 60-90% on aged data | Low |
| Compression (gzip, zstd before upload) | 30-70% depending on data type | Low |
| Chunk-level deduplication | 20-50% depending on overlap | Medium |
| Right-sizing storage class | 40-80% | Low |
| Deleting unused data | 100% of waste | Low (governance is hard) |
| Tiered storage (SSD hot, HDD cold) | 50-70% on cold data | Medium |
Interview Tips
- Know the three storage types cold. Block vs file vs object is a common opening question. Explain trade-offs in terms of latency, scalability, and mutability.
- S3 consistency model changed. Mentioning the 2020 strong consistency upgrade shows up-to-date knowledge.
- Explain pre-signed URLs. This pattern appears in almost every file-upload design question.
- Chunking + dedup is the core insight for any "design Dropbox/Google Drive" question.
- Don't forget cost. Storage cost optimization is a real production concern that senior-level interviewers appreciate.
- HDFS/GFS architecture is relevant for "design a distributed file system" and big data pipeline questions. Know the NameNode bottleneck and how rack-aware placement works.
Resources
- DDIA (Kleppmann) -- Chapter 3: Storage and Retrieval
- System Design Interview Vol. 1 (Alex Xu) -- Chapter on Google Drive design
- GFS Paper: "The Google File System" (Ghemawat, Gobioff, Leung, 2003)
- HDFS Architecture Guide: Apache Hadoop documentation
- AWS S3 Documentation: Consistency model, storage classes, multipart upload, pre-signed URLs
- Dropbox Engineering Blog: "Optimizing File Storage" (chunking and dedup)
- Facebook Haystack Paper: "Finding a Needle in Haystack" (photo storage at scale)