20 - Object Storage & File Systems

Previous: 19 - Bloom Filters & Probabilistic Data Structures | Next: 21 - Search Systems


Storage Types: Block vs File vs Object

Understanding the three fundamental storage paradigms is essential for system design.

+------------------+-------------------+-------------------+
|   Block Storage  |   File Storage    |  Object Storage   |
+------------------+-------------------+-------------------+
|  Raw disk blocks |  Hierarchical     |  Flat namespace   |
|  No metadata     |  Directories +    |  Key + metadata + |
|  OS manages FS   |  file paths       |  blob             |
|  Lowest latency  |  POSIX semantics  |  HTTP API (REST)  |
+------------------+-------------------+-------------------+
|  EBS, SAN, iSCSI |  NFS, EFS, SMB    |  S3, GCS, Azure   |
|                  |                   |  Blob Storage     |
+------------------+-------------------+-------------------+
Property       | Block                    | File                     | Object
---------------|--------------------------|--------------------------|----------------------------------
Access pattern | Random R/W at byte level | Hierarchical path        | Key-value via HTTP
Latency        | Sub-ms                   | Low ms                   | 10-100+ ms
Scalability    | Single server            | Limited (NFS bottleneck) | Virtually unlimited
Metadata       | None (FS adds it)        | File attributes          | Custom, rich metadata
Mutability     | In-place updates         | In-place updates         | Immutable (replace whole object)
Use case       | Databases, VMs           | Shared files, home dirs  | Media, backups, data lakes

Amazon S3 (The De Facto Standard)

Architecture (Simplified)

Client --> S3 API (REST/HTTP)
              |
    +--------------------+
    |    Index Layer     |  (maps keys to physical locations)
    |   (metadata DB)    |
    +--------------------+
              |
    +--------------------+
    |   Storage Layer    |  (distributed blob store)
    |   (replicated      |
    |    across AZs)     |
    +--------------------+
  • Objects stored as immutable blobs
  • Metadata index maps bucket/key to physical storage location
  • Data replicated across at least 3 Availability Zones
  • 11 nines (99.999999999%) durability
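
That durability figure can be sanity-checked with quick arithmetic. A rough sketch, assuming an independent 1e-11 annual loss probability per object (a simplification; real correlated failures don't strictly satisfy this):

```python
# What "11 nines" implies: with an annual per-object loss probability of
# ~1e-11 (losses assumed independent -- a simplification), even a very
# large bucket expects almost no losses per year.
p_loss = 1 - 0.99999999999            # ~1e-11 per object per year
objects = 10_000_000_000              # 10 billion stored objects
expected_losses = objects * p_loss
print(f"~{expected_losses:.2f} objects lost per year")
```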

Consistency Model

S3 provides strong read-after-write consistency (since December 2020):

  • After a successful PUT, any subsequent GET returns the new object
  • After a successful DELETE, any subsequent GET returns 404
  • LIST operations reflect the latest state

Before 2020, S3 had eventual consistency for overwrites and deletes. This evolution is a good interview talking point about how real systems adapt their consistency guarantees.

Storage Classes

Class                | Availability | Min Duration | Use Case                             | Cost (relative)
---------------------|--------------|--------------|--------------------------------------|-------------------------------
Standard             | 99.99%       | None         | Hot data, frequent access            | 1x
Intelligent-Tiering  | 99.9%        | None         | Unknown access patterns              | ~1x + monitoring fee
Standard-IA          | 99.9%        | 30 days      | Infrequent access                    | 0.5x storage, higher retrieval
One Zone-IA          | 99.5%        | 30 days      | Reproducible infrequent data         | 0.4x
Glacier Instant      | 99.9%        | 90 days      | Archive with instant retrieval       | 0.25x
Glacier Flexible     | 99.99%       | 90 days      | Archive, minutes-to-hours retrieval  | 0.1x
Glacier Deep Archive | 99.99%       | 180 days     | Long-term archive, 12-48 h retrieval | 0.03x

Lifecycle policies automate transitions:

Standard (day 0-30) --> Standard-IA (day 30-90) --> Glacier (day 90+)
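
Such a policy is expressed declaratively. A sketch of the transition above as a lifecycle configuration dict (the shape boto3's put_bucket_lifecycle_configuration accepts; the "logs/" prefix and rule ID are invented for illustration):

```python
# Hedged sketch of an S3 lifecycle rule implementing the
# Standard -> Standard-IA -> Glacier transition above.
lifecycle = {
    "Rules": [{
        "ID": "age-out-logs",            # example rule name
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},   # example key prefix
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
    }]
}
print(lifecycle["Rules"][0]["Transitions"][1]["StorageClass"])
```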

Approximate Costs (per GB/month)

S3 Standard:          $0.023
S3 Standard-IA:       $0.0125
S3 Glacier Instant:   $0.004
S3 Glacier Deep:      $0.00099
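
At these rates, class choice is easy to quantify. A quick sketch of storage-only cost for 10 TB (ignoring request, retrieval, and transfer charges, which can dominate for cold tiers):

```python
# Monthly storage cost of 10 TB at each class's per-GB price listed above.
# Storage only -- request/retrieval/transfer fees are ignored here.
prices_per_gb = {
    "S3 Standard":        0.023,
    "S3 Standard-IA":     0.0125,
    "S3 Glacier Instant": 0.004,
    "S3 Glacier Deep":    0.00099,
}
size_gb = 10 * 1024  # 10 TB
for cls, p in prices_per_gb.items():
    print(f"{cls:20s} ${size_gb * p:8.2f}/month")
```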

GFS / HDFS Architecture

Google File System (GFS) and the Hadoop Distributed File System (HDFS), its open-source descendant, share the same master-worker architecture, designed to store very large files across commodity hardware.

Architecture

                    +------------------+
                    |    NameNode      |
                    |  (Master/Leader) |
                    |  - File metadata |
                    |  - Block mapping |
                    |  - Namespace ops |
                    +--------+---------+
                             |
              +--------------+--------------+
              |              |              |
     +--------+-----+ +-----+-------+ +----+--------+
     |  DataNode 1   | |  DataNode 2  | |  DataNode 3 |
     | [blk1] [blk3] | | [blk1] [blk2]| | [blk2][blk3]|
     +---------------+ +--------------+ +-------------+

NameNode (Master):

  • Stores all file system metadata in memory
  • Maps files to blocks (typically 64 MB in GFS, 128 MB in HDFS)
  • Tracks which DataNodes hold each block replica
  • Single point of failure (mitigated by standby NameNode + edit log journal)

DataNode (Worker):

  • Stores actual data blocks on local disks
  • Sends heartbeats and block reports to NameNode
  • Serves read/write requests directly to clients
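
The split of responsibilities above can be sketched as a toy in-memory NameNode (a simplification, not a real HDFS API; block IDs and sizes are illustrative):

```python
import time

class NameNode:
    """Toy sketch of NameNode in-memory state: namespace, block
    mapping, and DataNode liveness tracking."""
    def __init__(self, block_size=128 * 1024 * 1024):   # 128 MB blocks
        self.block_size = block_size
        self.file_to_blocks = {}   # path -> ordered list of block IDs
        self.block_to_nodes = {}   # block ID -> set of DataNode IDs
        self.last_heartbeat = {}   # DataNode ID -> last heartbeat time

    def allocate_blocks(self, path, file_size):
        n = -(-file_size // self.block_size)            # ceil division
        blocks = [f"{path}#blk{i}" for i in range(n)]
        self.file_to_blocks[path] = blocks
        return blocks

    def report_block(self, node_id, block_id):
        """DataNodes report which blocks they hold (block report)."""
        self.block_to_nodes.setdefault(block_id, set()).add(node_id)

    def heartbeat(self, node_id):
        self.last_heartbeat[node_id] = time.time()

nn = NameNode()
blocks = nn.allocate_blocks("/data/events.log", 300 * 1024 * 1024)
print(len(blocks))  # 300 MB at 128 MB/block -> 3 blocks
```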

Chunk Replication and Rack-Aware Placement

Default replication factor: 3 copies of every block.

Rack-aware placement strategy:
  Copy 1: DataNode on Rack A  (local rack for writer)
  Copy 2: DataNode on Rack B  (different rack for failure isolation)
  Copy 3: DataNode on Rack B  (same rack as copy 2 for bandwidth)

If Rack A fails entirely, copies 2 and 3 on Rack B survive.
If a single node fails, copies on the other rack survive.
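
The placement strategy can be sketched as a small function (a simplification: real HDFS also weighs node load and free space; the rack and node names here are invented):

```python
import random

def place_replicas(writer_rack, racks):
    """Rack-aware placement sketch: copy 1 on the writer's rack,
    copies 2 and 3 together on one different rack -- survives a
    whole-rack failure while limiting cross-rack traffic."""
    other_rack = random.choice([r for r in racks if r != writer_rack])
    picks = [random.choice(racks[writer_rack])]   # copy 1: local rack
    picks += random.sample(racks[other_rack], 2)  # copies 2+3: remote rack
    return picks

racks = {
    "A": ["dn1", "dn2"],
    "B": ["dn3", "dn4"],
    "C": ["dn5", "dn6"],
}
print(place_replicas("A", racks))
```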

Write Path (Pipeline Replication)

Client                  NameNode              DataNodes
  |                        |                      |
  |-- create(file) ------->|                      |
  |<-- block locations ----|                      |
  |                        |                      |
  |-- write(block) ------->| DN1 --> DN2 --> DN3  |
  |   (pipeline            | (replication chain)  |
  |    replication)        |                      |
  |<-- ack (all replicas)--|                      |
  1. Client asks NameNode for block allocation
  2. NameNode returns ordered list of DataNodes for replicas
  3. Client streams data to first DataNode, which pipelines to the next
  4. Ack propagates back through the chain
  5. Client sends "block complete" to NameNode
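
Steps 3-4 can be sketched as a toy pipeline (in-memory dicts stand in for DataNode disks; no real networking or NameNode involved):

```python
def pipeline_write(data, datanodes, stores):
    """Pipeline-replication sketch: each node stores the block, forwards
    it to the next node in the chain, then passes the downstream ack
    back toward the client."""
    head, *rest = datanodes
    stores[head].append(data)                      # persist locally
    if rest:
        ack = pipeline_write(data, rest, stores)   # forward downstream
        return ack + [head]                        # ack flows back up
    return [head]

stores = {"dn1": [], "dn2": [], "dn3": []}
acks = pipeline_write(b"block-0 bytes", ["dn1", "dn2", "dn3"], stores)
print(acks)  # acks propagate back through the chain: dn3, dn2, dn1
```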

Read Path

Client                  NameNode              DataNode
  |                        |                      |
  |-- open(file) --------->|                      |
  |<-- block locations ----|                      |
  |                        |                      |
  |-- read(block) --------------------------->DN1 |
  |<-- data ------------------------------------  |

Client reads directly from the closest DataNode. The NameNode is never in the data path.

GFS vs HDFS Differences

Feature          | GFS                                              | HDFS
-----------------|--------------------------------------------------|-------------------------------------------
Chunk size       | 64 MB                                            | 128 MB (default)
Append semantics | Record append (atomic, at-least-once)            | Append-only (single writer)
Consistency      | Relaxed (defined regions, possibly inconsistent) | Strong (single writer per file)
Master HA        | Chubby lock for master election + shadow masters | Standby NameNode + Quorum Journal Manager

Designing a File Storage System

Upload Flow

Client                  App Server           Object Store (S3)
  |                        |                      |
  |-- "upload 500MB" ----->|                      |
  |                        |-- generate pre-signed URL
  |<-- pre-signed URL -----|                      |
  |                        |                      |
  |-- PUT (multipart) -------------------------------->|
  |    Part 1 (64 MB) ---------------------------------|
  |    Part 2 (64 MB) ---------------------------------|
  |    ...                                              |
  |    Part 8 (52 MB) ---------------------------------|
  |<-- complete multipart ------------------------------|
  |                        |                      |
  |-- "upload done" ------>|                      |
  |                        |-- save metadata to DB
  |<-- 200 OK ------------|                      |

Chunking

Break large files into fixed-size chunks for parallel transfer and storage.

Original file (500 MB):
  [Chunk 0: 64MB] [Chunk 1: 64MB] ... [Chunk 7: 52MB]

Each chunk:
  - Has a unique ID (content hash or UUID)
  - Stored with replication factor R
  - Metadata maps file_id --> ordered list of chunk_ids
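
Fixed-size chunking is a few lines in practice. A sketch, scaled down to KB so the demo stays small (same 500/64 ratio as the MB example above):

```python
import io

def chunk_stream(f, chunk_size):
    """Split a file-like object into fixed-size chunks; the last
    chunk may be smaller (like Chunk 7 above)."""
    while True:
        data = f.read(chunk_size)
        if not data:
            return
        yield data

# Scaled-down demo: 500 KB file, 64 KB chunks
fake_file = io.BytesIO(b"\0" * 500 * 1024)
sizes = [len(c) for c in chunk_stream(fake_file, 64 * 1024)]
print(len(sizes), sizes[-1] // 1024)  # 8 chunks; 52 KB tail
```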

Chunk size trade-offs:

Size           | Pros                                         | Cons
---------------|----------------------------------------------|-------------------------------------------
Small (1-4 MB) | Fast random access, fine-grained dedup       | More metadata, more chunks to track
Medium (64 MB) | Balanced; good for large files (GFS default) | Coarser dedup than small chunks
Large (256 MB) | Fewer chunks, less metadata                  | Waste for small files, slow partial reads

Deduplication with Content-Addressable Storage

chunk_id = SHA-256(chunk_data)

File A: [abc123, def456, ghi789]
File B: [abc123, xyz999, ghi789]
              ^               ^
         Shared chunks -- stored only once
  • File-level dedup: Hash entire file; catches exact duplicates only
  • Chunk-level dedup: Hash each chunk; catches partial duplicates (more savings)
  • Variable-size chunking (Rabin fingerprint): Better dedup ratio for shifted content, more implementation complexity
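
Chunk-level dedup with content addressing can be sketched as a tiny in-memory store (the file names and chunk bytes are illustrative):

```python
import hashlib

class ChunkStore:
    """Content-addressable store sketch: chunk ID = SHA-256 of the
    bytes, so identical chunks across files are stored exactly once."""
    def __init__(self):
        self.chunks = {}   # sha256 hex -> bytes
        self.files = {}    # file name -> ordered list of chunk IDs

    def put_file(self, name, chunks):
        ids = []
        for data in chunks:
            cid = hashlib.sha256(data).hexdigest()
            self.chunks[cid] = data    # same ID -> same bytes: free dedup
            ids.append(cid)
        self.files[name] = ids

store = ChunkStore()
store.put_file("A", [b"chunk-1", b"chunk-2", b"chunk-3"])
store.put_file("B", [b"chunk-1", b"chunk-X", b"chunk-3"])
print(len(store.chunks))  # 4 unique chunks stored, not 6
```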

Pre-Signed URLs

Allow clients to upload/download directly from object storage without exposing credentials or routing traffic through app servers.

Upload flow:
  1. Client --> App Server: "I want to upload photo.jpg"
  2. App Server generates pre-signed PUT URL (expires in 15 min)
     URL = https://bucket.s3.amazonaws.com/photo.jpg
           ?X-Amz-Algorithm=AWS4-HMAC-SHA256
           &X-Amz-Credential=...
           &X-Amz-Expires=900
           &X-Amz-Signature=abc123...
  3. Client uploads directly to S3 using the pre-signed URL
  4. S3 validates signature and accepts/rejects the upload

Download flow:
  1. Client --> App Server: "I want to download report.pdf"
  2. App Server checks authorization, generates pre-signed GET URL
  3. Client fetches directly from S3

Why this matters:

  • App server is not a bandwidth bottleneck
  • Traffic flows directly between client and storage
  • Fine-grained access control (specific key, specific operation, time-limited)
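
The signing idea can be sketched with a plain HMAC. This is deliberately simplified, not AWS Signature V4 (which additionally scopes the credential by date, region, and service); the host and secret are invented:

```python
import hashlib, hmac, time
from urllib.parse import urlencode

SECRET = b"server-side-secret"   # hypothetical key known only to the server

def presign(method, key, expires_in=900):
    """Simplified pre-signed URL: sign method + key + expiry with an
    HMAC only the server can produce. NOT real AWS SigV4."""
    expires = int(time.time()) + expires_in
    sig = hmac.new(SECRET, f"{method}\n{key}\n{expires}".encode(),
                   hashlib.sha256).hexdigest()
    return (f"https://storage.example.com/{key}?"
            + urlencode({"Expires": expires, "Signature": sig}))

def verify(method, key, expires, signature):
    """Storage side: recompute the HMAC and check expiry."""
    expected = hmac.new(SECRET, f"{method}\n{key}\n{expires}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature) and time.time() < expires

url = presign("PUT", "photo.jpg")
print(url.split("?")[0])
```

Note the signature binds the HTTP method too, so a URL pre-signed for PUT cannot be replayed as a GET.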

Multipart Upload for Large Files

For files >100 MB, upload in parts for reliability and parallelism.

1. Initiate multipart upload --> get upload_id
2. Upload parts in parallel:
   Part 1 (5-64 MB) --> ETag: "aaa"
   Part 2 (5-64 MB) --> ETag: "bbb"
   Part 3 (remaining) --> ETag: "ccc"
3. Complete upload (send ordered list of part numbers + ETags)
4. Storage assembles the final object

On failure: retry only the failed part (not the entire file)

Benefits:

  • Resume interrupted uploads (only re-upload failed parts)
  • Parallel uploads from multiple threads
  • No single-request size limits (S3 single PUT: 5 GB; multipart: 5 TB)

Cleanup: Abort incomplete multipart uploads to avoid charges. Use lifecycle rules to auto-delete after N days.
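
The part/ETag bookkeeping can be sketched as a toy class (per-part MD5 ETags mimic S3's convention; everything else is a simplification, not the real S3 API):

```python
import hashlib, uuid

class MultipartUpload:
    """Multipart-upload sketch: parts arrive in any order (and can be
    retried independently); completion assembles them by part number."""
    def __init__(self):
        self.upload_id = uuid.uuid4().hex   # step 1: initiate
        self.parts = {}                     # part number -> (etag, bytes)

    def upload_part(self, number, data):
        etag = hashlib.md5(data).hexdigest()
        self.parts[number] = (etag, data)
        return etag

    def complete(self, manifest):
        """manifest: ordered (part number, etag) pairs the client saw."""
        for num, etag in manifest:
            if self.parts.get(num, (None,))[0] != etag:
                raise ValueError(f"part {num} missing or corrupt -- retry it")
        return b"".join(self.parts[n][1] for n, _ in manifest)

up = MultipartUpload()
e2 = up.upload_part(2, b"world")        # parts may arrive out of order
e1 = up.upload_part(1, b"hello ")
print(up.complete([(1, e1), (2, e2)]))
```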


CDN Integration for Serving

  User --> CDN Edge (cache) --> Origin (S3 / storage service)
              |
         Cache HIT: serve directly (~5-20ms)
         Cache MISS: fetch from origin, cache, serve (~100ms+)

Cache invalidation strategies:

Strategy          | Mechanism                       | Best For
------------------|---------------------------------|---------------------------------
TTL-based         | Cache-Control: max-age=86400    | Content that changes predictably
Versioned URLs    | /img/avatar_v3.webp             | Controlled releases
Content-hash URLs | /img/abc123def.webp             | Immutable assets (cache forever)
Explicit purge    | CDN API call to invalidate path | Emergency updates
Best practice: Content-hash URLs for static assets (infinite cache), short TTLs for dynamic content.
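
Content-hash URLs are easy to generate at build or upload time. A sketch (the /static/ prefix and the 12-hex-character truncation are arbitrary choices for illustration):

```python
import hashlib

def asset_url(path, content):
    """Content-hash URL sketch: embed a digest of the bytes in the
    name, so changed content yields a new URL and the old one can be
    cached forever (e.g. Cache-Control: max-age=31536000, immutable)."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    stem, _, ext = path.rpartition(".")
    return f"/static/{stem}.{digest}.{ext}"

v1 = asset_url("img/avatar.webp", b"old bytes")
v2 = asset_url("img/avatar.webp", b"new bytes")
print(v1 != v2)  # changed content -> new URL; no cache purge needed
```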


Cost Optimization Strategies

Strategy                                    | Savings                       | Complexity
--------------------------------------------|-------------------------------|-------------------------
Lifecycle policies (hot -> cold -> archive) | 60-90% on aged data           | Low
Compression (gzip, zstd before upload)      | 30-70% depending on data type | Low
Chunk-level deduplication                   | 20-50% depending on overlap   | Medium
Right-sizing storage class                  | 40-80%                        | Low
Deleting unused data                        | 100% of waste                 | Low (governance is hard)
Tiered storage (SSD hot, HDD cold)          | 50-70% on cold data           | Medium

Interview Tips

  1. Know the three storage types cold. Block vs file vs object is a common opening question. Explain trade-offs in terms of latency, scalability, and mutability.
  2. S3 consistency model changed. Mentioning the 2020 strong consistency upgrade shows up-to-date knowledge.
  3. Explain pre-signed URLs. This pattern appears in almost every file-upload design question.
  4. Chunking + dedup is the core insight for any "design Dropbox/Google Drive" question.
  5. Don't forget cost. Storage cost optimization is a real production concern that senior-level interviewers appreciate.
  6. HDFS/GFS architecture is relevant for "design a distributed file system" and big data pipeline questions. Know the NameNode bottleneck and how rack-aware placement works.

Resources

  • DDIA (Kleppmann) -- Chapter 3: Storage and Retrieval
  • System Design Interview Vol. 1 (Alex Xu) -- Chapter on Google Drive design
  • GFS Paper: "The Google File System" (Ghemawat, Gobioff, Leung, 2003)
  • HDFS Architecture Guide: Apache Hadoop documentation
  • AWS S3 Documentation: Consistency model, storage classes, multipart upload, pre-signed URLs
  • Dropbox Engineering Blog: "Optimizing File Storage" (chunking and dedup)
  • Facebook Haystack Paper: "Finding a Needle in Haystack" (photo storage at scale)
