# 23 - Data Pipelines & ETL

Previous: 22 - TODO | Next: 24 - Reliability Engineering
## Why This Matters in Interviews
Data pipelines are the backbone of analytics, ML, and real-time features at FAANG scale. Interviewers expect you to reason about batch vs stream, trade-offs between latency and throughput, and how real systems like Spark/Flink fit together.
## Batch vs Stream Processing
| Dimension | Batch | Stream |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (optimized for bulk) | Variable (per-event overhead) |
| Completeness | Processes complete datasets | Works with incomplete/unbounded data |
| Complexity | Simpler (bounded input) | Harder (ordering, late data, state) |
| Fault tolerance | Rerun the whole job | Checkpointing, exactly-once semantics |
| Use cases | Reports, training ML models, backfills | Fraud detection, live dashboards, alerts |
## MapReduce

### Concept

A programming model for processing large datasets in parallel across a cluster.

```
Input Data
     |
  [Split] --> [Map] --> [Shuffle & Sort] --> [Reduce] --> Output
  [Split] --> [Map] ------/                  [Reduce]
  [Split] --> [Map] -----/                   [Reduce]
```
### Word Count Example (Pseudocode)

```
MAP(document):
    for each word in document:
        emit(word, 1)

REDUCE(word, counts[]):
    emit(word, sum(counts))
```

```
Input:      "the cat sat on the mat"
Map output: (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
Shuffle:    the -> [1,1]  cat -> [1]  sat -> [1] ...
Reduce:     the -> 2  cat -> 1  sat -> 1 ...
```
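The pseudocode above can be sketched as plain Python, simulating the map, shuffle, and reduce phases in a single process (a real MapReduce framework distributes these across a cluster and handles the shuffle over the network):

```python
from collections import defaultdict

def map_phase(document):
    # MAP: emit a (word, 1) pair for every word in the document
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # SHUFFLE & SORT: group all emitted values by key
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # REDUCE: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

pairs = map_phase("the cat sat on the mat")
counts = reduce_phase(shuffle_phase(pairs))
print(counts["the"])  # 2
```

Each phase is independently parallelizable: maps run per input split, and reduces run per key group, which is exactly what makes the model scale.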
### Limitations
- Disk I/O bottleneck: Intermediate results written to disk between stages
- No iterative processing: Each iteration = full read/write cycle (bad for ML)
- High latency: Not suitable for real-time or interactive queries
- Rigid two-phase model: Complex pipelines require chaining multiple MR jobs
- Programming complexity: Low-level API compared to SQL-like alternatives
## Apache Spark

### Why Faster Than MapReduce

```
MapReduce: Disk --> Map --> Disk --> Reduce --> Disk
Spark:     Disk --> [Map --> Filter --> Reduce] (all in-memory) --> Disk
```

Spark keeps intermediate data in memory (up to 100x faster for iterative workloads).
### Core Abstractions
| Abstraction | Description | When to Use |
|---|---|---|
| RDD (Resilient Distributed Dataset) | Immutable distributed collection, lazy evaluation | Low-level control, custom partitioning |
| DataFrame | Distributed table with named columns, optimized by Catalyst | Most ETL, SQL-like operations |
| Dataset | Type-safe DataFrame (JVM languages) | Type safety in Scala/Java |
### Key Features
- Lazy evaluation: Transformations build a DAG; actions trigger execution
- Catalyst optimizer: Rewrites query plans for optimal execution
- Tungsten engine: Off-heap memory management, code generation
- Unified API: Batch, streaming (Structured Streaming), ML, graph all in one
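Lazy evaluation can be illustrated with a minimal pure-Python sketch, no Spark required: transformations only record a plan, and nothing executes until an action is called. The `LazyPipeline` class and its methods are illustrative stand-ins, not Spark's actual API:

```python
class LazyPipeline:
    """Toy stand-in for an RDD/DataFrame: records transformations, runs on action."""
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []  # the DAG, recorded as an ordered list of stages

    def map(self, fn):
        # Transformation: returns a new pipeline, executes nothing yet
        return LazyPipeline(self.data, self.plan + [("map", fn)])

    def filter(self, pred):
        # Transformation: also just extends the plan
        return LazyPipeline(self.data, self.plan + [("filter", pred)])

    def collect(self):
        # Action: only now does the recorded plan actually run
        out = self.data
        for kind, fn in self.plan:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

result = (LazyPipeline([1, 2, 3, 4, 5])
          .map(lambda x: x * 10)
          .filter(lambda x: x > 20)
          .collect())
print(result)  # [30, 40, 50]
```

Deferring execution like this is what lets Spark's Catalyst optimizer see the whole plan and rewrite it (e.g., pushing filters earlier) before any data moves.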
### Spark Structured Streaming

```
        +-- Read from Kafka --+
        |                     |
        v                     v
 [Micro-batch 1]       [Micro-batch 2]
        |                     |
    [Process]             [Process]
        |                     |
 [Write to sink]       [Write to sink]
```

Near-real-time (seconds of latency) with a batch-like programming model.
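The micro-batch loop can be simulated in a few lines: the incoming stream is chunked into small batches, and each batch is processed with ordinary batch logic whose results are merged into a running sink. The batch size and the click-counting aggregation are illustrative assumptions:

```python
def micro_batches(events, batch_size):
    # Chunk an unbounded-looking stream into small, bounded batches
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

clicks = ["ad1", "ad2", "ad1", "ad3", "ad1", "ad2"]

totals = {}  # the "sink": a running aggregate updated per micro-batch
for batch in micro_batches(clicks, batch_size=2):
    # Each micro-batch is processed with plain batch logic
    for ad in batch:
        totals[ad] = totals.get(ad, 0) + 1

print(totals)  # {'ad1': 3, 'ad2': 2, 'ad3': 1}
```

This is the core trade-off: latency is bounded below by the batch interval, but each batch gets the simplicity and fault-tolerance story of batch processing.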
## Apache Flink

### True Streaming

Unlike Spark's micro-batch approach, Flink processes each event individually as it arrives.

```
Source --> [Operator] --> [Operator] --> [Operator] --> Sink
                ^              ^              ^
           (stateful)     (stateful)     (stateful)

Per-event processing with checkpointed state
```
### Event Time vs Processing Time
| Concept | Definition | Example |
|---|---|---|
| Event time | When the event actually occurred | Sensor reading timestamp |
| Processing time | When the system processes the event | Wall clock at ingestion |
| Ingestion time | When the event enters the system | Kafka append time |
Why it matters: Events arrive out of order. A click at 10:00:01 may arrive at 10:00:05. Event-time processing gives correct results despite network delays.
### Watermarks

Watermarks tell Flink "all events up to time T have arrived."

```
Timeline:
Events: e(9:58)  e(10:01)  e(9:59)  e(10:02)  W(10:00)  e(10:03)
                                                  ^
                         Watermark: "no more events < 10:00"
                         Window [9:55-10:00] can now close
```
- Heuristic watermarks: Allow configurable late-event tolerance
- Late events: Can be dropped, redirected to a side output, or trigger updates
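The timeline above can be sketched as a toy event-time window that closes once a watermark passes its end, with late events redirected to a side output. This models the mechanism, not Flink's API; times are minute-of-day integers (9:58 becomes 598) to keep comparisons simple:

```python
# The [9:55, 10:00) event-time window, as minute-of-day integers
window_start, window_end = 595, 600
in_window, side_output = [], []
window_closed = False

# Out-of-order stream: events carry event time; W is a watermark
stream = [("event", 598), ("event", 601), ("event", 599),
          ("event", 602), ("watermark", 600), ("event", 597)]

for kind, ts in stream:
    if kind == "watermark":
        # Watermark T promises no more events with event time < T,
        # so any window ending at or before T may close and emit
        if ts >= window_end:
            window_closed = True
    elif window_start <= ts < window_end:
        if window_closed:
            side_output.append(ts)  # late event: its window already closed
        else:
            in_window.append(ts)

print(in_window, side_output)  # [598, 599] [597]
```

Note how 599 is accepted even though 601 arrived before it: event-time windows assign events by when they occurred, not when they showed up.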
### Flink vs Spark Streaming
| Feature | Flink | Spark Structured Streaming |
|---|---|---|
| Model | True event-at-a-time | Micro-batch |
| Latency | Milliseconds | Seconds |
| Event time | First-class support | Supported but bolt-on |
| State management | Built-in, queryable | Via state stores |
| Exactly-once | Native via checkpoints | Via idempotent sinks |
| Batch | Supported (streaming is core) | Native |
## ETL vs ELT

```
ETL: Source --> [Transform] --> Load into Warehouse
ELT: Source --> Load into Lake/Warehouse --> [Transform in-place]
```
| Aspect | ETL | ELT |
|---|---|---|
| Transform location | Dedicated ETL server | Inside the target system |
| Data loaded | Clean, structured | Raw, then transformed |
| Scalability | Limited by ETL server | Leverages warehouse compute |
| Flexibility | Schema-on-write | Schema-on-read |
| Modern trend | Legacy (Informatica) | Cloud-native (dbt, BigQuery, Snowflake) |
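The two orderings can be contrasted in a toy sketch, with lists of dicts standing in for the source, the lake, and the warehouse (all names here are illustrative):

```python
source_rows = [{"name": " Alice ", "age": "30"}, {"name": "Bob", "age": "25"}]

def transform(row):
    # Cleaning step: trim strings, cast types (schema enforcement)
    return {"name": row["name"].strip(), "age": int(row["age"])}

# ETL: transform on the way in; the warehouse only ever sees clean rows
etl_warehouse = [transform(r) for r in source_rows]

# ELT: load raw data first (schema-on-read), transform later
# inside the target system, typically with SQL or dbt models
elt_lake = list(source_rows)                      # raw landing zone
elt_warehouse = [transform(r) for r in elt_lake]  # in-place transform step

print(etl_warehouse == elt_warehouse)  # True: same result, different ordering
```

The end state is identical; what differs is where the compute happens and that ELT keeps the raw copy around for reprocessing with new transformations later.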
## Data Lake vs Data Warehouse vs Data Lakehouse

```
+------------------+   +------------------+   +-------------------------+
|    Data Lake     |   |  Data Warehouse  |   |     Data Lakehouse      |
|                  |   |                  |   |                         |
| Raw files (S3)   |   | Structured tables|   | Raw files + table layer |
| Schema-on-read   |   | Schema-on-write  |   | ACID + schema evolution |
| Cheap storage    |   | Expensive compute|   | Best of both            |
| No ACID          |   | ACID guarantees  |   | Delta Lake / Iceberg    |
+------------------+   +------------------+   +-------------------------+
```
| Feature | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage cost | Low (object storage) | High (proprietary) | Low |
| Query performance | Slow without optimization | Fast (indexes, columnar) | Fast (file-level metadata) |
| ACID transactions | No | Yes | Yes |
| Data types | Any (structured, semi, unstructured) | Structured only | Any |
| Examples | S3 + Athena | Snowflake, BigQuery, Redshift | Delta Lake, Apache Iceberg, Hudi |
## Change Data Capture (CDC)

CDC captures row-level changes (INSERT, UPDATE, DELETE) from a database and streams them downstream.

### CDC with Debezium

```
+------------+      +-----------+      +-------+      +------------+
|  Postgres  | ---> | Debezium  | ---> | Kafka | ---> |  Consumer  |
|  (source)  |      | Connector |      |       |      | (Flink,    |
| WAL/binlog |      +-----------+      +-------+      |  Spark,    |
+------------+                                        |  warehouse)|
                                                      +------------+
```
How it works:
- Debezium reads the database's transaction log (WAL for Postgres, binlog for MySQL)
- Converts each change to a structured event
- Publishes to Kafka topic (one per table)
- Downstream consumers process changes in near-real-time
Benefits:
- No polling, no queries against source DB
- Captures all changes including deletes
- Preserves ordering within a partition
- Minimal impact on source database performance
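A downstream CDC consumer can be sketched as applying row-level change events to a materialized view (here, a dict standing in for a search index). The event shape below mirrors Debezium's op codes ("c" = create, "u" = update, "d" = delete) but is heavily simplified; real Debezium events also carry `before` state, source metadata, and schema information:

```python
# Simplified CDC events, modeled on Debezium's op codes
events = [
    {"op": "c", "key": 1, "after": {"title": "Intro to CDC"}},
    {"op": "c", "key": 2, "after": {"title": "Spark Basics"}},
    {"op": "u", "key": 1, "after": {"title": "CDC in Depth"}},
    {"op": "d", "key": 2, "after": None},
]

search_index = {}  # materialized view kept in sync with the source table
for event in events:
    if event["op"] in ("c", "u"):
        # Upsert: creates and updates both write the latest row state
        search_index[event["key"]] = event["after"]
    elif event["op"] == "d":
        # Deletes remove the document from the view
        search_index.pop(event["key"], None)

print(search_index)  # {1: {'title': 'CDC in Depth'}}
```

Because the log preserves per-key ordering, replaying these events is idempotent-friendly: applying the same sequence again yields the same index, which is what makes CDC a clean answer to "keep the search index in sync with the database."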
## Lambda Architecture vs Kappa Architecture

### Lambda Architecture

```
                    +-- [Batch Layer] ---> Batch View -----+
Raw Data --> Kafka -|                                      |--> Serving Layer --> Query
                    +-- [Speed Layer] ---> Real-time View -+
```
- Batch layer: Recomputes complete views periodically (correctness)
- Speed layer: Processes recent data in real-time (low latency)
- Serving layer: Merges both views for queries
Problem: You maintain two codebases (batch + stream) doing the same logic.
### Kappa Architecture

```
Raw Data --> Kafka --> [Stream Processor] --> Serving Layer --> Query
               ^               |
               |--- Replay from Kafka if needed ---|
```
- Single processing path: Everything is a stream
- Reprocessing: Replay the Kafka log with updated logic
- Simpler: One codebase, one pipeline
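Kappa-style reprocessing can be sketched as replaying the retained log from the beginning with updated logic, instead of maintaining a separate batch codebase. The event format and the `weight` rule change are illustrative assumptions:

```python
log = ["click:ad1", "click:ad2", "click:ad1"]  # the retained Kafka log

def process(event, view, weight):
    # The single stream-processor codebase; `weight` stands in
    # for some business rule that changes between versions
    ad = event.split(":")[1]
    view[ad] = view.get(ad, 0) + weight

# v1 pipeline: consume the log as events arrive
view_v1 = {}
for event in log:
    process(event, view_v1, weight=1)

# Business logic changes (v2): instead of writing a second batch
# pipeline, replay the same log from offset 0 into a fresh view
view_v2 = {}
for event in log:
    process(event, view_v2, weight=2)

print(view_v1, view_v2)  # {'ad1': 2, 'ad2': 1} {'ad1': 4, 'ad2': 2}
```

The catch in practice is log retention: replay only works for as far back as Kafka keeps the data, which is one reason Lambda survives where full-history recomputation is required.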
### Comparison
| Aspect | Lambda | Kappa |
|---|---|---|
| Complexity | High (two pipelines) | Lower (one pipeline) |
| Correctness | Batch ensures accuracy | Relies on stream correctness |
| Reprocessing | Native (batch recomputes) | Replay from log |
| Latency | Real-time via speed layer | Real-time native |
| Use case | When batch correctness is critical | When stream processing is sufficient |
## Choosing Batch vs Stream

```
Is low latency required (< 1 min)?
    |
    YES --> Stream processing
    |           |
    |       Is exactly-once critical?
    |           YES --> Flink with checkpointing
    |           NO  --> Kafka Streams / simple consumer
    |
    NO --> Is the dataset bounded?
               YES --> Batch (Spark, traditional ETL)
               NO  --> Micro-batch (Spark Structured Streaming)
```
### Decision Factors
| Factor | Favors Batch | Favors Stream |
|---|---|---|
| Latency tolerance | Hours/minutes OK | Seconds/sub-second needed |
| Data completeness | Need full picture | Partial/approximate OK |
| Resource cost | Cheaper (bursty compute) | Continuous compute cost |
| Complexity budget | Lower (bounded input) | Higher (state, ordering) |
| Use case examples | Daily reports, ML training | Fraud detection, live feed |
## Interview Tips
- Don't default to one approach. State requirements first, then justify batch vs stream.
- Know Spark vs Flink trade-offs cold. Interviewers at data-heavy companies (Meta, Netflix) will probe here.
- CDC is a modern must-know. "How do you keep the search index in sync with the database?" = CDC + Kafka.
- Lambda vs Kappa is a classic. Know why Kappa simplifies but when Lambda is still justified.
- Always mention exactly-once semantics. Show you understand the hard problem in stream processing.
- Data lakehouse is the current trend. Delta Lake / Iceberg show you follow the industry.
## Common Interview Questions
- "Design a real-time analytics pipeline for ad click tracking"
- "How would you rebuild a search index from a database?"
- "Your daily batch job takes 8 hours. Users want fresher data. What do you do?"
- "Explain the difference between event time and processing time with an example"
- "How does exactly-once processing work in Kafka + Flink?"
## Resources
- DDIA Chapter 10: Batch Processing (MapReduce, dataflow engines)
- DDIA Chapter 11: Stream Processing (event streams, state, joins)
- Spark: The Definitive Guide - Chambers & Zaharia
- Streaming Systems - Akidau, Chernyak, Lax (the watermarks bible)
- Confluent blog: "Putting Apache Kafka to Use" series
- Martin Kleppmann's talks on event sourcing and stream processing
Previous: 22 - TODO | Next: 24 - Reliability Engineering