# 23 - Data Pipelines & ETL

Previous: 22 - TODO | Next: 24 - Reliability Engineering
## Why This Matters in Interviews
Data pipelines are the backbone of analytics, ML, and real-time features at FAANG scale. Interviewers expect you to reason about batch vs stream, trade-offs between latency and throughput, and how real systems like Spark/Flink fit together.
## Batch vs Stream Processing
| Dimension | Batch | Stream |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (optimized for bulk) | Variable (per-event overhead) |
| Completeness | Processes complete datasets | Works with incomplete/unbounded data |
| Complexity | Simpler (bounded input) | Harder (ordering, late data, state) |
| Fault tolerance | Rerun the whole job | Checkpointing, exactly-once semantics |
| Use cases | Reports, training ML models, backfills | Fraud detection, live dashboards, alerts |
## MapReduce

### Concept

A programming model for processing large datasets in parallel across a cluster.

```
Input Data
     |
  [Split] --> [Map] --> [Shuffle & Sort] --> [Reduce] --> Output
  [Split] --> [Map] ------/                  [Reduce]
  [Split] --> [Map] -----/                   [Reduce]
```
### Word Count Example (Pseudocode)

```
MAP(document):
    for each word in document:
        emit(word, 1)

REDUCE(word, counts[]):
    emit(word, sum(counts))
```

```
Input:      "the cat sat on the mat"
Map output: (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
Shuffle:    the -> [1,1]  cat -> [1]  sat -> [1] ...
Reduce:     the -> 2  cat -> 1  sat -> 1 ...
```
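The pseudocode above can be sketched as plain Python, simulating the map, shuffle, and reduce phases in a single process (a real MapReduce framework distributes these across a cluster and handles the shuffle over the network):

```python
from collections import defaultdict

def map_phase(document):
    # MAP: emit a (word, 1) pair for every word in the document
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # SHUFFLE & SORT: group all emitted values by key
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # REDUCE: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

pairs = map_phase("the cat sat on the mat")
counts = reduce_phase(shuffle_phase(pairs))
print(counts["the"])  # 2
```

Each phase is independently parallelizable: maps run per input split, and reduces run per key group, which is exactly what makes the model scale.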
### Limitations
- Disk I/O bottleneck: Intermediate results written to disk between stages
- No iterative processing: Each iteration = full read/write cycle (bad for ML)
- High latency: Not suitable for real-time or interactive queries
- Rigid two-phase model: Complex pipelines require chaining multiple MR jobs
- Programming complexity: Low-level API compared to SQL-like alternatives
## Apache Spark

### Why Faster Than MapReduce

```
MapReduce: Disk --> Map --> Disk --> Reduce --> Disk
Spark:     Disk --> [Map --> Filter --> Reduce] (all in-memory) --> Disk
```

Spark keeps intermediate data in memory (up to 100x faster for iterative workloads).
### Core Abstractions
| Abstraction | Description | When to Use |
|---|---|---|
| RDD (Resilient Distributed Dataset) | Immutable distributed collection, lazy evaluation | Low-level control, custom partitioning |
| DataFrame | Distributed table with named columns, optimized by Catalyst | Most ETL, SQL-like operations |
| Dataset | Type-safe DataFrame (JVM languages) | Type safety in Scala/Java |
### Key Features
- Lazy evaluation: Transformations build a DAG; actions trigger execution
- Catalyst optimizer: Rewrites query plans for optimal execution
- Tungsten engine: Off-heap memory management, code generation
- Unified API: Batch, streaming (Structured Streaming), ML, graph all in one
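Lazy evaluation can be illustrated with a minimal pure-Python sketch, no Spark required: transformations only record a plan, and nothing executes until an action is called. The `LazyPipeline` class and its methods are illustrative stand-ins, not Spark's actual API:

```python
class LazyPipeline:
    """Toy stand-in for an RDD/DataFrame: records transformations, runs on action."""
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []  # the DAG, recorded as an ordered list of stages

    def map(self, fn):
        # Transformation: returns a new pipeline, executes nothing yet
        return LazyPipeline(self.data, self.plan + [("map", fn)])

    def filter(self, pred):
        # Transformation: also just extends the plan
        return LazyPipeline(self.data, self.plan + [("filter", pred)])

    def collect(self):
        # Action: only now does the recorded plan actually run
        out = self.data
        for kind, fn in self.plan:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

result = (LazyPipeline([1, 2, 3, 4, 5])
          .map(lambda x: x * 10)
          .filter(lambda x: x > 20)
          .collect())
print(result)  # [30, 40, 50]
```

Deferring execution like this is what lets Spark's Catalyst optimizer see the whole plan and rewrite it (e.g., pushing filters earlier) before any data moves.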
### Spark Structured Streaming

```
        +-- Read from Kafka --+
        |                     |
        v                     v
 [Micro-batch 1]       [Micro-batch 2]
        |                     |
    [Process]             [Process]
        |                     |
 [Write to sink]       [Write to sink]
```

Near-real-time (seconds of latency) with a batch-like programming model.
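The micro-batch loop can be simulated in a few lines: the incoming stream is chunked into small batches, and each batch is processed with ordinary batch logic whose results are merged into a running sink. The batch size and the click-counting aggregation are illustrative assumptions:

```python
def micro_batches(events, batch_size):
    # Chunk an unbounded-looking stream into small, bounded batches
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

clicks = ["ad1", "ad2", "ad1", "ad3", "ad1", "ad2"]

totals = {}  # the "sink": a running aggregate updated per micro-batch
for batch in micro_batches(clicks, batch_size=2):
    # Each micro-batch is processed with plain batch logic
    for ad in batch:
        totals[ad] = totals.get(ad, 0) + 1

print(totals)  # {'ad1': 3, 'ad2': 2, 'ad3': 1}
```

This is the core trade-off: latency is bounded below by the batch interval, but each batch gets the simplicity and fault-tolerance story of batch processing.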
## Apache Flink

### True Streaming

Unlike Spark's micro-batch approach, Flink processes each event individually as it arrives.

```
Source --> [Operator] --> [Operator] --> [Operator] --> Sink
                ^              ^              ^
           (stateful)     (stateful)     (stateful)

Per-event processing with checkpointed state
```
### Event Time vs Processing Time
| Concept | Definition | Example |
|---|---|---|
| Event time | When the event actually occurred | Sensor reading timestamp |
| Processing time | When the system processes the event | Wall clock at ingestion |
| Ingestion time | When the event enters the system | Kafka append time |
Why it matters: Events arrive out of order. A click at 10:00:01 may arrive at 10:00:05. Event-time processing gives correct results despite network delays.
### Watermarks

Watermarks tell Flink "all events up to time T have arrived."

```
Timeline:
Events: e(9:58)  e(10:01)  e(9:59)  e(10:02)  W(10:00)  e(10:03)
                                                  ^
                         Watermark: "no more events < 10:00"
                         Window [9:55-10:00] can now close
```
- Heuristic watermarks: Allow configurable late-event tolerance
- Late events: Can be dropped, redirected to a side output, or trigger updates
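The timeline above can be sketched as a toy event-time window that closes once a watermark passes its end, with late events redirected to a side output. This models the mechanism, not Flink's API; times are minute-of-day integers (9:58 becomes 598) to keep comparisons simple:

```python
# The [9:55, 10:00) event-time window, as minute-of-day integers
window_start, window_end = 595, 600
in_window, side_output = [], []
window_closed = False

# Out-of-order stream: events carry event time; W is a watermark
stream = [("event", 598), ("event", 601), ("event", 599),
          ("event", 602), ("watermark", 600), ("event", 597)]

for kind, ts in stream:
    if kind == "watermark":
        # Watermark T promises no more events with event time < T,
        # so any window ending at or before T may close and emit
        if ts >= window_end:
            window_closed = True
    elif window_start <= ts < window_end:
        if window_closed:
            side_output.append(ts)  # late event: its window already closed
        else:
            in_window.append(ts)

print(in_window, side_output)  # [598, 599] [597]
```

Note how 599 is accepted even though 601 arrived before it: event-time windows assign events by when they occurred, not when they showed up.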
### Flink vs Spark Streaming
| Feature | Flink | Spark Structured Streaming |
|---|---|---|
| Model | True event-at-a-time | Micro-batch |
| Latency | Milliseconds | Seconds |
| Event time | First-class support | Supported but bolt-on |
| State management | Built-in, queryable | Via state stores |
| Exactly-once | Native via checkpoints | Via idempotent sinks |
| Batch | Supported (streaming is core) | Native |
## ETL vs ELT

```
ETL: Source --> [Transform] --> Load into Warehouse
ELT: Source --> Load into Lake/Warehouse --> [Transform in-place]
```
| Aspect | ETL | ELT |
|---|---|---|
| Transform location | Dedicated ETL server | Inside the target system |
| Data loaded | Clean, structured | Raw, then transformed |
| Scalability | Limited by ETL server | Leverages warehouse compute |
| Flexibility | Schema-on-write | Schema-on-read |
| Modern trend | Legacy (Informatica) | Cloud-native (dbt, BigQuery, Snowflake) |
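The two orderings can be contrasted in a toy sketch, with lists of dicts standing in for the source, the lake, and the warehouse (all names here are illustrative):

```python
source_rows = [{"name": " Alice ", "age": "30"}, {"name": "Bob", "age": "25"}]

def transform(row):
    # Cleaning step: trim strings, cast types (schema enforcement)
    return {"name": row["name"].strip(), "age": int(row["age"])}

# ETL: transform on the way in; the warehouse only ever sees clean rows
etl_warehouse = [transform(r) for r in source_rows]

# ELT: load raw data first (schema-on-read), transform later
# inside the target system, typically with SQL or dbt models
elt_lake = list(source_rows)                      # raw landing zone
elt_warehouse = [transform(r) for r in elt_lake]  # in-place transform step

print(etl_warehouse == elt_warehouse)  # True: same result, different ordering
```

The end state is identical; what differs is where the compute happens and that ELT keeps the raw copy around for reprocessing with new transformations later.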
## Data Lake vs Data Warehouse vs Data Lakehouse

```
+------------------+   +------------------+   +-------------------------+
|    Data Lake     |   |  Data Warehouse  |   |     Data Lakehouse      |
|                  |   |                  |   |                         |
| Raw files (S3)   |   | Structured tables|   | Raw files + table layer |
| Schema-on-read   |   | Schema-on-write  |   | ACID + schema evolution |
| Cheap storage    |   | Expensive compute|   | Best of both            |
| No ACID          |   | ACID guarantees  |   | Delta Lake / Iceberg    |
+------------------+   +------------------+   +-------------------------+
```
| Feature | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage cost | Low (object storage) | High (proprietary) | Low |
| Query performance | Slow without optimization | Fast (indexes, columnar) | Fast (file-level metadata) |
| ACID transactions | No | Yes | Yes |
| Data types | Any (structured, semi, unstructured) | Structured only | Any |
| Examples | S3 + Athena | Snowflake, BigQuery, Redshift | Delta Lake, Apache Iceberg, Hudi |
## Change Data Capture (CDC)

CDC captures row-level changes (INSERT, UPDATE, DELETE) from a database and streams them downstream.

### CDC with Debezium

```
+------------+      +-----------+      +-------+      +------------+
|  Postgres  | ---> | Debezium  | ---> | Kafka | ---> |  Consumer  |
|  (source)  |      | Connector |      |       |      | (Flink,    |
| WAL/binlog |      +-----------+      +-------+      |  Spark,    |
+------------+                                        |  warehouse)|
                                                      +------------+
```
How it works:
- Debezium reads the database's transaction log (WAL for Postgres, binlog for MySQL)
- Converts each change to a structured event
- Publishes to Kafka topic (one per table)
- Downstream consumers process changes in near-real-time
Benefits:
- No polling, no queries against source DB
- Captures all changes including deletes
- Preserves ordering within a partition
- Minimal impact on source database performance
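A downstream CDC consumer can be sketched as applying row-level change events to a materialized view (here, a dict standing in for a search index). The event shape below mirrors Debezium's op codes ("c" = create, "u" = update, "d" = delete) but is heavily simplified; real Debezium events also carry `before` state, source metadata, and schema information:

```python
# Simplified CDC events, modeled on Debezium's op codes
events = [
    {"op": "c", "key": 1, "after": {"title": "Intro to CDC"}},
    {"op": "c", "key": 2, "after": {"title": "Spark Basics"}},
    {"op": "u", "key": 1, "after": {"title": "CDC in Depth"}},
    {"op": "d", "key": 2, "after": None},
]

search_index = {}  # materialized view kept in sync with the source table
for event in events:
    if event["op"] in ("c", "u"):
        # Upsert: creates and updates both write the latest row state
        search_index[event["key"]] = event["after"]
    elif event["op"] == "d":
        # Deletes remove the document from the view
        search_index.pop(event["key"], None)

print(search_index)  # {1: {'title': 'CDC in Depth'}}
```

Because the log preserves per-key ordering, replaying these events is idempotent-friendly: applying the same sequence again yields the same index, which is what makes CDC a clean answer to "keep the search index in sync with the database."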
## Lambda Architecture vs Kappa Architecture

### Lambda Architecture

```
                    +-- [Batch Layer] ---> Batch View -----+
Raw Data --> Kafka -|                                      |--> Serving Layer --> Query
                    +-- [Speed Layer] ---> Real-time View -+
```
- Batch layer: Recomputes complete views periodically (correctness)
- Speed layer: Processes recent data in real-time (low latency)
- Serving layer: Merges both views for queries
Problem: You maintain two codebases (batch + stream) doing the same logic.
### Kappa Architecture

```
Raw Data --> Kafka --> [Stream Processor] --> Serving Layer --> Query
               ^               |
               |--- Replay from Kafka if needed ---|
```
- Single processing path: Everything is a stream
- Reprocessing: Replay the Kafka log with updated logic
- Simpler: One codebase, one pipeline
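Kappa-style reprocessing can be sketched as replaying the retained log from the beginning with updated logic, instead of maintaining a separate batch codebase. The event format and the `weight` rule change are illustrative assumptions:

```python
log = ["click:ad1", "click:ad2", "click:ad1"]  # the retained Kafka log

def process(event, view, weight):
    # The single stream-processor codebase; `weight` stands in
    # for some business rule that changes between versions
    ad = event.split(":")[1]
    view[ad] = view.get(ad, 0) + weight

# v1 pipeline: consume the log as events arrive
view_v1 = {}
for event in log:
    process(event, view_v1, weight=1)

# Business logic changes (v2): instead of writing a second batch
# pipeline, replay the same log from offset 0 into a fresh view
view_v2 = {}
for event in log:
    process(event, view_v2, weight=2)

print(view_v1, view_v2)  # {'ad1': 2, 'ad2': 1} {'ad1': 4, 'ad2': 2}
```

The catch in practice is log retention: replay only works for as far back as Kafka keeps the data, which is one reason Lambda survives where full-history recomputation is required.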
### Comparison
| Aspect | Lambda | Kappa |
|---|---|---|
| Complexity | High (two pipelines) | Lower (one pipeline) |
| Correctness | Batch ensures accuracy | Relies on stream correctness |
| Reprocessing | Native (batch recomputes) | Replay from log |
| Latency | Real-time via speed layer | Real-time native |
| Use case | When batch correctness is critical | When stream processing is sufficient |
## Choosing Batch vs Stream

```
Is low latency required (< 1 min)?
    |
    YES --> Stream processing
    |           |
    |       Is exactly-once critical?
    |           YES --> Flink with checkpointing
    |           NO  --> Kafka Streams / simple consumer
    |
    NO --> Is the dataset bounded?
               YES --> Batch (Spark, traditional ETL)
               NO  --> Micro-batch (Spark Structured Streaming)
```
### Decision Factors
| Factor | Favors Batch | Favors Stream |
|---|---|---|
| Latency tolerance | Hours/minutes OK | Seconds/sub-second needed |
| Data completeness | Need full picture | Partial/approximate OK |
| Resource cost | Cheaper (bursty compute) | Continuous compute cost |
| Complexity budget | Lower (bounded input) | Higher (state, ordering) |
| Use case examples | Daily reports, ML training | Fraud detection, live feed |
## Interview Tips
- Don't default to one approach. State requirements first, then justify batch vs stream.
- Know Spark vs Flink trade-offs cold. Interviewers at data-heavy companies (Meta, Netflix) will probe here.
- CDC is a modern must-know. "How do you keep the search index in sync with the database?" = CDC + Kafka.
- Lambda vs Kappa is a classic. Know why Kappa simplifies but when Lambda is still justified.
- Always mention exactly-once semantics. Show you understand the hard problem in stream processing.
- Data lakehouse is the current trend. Delta Lake / Iceberg show you follow the industry.
## Common Interview Questions
- "Design a real-time analytics pipeline for ad click tracking"
- "How would you rebuild a search index from a database?"
- "Your daily batch job takes 8 hours. Users want fresher data. What do you do?"
- "Explain the difference between event time and processing time with an example"
- "How does exactly-once processing work in Kafka + Flink?"
## Resources
- DDIA Chapter 10: Batch Processing (MapReduce, dataflow engines)
- DDIA Chapter 11: Stream Processing (event streams, state, joins)
- Spark: The Definitive Guide - Chambers & Zaharia
- Streaming Systems - Akidau, Chernyak, Lax (the watermarks bible)
- Confluent blog: "Putting Apache Kafka to Use" series
- Martin Kleppmann's talks on event sourcing and stream processing
Previous: 22 - TODO | Next: 24 - Reliability Engineering