23 - Data Pipelines & ETL

Previous: 22 - TODO | Next: 24 - Reliability Engineering


Why This Matters in Interviews

Data pipelines are the backbone of analytics, ML, and real-time features at FAANG scale. Interviewers expect you to reason about batch vs stream, trade-offs between latency and throughput, and how real systems like Spark/Flink fit together.


Batch vs Stream Processing

| Dimension       | Batch                                  | Stream                                   |
|-----------------|----------------------------------------|------------------------------------------|
| Latency         | Minutes to hours                       | Milliseconds to seconds                  |
| Throughput      | Very high (optimized for bulk)         | Variable (per-event overhead)            |
| Completeness    | Processes complete datasets            | Works with incomplete/unbounded data     |
| Complexity      | Simpler (bounded input)                | Harder (ordering, late data, state)      |
| Fault tolerance | Rerun the whole job                    | Checkpointing, exactly-once semantics    |
| Use cases       | Reports, training ML models, backfills | Fraud detection, live dashboards, alerts |

MapReduce

Concept

A programming model for processing large datasets in parallel across a cluster.

Input Data
    |
  [Split] --> [Map] --> [Shuffle & Sort] --> [Reduce] --> Output
  [Split] --> [Map] ------/                  [Reduce]
  [Split] --> [Map] -----/                   [Reduce]

Word Count Example (Pseudocode)

MAP(document):
  for each word in document:
    emit(word, 1)

REDUCE(word, counts[]):
  emit(word, sum(counts))

Input: "the cat sat on the mat"

Map output:  (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
Shuffle:     the -> [1,1]   cat -> [1]   sat -> [1]  ...
Reduce:      the -> 2       cat -> 1     sat -> 1    ...
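The flow above can be simulated in plain Python (no cluster needed) to make the map, shuffle, and reduce phases concrete:

```python
from collections import defaultdict

def map_phase(document):
    # MAP: emit (word, 1) for every word in the document
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # SHUFFLE & SORT: group all emitted values by key
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # REDUCE: sum the grouped counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

pairs = map_phase("the cat sat on the mat")
result = reduce_phase(shuffle(pairs))
print(result)  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```

In a real cluster, the map calls run on different machines and the shuffle moves data over the network so that all values for one key land on the same reducer.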

Limitations

  • Disk I/O bottleneck: Intermediate results written to disk between stages
  • No iterative processing: Each iteration = full read/write cycle (bad for ML)
  • High latency: Not suitable for real-time or interactive queries
  • Rigid two-phase model: Complex pipelines require chaining multiple MR jobs
  • Programming complexity: Low-level API compared to SQL-like alternatives

Apache Spark

Why Faster Than MapReduce

MapReduce:  Disk --> Map --> Disk --> Reduce --> Disk
Spark:      Disk --> [Map --> Filter --> Reduce] (all in-memory) --> Disk

Spark keeps intermediate data in memory (up to 100x faster for iterative workloads).

Core Abstractions

| Abstraction | Description | When to Use |
|---|---|---|
| RDD (Resilient Distributed Dataset) | Immutable distributed collection, lazy evaluation | Low-level control, custom partitioning |
| DataFrame | Distributed table with named columns, optimized by Catalyst | Most ETL, SQL-like operations |
| Dataset | Type-safe DataFrame (JVM languages) | Type safety in Scala/Java |

Key Features

  • Lazy evaluation: Transformations build a DAG; actions trigger execution
  • Catalyst optimizer: Rewrites query plans for optimal execution
  • Tungsten engine: Off-heap memory management, code generation
  • Unified API: Batch, streaming (Structured Streaming), ML, graph all in one
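Lazy evaluation is the key idea behind the first bullet. A toy sketch in plain Python (illustrative only, not the Spark API) shows the pattern: transformations only record steps, and nothing executes until an action is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD: transformations are recorded lazily;
    nothing runs until an action (collect) is called."""
    def __init__(self, data, steps=None):
        self.data = data
        self.steps = steps or []   # the recorded plan (a linear chain here)

    def map(self, fn):             # transformation: recorded, not executed
        return ToyRDD(self.data, self.steps + [("map", fn)])

    def filter(self, pred):        # transformation: recorded, not executed
        return ToyRDD(self.data, self.steps + [("filter", pred)])

    def collect(self):             # action: triggers execution of the chain
        out = list(self.data)
        for kind, fn in self.steps:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

pipeline = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; only the call below runs the recorded chain:
print(pipeline.collect())  # [0, 4, 16]
```

Because the full chain is known before execution, an optimizer like Catalyst can reorder or fuse steps (for example, pushing the filter before an expensive map) before any data moves.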

Spark Structured Streaming

             +-- Read from Kafka --+
             |                     |
             v                     v
  [Micro-batch 1]          [Micro-batch 2]
       |                         |
   [Process]                 [Process]
       |                         |
   [Write to sink]           [Write to sink]

Near-real-time (seconds of latency) with a batch-like programming model.
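The micro-batch idea can be sketched in plain Python (a simplification of what Structured Streaming does): the unbounded input is sliced into small batches, and each batch is processed with ordinary batch logic.

```python
import itertools

def micro_batches(events, batch_size):
    """Slice an unbounded event iterator into small batches; each batch
    is processed with batch logic, then written to the sink."""
    it = iter(events)
    while batch := list(itertools.islice(it, batch_size)):
        yield batch

events = ["click:a", "click:b", "view:a", "click:a", "view:b"]
for batch in micro_batches(events, 2):
    # "process": count clicks in this micro-batch, then "write to sink"
    clicks = sum(1 for e in batch if e.startswith("click"))
    print(batch, "->", clicks, "clicks")
```

The trade-off is visible here: latency is bounded below by the batch interval, but each batch amortizes scheduling overhead, which is why throughput stays high.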


Apache Flink

True Streaming

Unlike Spark's micro-batch approach, Flink processes each event individually as it arrives.

Source --> [Operator] --> [Operator] --> [Operator] --> Sink
             ^               ^              ^
         (stateful)     (stateful)     (stateful)
       per-event processing with checkpointed state
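What a stateful per-event operator does can be sketched as follows (a toy model, not Flink's API): keyed state is updated on every event, and periodic snapshots play the role of the checkpoints Flink would restore from after a failure.

```python
import copy

class KeyedCountOperator:
    """Toy stateful operator: per-key running counts, updated one event
    at a time, with periodic checkpoint snapshots of the state."""
    def __init__(self):
        self.state = {}        # keyed state: key -> count
        self.checkpoints = []  # snapshots to restore from after a failure

    def process(self, key):
        # update state and emit one result per event (true streaming)
        self.state[key] = self.state.get(key, 0) + 1
        return key, self.state[key]

    def checkpoint(self):
        self.checkpoints.append(copy.deepcopy(self.state))

    def restore(self):
        # on failure, roll back to the last consistent snapshot
        self.state = copy.deepcopy(self.checkpoints[-1])

op = KeyedCountOperator()
for key in ["a", "b", "a"]:
    op.process(key)
op.checkpoint()        # state {'a': 2, 'b': 1} is now durable
op.process("a")        # state moves ahead: {'a': 3, 'b': 1}
op.restore()           # failure: roll back to the checkpoint
print(op.state)        # {'a': 2, 'b': 1}
```

In Flink, the snapshot is coordinated across all operators (barrier-based checkpointing), so the whole pipeline rolls back to one consistent point and replays the source from there.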

Event Time vs Processing Time

| Concept | Definition | Example |
|---|---|---|
| Event time | When the event actually occurred | Sensor reading timestamp |
| Processing time | When the system processes the event | Wall clock at ingestion |
| Ingestion time | When the event enters the system | Kafka append time |

Why it matters: Events arrive out of order. A click at 10:00:01 may arrive at 10:00:05. Event-time processing gives correct results despite network delays.

Watermarks

Watermarks tell Flink "all events up to time T have arrived."

Timeline:

Events:     e(9:58)  e(10:01)  e(9:59)  e(10:02)  W(10:00)  e(10:03)
                                                       ^
                                          Watermark: "no more events < 10:00"
                                          Window [9:55-10:00] can now close
  • Heuristic watermarks: Allow configurable late-event tolerance
  • Late events: Can be dropped, redirected to a side output, or trigger updates
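The timeline above can be simulated (a sketch of a heuristic watermark, not Flink's implementation): the watermark trails the maximum event time seen by a fixed lateness bound, and the window closes once the watermark passes its end.

```python
def run_with_watermarks(events, window_end, lateness=2):
    """Heuristic watermark: watermark = max event time seen - lateness.
    The window [.., window_end) closes when the watermark reaches
    window_end; window events arriving after that are 'late'."""
    max_ts = float("-inf")
    in_window, late = [], []
    closed = False
    for ts in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - lateness
        if not closed and watermark >= window_end:
            closed = True              # window can now fire its result
        if ts < window_end:
            (late if closed else in_window).append(ts)
    return in_window, late

# Event times arrive out of order; the window covers times < 10.
in_window, late = run_with_watermarks([8, 11, 9, 12, 7], window_end=10)
print(in_window, late)  # [8, 9] [7]
```

The event with timestamp 7 arrives after the watermark has passed 10, so it is classified as late; real systems then drop it, send it to a side output, or emit an updated window result.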

Flink vs Spark Streaming

| Feature | Flink | Spark Structured Streaming |
|---|---|---|
| Model | True event-at-a-time | Micro-batch |
| Latency | Milliseconds | Seconds |
| Event time | First-class support | Supported but bolt-on |
| State management | Built-in, queryable | Via state stores |
| Exactly-once | Native via checkpoints | Via idempotent sinks |
| Batch | Supported (streaming is core) | Native |
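The "idempotent sinks" row can be made concrete: if every write is keyed by a unique event id, redelivering the same event after a failure leaves the sink unchanged, which turns at-least-once delivery into exactly-once *effects* (a sketch; real sinks use upserts or transactions for this).

```python
class IdempotentSink:
    """Writes keyed by event id: replaying a delivery is a no-op."""
    def __init__(self):
        self.rows = {}   # event_id -> payload (an upsert-style store)

    def write(self, event_id, payload):
        # same id + same payload -> same final state, however many retries
        self.rows[event_id] = payload

sink = IdempotentSink()
deliveries = [("e1", 10), ("e2", 20), ("e1", 10)]  # e1 redelivered after a retry
for event_id, payload in deliveries:
    sink.write(event_id, payload)
print(sink.rows)  # {'e1': 10, 'e2': 20}
```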

ETL vs ELT

ETL:   Source --> [Transform] --> Load into Warehouse
ELT:   Source --> Load into Lake/Warehouse --> [Transform in-place]

| Aspect | ETL | ELT |
|---|---|---|
| Transform location | Dedicated ETL server | Inside the target system |
| Data loaded | Clean, structured | Raw, then transformed |
| Scalability | Limited by ETL server | Leverages warehouse compute |
| Flexibility | Schema-on-write | Schema-on-read |
| Modern trend | Legacy (Informatica) | Cloud-native (dbt, BigQuery, Snowflake) |

Data Lake vs Data Warehouse vs Data Lakehouse

+------------------+  +------------------+  +-------------------------+
|   Data Lake      |  |  Data Warehouse  |  |    Data Lakehouse       |
|                  |  |                  |  |                         |
| Raw files (S3)   |  | Structured tables|  | Raw files + table layer |
| Schema-on-read   |  | Schema-on-write  |  | ACID + schema evolution |
| Cheap storage    |  | Expensive compute|  | Best of both            |
| No ACID          |  | ACID guarantees  |  | Delta Lake / Iceberg    |
+------------------+  +------------------+  +-------------------------+

| Feature | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage cost | Low (object storage) | High (proprietary) | Low |
| Query performance | Slow without optimization | Fast (indexes, columnar) | Fast (file-level metadata) |
| ACID transactions | No | Yes | Yes |
| Data types | Any (structured, semi, unstructured) | Structured only | Any |
| Examples | S3 + Athena | Snowflake, BigQuery, Redshift | Delta Lake, Apache Iceberg, Hudi |

Change Data Capture (CDC)

CDC captures row-level changes (INSERT, UPDATE, DELETE) from a database and streams them downstream.

CDC with Debezium

+-----------+      +-----------+      +-------+      +------------+
| Postgres  | ---> | Debezium  | ---> | Kafka | ---> | Consumer   |
| (source)  |      | Connector |      |       |      | (Flink,    |
| WAL/binlog|      +-----------+      +-------+      |  Spark,    |
+-----------+                                        |  warehouse)|
                                                     +------------+

How it works:

  1. Debezium reads the database's transaction log (WAL for Postgres, binlog for MySQL)
  2. Converts each change to a structured event
  3. Publishes to Kafka topic (one per table)
  4. Downstream consumers process changes in near-real-time

Benefits:

  • No polling, no queries against source DB
  • Captures all changes including deletes
  • Preserves ordering within a partition
  • Minimal impact on source database performance
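Downstream, a consumer applies each change event to a materialized copy of the table. A minimal sketch of that apply loop (the event shape is illustrative, not Debezium's exact envelope):

```python
def apply_cdc_event(table, event):
    """Apply one row-level change event to an in-memory table replica."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        table[key] = event["row"]
    elif op == "delete":
        table.pop(key, None)   # deletes are captured too, unlike polling

replica = {}
events = [
    {"op": "insert", "key": 1, "row": {"name": "alice"}},
    {"op": "update", "key": 1, "row": {"name": "alice b"}},
    {"op": "insert", "key": 2, "row": {"name": "bob"}},
    {"op": "delete", "key": 2},
]
for e in events:
    apply_cdc_event(replica, e)
print(replica)  # {1: {'name': 'alice b'}}
```

Because events for one key are ordered within a Kafka partition, applying them in order reproduces the source table's state, which is exactly how a search index or cache is kept in sync.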

Lambda Architecture vs Kappa Architecture

Lambda Architecture

                    +-- [Batch Layer] ---> Batch View ------+
Raw Data --> Kafka -+                                       +--> Serving Layer --> Query
                    +-- [Speed Layer] ---> Real-time View --+

  • Batch layer: Recomputes complete views periodically (correctness)
  • Speed layer: Processes recent data in real-time (low latency)
  • Serving layer: Merges both views for queries

Problem: You maintain two codebases (batch and stream) that implement the same logic, and they can drift apart.

Kappa Architecture

Raw Data --> Kafka --> [Stream Processor] --> Serving Layer --> Query
                ^                                  |
                +--- Replay from Kafka if needed --+

  • Single processing path: Everything is a stream
  • Reprocessing: Replay the Kafka log with updated logic
  • Simpler: One codebase, one pipeline
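Reprocessing in Kappa is just rereading the retained log with new logic. A toy sketch, with a list standing in for the Kafka topic:

```python
log = ["click", "view", "click", "purchase", "click"]  # retained event log

def process(events, logic):
    """One stream job: fold the log through the current business logic.
    logic(event) returns a key to count, or None to skip the event."""
    counts = {}
    for e in events:
        key = logic(e)
        if key is not None:
            counts[key] = counts.get(key, 0) + 1
    return counts

v1 = process(log, lambda e: e if e == "click" else None)  # original logic
v2 = process(log, lambda e: e)                            # updated logic, same log
print(v1)  # {'click': 3}
print(v2)  # {'click': 3, 'view': 1, 'purchase': 1}
```

The prerequisite this sketch hides is log retention: Kafka must keep (or compact) the history long enough that a replay can rebuild the serving view from scratch.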

Comparison

| Aspect | Lambda | Kappa |
|---|---|---|
| Complexity | High (two pipelines) | Lower (one pipeline) |
| Correctness | Batch ensures accuracy | Relies on stream correctness |
| Reprocessing | Native (batch recomputes) | Replay from log |
| Latency | Real-time via speed layer | Real-time native |
| Use case | When batch correctness is critical | When stream processing is sufficient |

Choosing Batch vs Stream

Is low latency required (< 1 min)?
  |
  YES --> Stream processing
  |         |
  |         Is exactly-once critical?
  |           YES --> Flink with checkpointing
  |           NO  --> Kafka Streams / simple consumer
  |
  NO --> Is the dataset bounded?
           YES --> Batch (Spark, traditional ETL)
           NO  --> Micro-batch (Spark Structured Streaming)

Decision Factors

| Factor | Favors Batch | Favors Stream |
|---|---|---|
| Latency tolerance | Hours/minutes OK | Seconds/sub-second needed |
| Data completeness | Need full picture | Partial/approximate OK |
| Resource cost | Cheaper (bursty compute) | Continuous compute cost |
| Complexity budget | Lower (bounded input) | Higher (state, ordering) |
| Use case examples | Daily reports, ML training | Fraud detection, live feed |

Interview Tips

  1. Don't default to one approach. State requirements first, then justify batch vs stream.
  2. Know Spark vs Flink trade-offs cold. Interviewers at data-heavy companies (Meta, Netflix) will probe here.
  3. CDC is a modern must-know. "How do you keep the search index in sync with the database?" = CDC + Kafka.
  4. Lambda vs Kappa is a classic. Know why Kappa simplifies but when Lambda is still justified.
  5. Always mention exactly-once semantics. Show you understand the hard problem in stream processing.
  6. Data lakehouse is the current trend. Delta Lake / Iceberg show you follow the industry.

Common Interview Questions

  • "Design a real-time analytics pipeline for ad click tracking"
  • "How would you rebuild a search index from a database?"
  • "Your daily batch job takes 8 hours. Users want fresher data. What do you do?"
  • "Explain the difference between event time and processing time with an example"
  • "How does exactly-once processing work in Kafka + Flink?"

Resources

  • DDIA Chapter 10: Batch Processing (MapReduce, dataflow engines)
  • DDIA Chapter 11: Stream Processing (event streams, state, joins)
  • Spark: The Definitive Guide - Chambers & Zaharia
  • Streaming Systems - Akidau, Chernyak, Lax (the watermarks bible)
  • Confluent blog: "Putting Apache Kafka to Use" series
  • Martin Kleppmann's talks on event sourcing and stream processing

Previous: 22 - TODO | Next: 24 - Reliability Engineering