46 - ML System Design Basics

Previous: 45 - Geo-Distributed Systems | Next: 47 - Interview Strategy & Frameworks


1. Why ML System Design in FAANG Interviews?

FAANG companies are ML-heavy. Many system design questions involve ML components:

| Company | ML-Centric Products                                            |
|---------|----------------------------------------------------------------|
| Google  | Search ranking, Ads click prediction, YouTube recommendations  |
| Meta    | News Feed ranking, ad targeting, content moderation            |
| Amazon  | Product recommendations, Alexa, fraud detection                |
| Netflix | Content recommendations, thumbnail personalization             |
| Apple   | Siri, Photos search, App Store ranking                         |

Interviewers test whether you can design the infrastructure around ML models, not whether you can derive backpropagation. You need to understand the full pipeline from data to serving.


2. The ML System Design Pipeline

+-------------+     +----------------+     +-------------+
| Data        |     | Feature        |     | Model       |
| Collection  | --> | Engineering    | --> | Training    |
| & Storage   |     | & Feature Store|     |             |
+-------------+     +----------------+     +------+------+
                                                  |
                                           +------v------+
                                           | Model       |
                                           | Evaluation  |
                                           +------+------+
                                                  |
                                           +------v------+
                                           | Model       |
                                           | Registry    |
                                           +------+------+
                                                  |
                                           +------v------+     +-------------+
                                           | Model       | --> | Monitoring  |
                                           | Serving     |     | & Feedback  |
                                           +-------------+     +-------------+

3. Feature Store

The feature store is a centralized system for managing, storing, and serving ML features.

                    +-------------------+
                    |  Feature Store    |
                    +-------------------+
                    |                   |
          +---------v------+   +--------v--------+
          | Offline Store  |   | Online Store    |
          | (training)     |   | (serving)       |
          | - Hive/S3/BQ   |   | - Redis/DynamoDB|
          | - batch compute|   | - low latency   |
          | - historical   |   | - point lookups |
          +----------------+   +-----------------+

Feature pipeline:
  Raw data (logs, events, DBs)
       |
  Feature computation (Spark, Flink)
       |
  +----+---- Offline store (for training batch jobs)
  |
  +--------- Online store (for real-time inference)

Why a Feature Store?

| Problem Without a Feature Store | Solution With a Feature Store                               |
|---------------------------------|-------------------------------------------------------------|
| Training/serving skew           | Same feature definitions used in both                       |
| Duplicated feature logic        | Centralized, reusable feature definitions                   |
| Stale features                  | Pipeline ensures freshness                                  |
| Point-in-time correctness       | Time-travel queries for training data (avoid label leakage) |
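To make the training/serving-skew point concrete, here is a minimal sketch (all names are hypothetical, not a real feature-store API) in which a single shared feature definition is materialized to both the offline and online stores, so training and serving compute the feature identically:

```python
from datetime import datetime, timedelta

def purchase_count_30d(purchases, as_of):
    """Shared feature logic: purchases in the 30 days before `as_of`."""
    cutoff = as_of - timedelta(days=30)
    return sum(1 for p in purchases if cutoff <= p["ts"] < as_of)

class TinyFeatureStore:
    def __init__(self):
        self.online = {}   # stand-in for Redis/DynamoDB point lookups
        self.offline = []  # stand-in for Hive/S3 historical, timestamped rows

    def materialize(self, user_id, purchases, as_of):
        value = purchase_count_30d(purchases, as_of)
        self.online[user_id] = value                  # latest value for serving
        self.offline.append((user_id, as_of, value))  # point-in-time row for training

    def get_online(self, user_id):
        return self.online.get(user_id, 0)

store = TinyFeatureStore()
now = datetime(2024, 1, 31)
purchases = [{"ts": datetime(2024, 1, 10)}, {"ts": datetime(2023, 11, 1)}]
store.materialize("u1", purchases, as_of=now)
print(store.get_online("u1"))  # 1: only the January purchase is within 30 days
```

Because the offline rows carry the `as_of` timestamp, a training job can reconstruct what the feature value was at label time, which is what point-in-time correctness means in practice.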

Real-World Feature Stores

| System       | Company     | Type                     |
|--------------|-------------|--------------------------|
| Feast        | Open source | Offline + online         |
| Tecton       | Startup     | Managed feature platform |
| Michelangelo | Uber        | Internal                 |
| Zipline      | Airbnb      | Internal                 |

4. Model Training Pipeline

  +------------------+
  | 1. Data Collection|
  | - Event logs      |
  | - User actions    |
  | - Clickstream     |
  +--------+---------+
           |
  +--------v---------+
  | 2. Data Processing|
  | - Clean, dedupe   |
  | - Join tables     |
  | - Handle missing  |
  | - Train/val/test  |
  +--------+---------+
           |
  +--------v---------+
  | 3. Feature Eng.   |
  | - Numerical: norm |
  | - Categorical: enc|
  | - Text: embed     |
  | - Time: windows   |
  +--------+---------+
           |
  +--------v---------+
  | 4. Model Training |
  | - Select algorithm|
  | - Hyperparameter  |
  |   tuning          |
  | - Cross-validation|
  | - Distributed     |
  |   training (GPU)  |
  +--------+---------+
           |
  +--------v---------+
  | 5. Evaluation     |
  | - Offline metrics |
  |   (AUC, NDCG, F1)|
  | - Bias/fairness   |
  | - Latency budget  |
  +--------+---------+
           |
  +--------v---------+
  | 6. Model Registry |
  | - Version control |
  | - Metadata: params|
  |   metrics, lineage|
  | - Approval gate   |
  +------------------+

Key Training Considerations

| Concern              | Detail                                                                           |
|----------------------|----------------------------------------------------------------------------------|
| Training data volume | TB-scale at FAANG; use distributed training (Spark, Ray, Horovod)                |
| Freshness            | How often to retrain? Daily, weekly, or continuously?                            |
| Label quality        | Implicit labels (clicks) are noisy; explicit labels (ratings) are sparse         |
| Data leakage         | Never use future data to predict past events; use time-based splits              |
| Class imbalance      | Click-through rate is 1-2% positive; use downsampling, SMOTE, or a weighted loss |
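Two of these concerns, time-based splits against leakage and weighting against class imbalance, can be sketched in a few lines (illustrative helpers, not a library API):

```python
def time_based_split(rows, split_ts):
    """Train on events strictly before split_ts, validate on the rest,
    so the model never sees the future at training time."""
    train = [r for r in rows if r["ts"] < split_ts]
    val = [r for r in rows if r["ts"] >= split_ts]
    return train, val

def positive_class_weight(labels):
    """Weight positives by the neg/pos ratio, e.g. for a weighted loss."""
    pos = sum(labels)
    neg = len(labels) - pos
    return neg / pos if pos else 1.0

rows = [
    {"ts": 1, "clicked": 0}, {"ts": 2, "clicked": 0},
    {"ts": 3, "clicked": 1}, {"ts": 4, "clicked": 0},
    {"ts": 5, "clicked": 0},
]
train, val = time_based_split(rows, split_ts=4)
print(len(train), len(val))  # 3 2
print(positive_class_weight([r["clicked"] for r in train]))  # 2.0
```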

5. Model Serving: Batch vs Real-Time

Batch Inference:                    Real-Time Inference:
  Run model on all items            Run model per request
  periodically (hourly/daily)       on-demand
                                    
  +----------+     +--------+       Request --> +--------+ --> Response
  | All items| --> | Model  |                   | Model  |
  | (batch)  |     | Server |       User query  | Server |  Top-K results
  +----------+     +---+----+       + features   +--------+
                       |
                  +----v-----+
                  | Result   |
                  | Store    |
                  | (Redis)  |
                  +----------+

| Aspect          | Batch                                | Real-Time                       |
|-----------------|--------------------------------------|---------------------------------|
| Latency         | Pre-computed, instant serving        | 10-100 ms inference per request |
| Freshness       | Stale (hours old)                    | Live (current context)          |
| Cost            | Cheaper (run once)                   | Expensive (GPU per request)     |
| Personalization | Limited (no session context)         | Rich (current session, query)   |
| Use case        | Email recommendations, weekly digest | Search ranking, feed ranking    |

Hybrid Approach (Common in Practice)

Stage 1: Batch - Generate candidate pool (1000s of items per user)
         Run nightly, store in Redis/DynamoDB

Stage 2: Real-time - Re-rank candidates using live features
         User opens app -> fetch candidates -> re-rank with current context
         Return top 20

This avoids scoring all items in real-time (too slow)
while incorporating live signals (session behavior, time of day)
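A minimal sketch of this hybrid flow, with hypothetical names and a toy scoring rule standing in for the batch model and the live re-ranker:

```python
CANDIDATE_STORE = {}  # stand-in for Redis/DynamoDB, keyed by user_id

def nightly_batch_job(all_items, user_id):
    # Stage 1: precompute a candidate pool (here, top items by a batch score).
    ranked = sorted(all_items, key=lambda i: i["batch_score"], reverse=True)
    CANDIDATE_STORE[user_id] = ranked[:1000]

def serve_request(user_id, session_context, k=20):
    # Stage 2: fetch precomputed candidates and re-rank with live signals.
    candidates = CANDIDATE_STORE.get(user_id, [])
    boost = session_context.get("category_affinity", {})
    rescored = sorted(
        candidates,
        key=lambda i: i["batch_score"] + boost.get(i["category"], 0.0),
        reverse=True,
    )
    return rescored[:k]

items = [
    {"id": 1, "category": "sports", "batch_score": 0.9},
    {"id": 2, "category": "news", "batch_score": 0.8},
    {"id": 3, "category": "news", "batch_score": 0.7},
]
nightly_batch_job(items, user_id="u1")
top = serve_request("u1", {"category_affinity": {"news": 0.3}}, k=2)
print([i["id"] for i in top])  # [2, 3]: live news affinity reorders the pool
```

The expensive scoring happens once per night; the request path only re-scores a small, precomputed pool.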

6. A/B Testing Infrastructure

                    User Request
                         |
                    +----v-----+
                    | Feature  |
                    | Flags /  |
                    | Experiment|
                    | Service  |
                    +----+-----+
                         |
              +----------+----------+
              |                     |
        Control (50%)         Treatment (50%)
              |                     |
        Model v1              Model v2
              |                     |
        +-----v------+       +-----v------+
        | Track:     |       | Track:     |
        | - clicks   |       | - clicks   |
        | - revenue  |       | - revenue  |
        | - sessions |       | - sessions |
        +------------+       +------------+
                     \         /
                  +---------------+
                  | Stats Engine  |
                  | (p-value, CI) |
                  +---------------+

A/B Testing Key Concepts

| Concept                  | Detail                                                                       |
|--------------------------|------------------------------------------------------------------------------|
| Metric hierarchy         | Primary (e.g., revenue), secondary (e.g., clicks), guardrail (e.g., latency) |
| Statistical significance | p-value < 0.05; typically needs 1-2 weeks of data                            |
| Sample size              | Enough traffic per variant for statistical power                             |
| Novelty effect           | New models may perform better initially; measure long-term impact            |
| Network effects          | Social features need cluster-based (not user-based) randomization            |
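As a sketch of what the stats engine computes, here is a two-sided two-proportion z-test on CTR (a common choice for binary metrics; the function name and the example numbers are illustrative):

```python
from statistics import NormalDist

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in click rates between variants."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)        # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Control: 1.0% CTR, Treatment: 1.2% CTR, 50k impressions each.
z, p = two_proportion_z_test(clicks_a=500, n_a=50_000, clicks_b=600, n_b=50_000)
print(round(z, 2), round(p, 4))  # significant at p < 0.05
```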

7. Feature Engineering at Scale

Common Feature Types

| Type               | Examples                                            | Computation                  |
|--------------------|-----------------------------------------------------|------------------------------|
| User features      | Age, country, account age, purchase history         | Batch (daily update)         |
| Item features      | Category, price, rating, description embedding      | Batch                        |
| Cross features     | User-item interaction count, user-category affinity | Batch or streaming           |
| Context features   | Time of day, device type, current session length    | Real-time                    |
| Embedding features | User embedding, item embedding (learned)            | Training-time, served online |

Feature Computation Patterns

Batch features (Spark/BigQuery, daily):
  user_purchase_count_30d = COUNT(purchases WHERE date > now - 30d)
  item_avg_rating = AVG(ratings) GROUP BY item_id

Streaming features (Flink/Kafka Streams, real-time):
  user_click_count_last_5min = COUNT(clicks WHERE timestamp > now - 5min)
  trending_score = COUNT(views_last_1h) / COUNT(views_last_24h)

Near-real-time features (micro-batch, 5-15 min):
  user_session_length = now - session_start
  cart_total = SUM(item prices in current cart)
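A streaming feature such as user_click_count_last_5min can be sketched as a per-user sliding window, a toy in-memory version of the state that Flink or Kafka Streams would maintain:

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events in the trailing window; evicts expired timestamps."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps of events still inside the window

    def record(self, ts):
        self.events.append(ts)
        self._evict(ts)

    def count(self, now):
        self._evict(now)
        return len(self.events)

    def _evict(self, now):
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()

clicks = SlidingWindowCounter(window_seconds=300)
for ts in [0, 100, 200, 400]:
    clicks.record(ts)
print(clicks.count(now=420))  # 2: clicks at t=0 and t=100 have expired
```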

8. Recommendation Systems

The most common ML system design question in FAANG interviews.

Approaches

1. Collaborative Filtering:
   "Users who liked X also liked Y"
   
   User-Item Matrix:
              Item1  Item2  Item3  Item4
   User A  [  5      3      ?      1   ]
   User B  [  4      ?      4      1   ]
   User C  [  ?      3      5      ?   ]
   
   Find similar users (cosine similarity) -> predict missing ratings

2. Content-Based:
   "Since you liked sci-fi movie X, here's another sci-fi movie"
   
   Item features -> similarity in feature space
   User profile = aggregation of liked items' features

3. Hybrid (most practical):
   Combine collaborative + content signals
   Deep learning models (Two-Tower, DeepFM, Wide & Deep)
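User-based collaborative filtering on the matrix above can be sketched as follows: cosine similarity over co-rated items, then a similarity-weighted average to fill in a missing rating (a toy illustration with `None` for unknown ratings, not production code):

```python
import math

ratings = {
    "A": [5, 3, None, 1],
    "B": [4, None, 4, 1],
    "C": [None, 3, 5, None],
}

def cosine(u, v):
    """Cosine similarity restricted to items both users have rated."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return 0.0
    dot = sum(a * b for a, b in pairs)
    nu = math.sqrt(sum(a * a for a, _ in pairs))
    nv = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (nu * nv)

def predict(user, item_idx):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, row in ratings.items():
        if other == user or row[item_idx] is None:
            continue
        sim = cosine(ratings[user], ratings[other])
        num += sim * row[item_idx]
        den += abs(sim)
    return num / den if den else None

print(round(predict("A", 2), 2))  # 4.5: User A's predicted rating for Item3
```

At FAANG scale this naive all-pairs approach is replaced by matrix factorization or learned embeddings with approximate nearest-neighbor search, but the intuition is the same.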

Two-Stage Recommendation Architecture

  User opens app
       |
  +----v----------+
  | Candidate      |  Stage 1: Retrieve ~1000 candidates from millions
  | Generation     |  Methods: ANN search, collaborative filtering,
  | (Recall)       |           popularity, category-based
  +----+-----------+
       |  ~1000 candidates
  +----v----------+
  | Ranking        |  Stage 2: Score and rank 1000 candidates
  | (Precision)    |  Model: Deep neural network with rich features
  +----+-----------+  (user, item, context, cross features)
       |  ranked list
  +----v----------+
  | Re-ranking     |  Stage 3: Business rules, diversity, freshness
  | (Policy)       |  Remove duplicates, enforce content policy
  +----+-----------+  Inject sponsored content
       |
   +----v-----------+
   | Final Top-K    |  Return 20-50 items to user
   +----------------+

9. Model Monitoring

                Model in Production
                       |
          +------------+------------+
          |            |            |
    +-----v-----+ +---v------+ +--v---------+
    | Data      | | Model    | | Business   |
    | Quality   | | Perf.    | | Metrics    |
    | Monitoring| | Monitor  | | Monitor    |
    +-----------+ +----------+ +------------+
    - Missing    - Prediction   - CTR
      features     distribution  - Revenue
    - Schema      shifts        - User
      violations  - Accuracy      engagement
    - Volume       degradation
      anomalies

Key Monitoring Concepts

| Concept                 | What to Track                                           | Action                                   |
|-------------------------|---------------------------------------------------------|------------------------------------------|
| Data drift              | Feature distributions shift from training data          | Retrain the model                        |
| Concept drift           | Relationship between features and labels changes        | Retrain with recent data                 |
| Model degradation       | Offline metrics (AUC) drop over time                    | Trigger the retraining pipeline          |
| Prediction distribution | Model outputs shift (e.g., predicting 0 for everything) | Alert and investigate                    |
| Latency                 | Inference time exceeds budget                           | Optimize the model, scale infrastructure |
| Staleness               | Features are too old                                    | Fix the feature pipeline                 |
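One common way to quantify data drift is the Population Stability Index (PSI) between a feature's training-time distribution and its live distribution; the sketch below assumes pre-binned histograms, and the 0.2 alert threshold is a widely used rule of thumb rather than a universal constant:

```python
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """PSI over pre-binned distributions (bin fractions summing to ~1)."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

train_dist = [0.25, 0.25, 0.25, 0.25]  # feature histogram at training time
live_dist = [0.10, 0.20, 0.30, 0.40]   # same histogram in production
score = psi(train_dist, live_dist)
print(round(score, 3), "drift!" if score > 0.2 else "ok")  # 0.228 drift!
```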

Interview Tip

Always mention monitoring in ML system design: a model that isn't monitored will silently degrade. Say that you would track both model metrics (AUC, precision) and business metrics (CTR, revenue), because a model can have a great AUC yet still hurt business metrics through feedback loops or population shift.


10. MLOps: CI/CD for ML

Traditional CI/CD:
  Code change -> Build -> Test -> Deploy

ML CI/CD:
  Code change ──┐
  Data change ──┼──> Validate -> Train -> Evaluate -> Deploy -> Monitor
  Config change ┘

Pipeline orchestration: Airflow, Kubeflow Pipelines, Metaflow
Model registry: MLflow, Vertex AI Model Registry
Experiment tracking: MLflow, Weights & Biases
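The "approval gate" step mentioned under the model registry can be sketched as a simple deployment check: a newly trained model ships only if it beats the registered baseline on the offline metric and stays within the latency guardrail (registry contents, thresholds, and names here are all hypothetical):

```python
# Stand-in for a model registry entry for the current production model.
MODEL_REGISTRY = {"prod": {"version": 3, "auc": 0.81, "p99_latency_ms": 45}}

def approve_deployment(candidate, min_auc_gain=0.002, max_p99_ms=50):
    """Gate: require a minimum AUC improvement and a latency budget."""
    baseline = MODEL_REGISTRY["prod"]
    if candidate["auc"] < baseline["auc"] + min_auc_gain:
        return False, "AUC did not improve enough over baseline"
    if candidate["p99_latency_ms"] > max_p99_ms:
        return False, "exceeds latency guardrail"
    return True, "approved"

ok, reason = approve_deployment({"version": 4, "auc": 0.82, "p99_latency_ms": 48})
print(ok, reason)  # True approved
```

In a real pipeline this gate would run automatically after evaluation, with the candidate then promoted via canary or shadow deployment rather than a direct swap.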

MLOps Maturity Levels

| Level | Description  | Automation                                                      |
|-------|--------------|-----------------------------------------------------------------|
| 0     | Manual       | Scientists train locally and hand the model file to engineering |
| 1     | ML pipeline  | Automated training pipeline, manual deployment                  |
| 2     | CI/CD for ML | Automated training, testing, and deployment                     |
| 3     | Full MLOps   | Automated monitoring, retraining triggers, canary deployment    |

11. Common ML System Design Interview Questions

| Question                           | Key Focus Areas                                                                     |
|------------------------------------|-------------------------------------------------------------------------------------|
| Design a recommendation engine     | Two-stage (retrieval + ranking), feature store, A/B testing                         |
| Design ad click prediction         | Real-time inference, feature engineering, calibration, auction integration         |
| Design content ranking (News Feed) | Multi-objective ranking, freshness vs. relevance, diversity                         |
| Design search ranking              | Query understanding, document retrieval, learning-to-rank                           |
| Design content moderation          | Multi-modal (text, image, video), human-in-the-loop, precision vs. recall trade-off |
| Design fraud detection             | Imbalanced data, real-time scoring, explainability for review                       |
| Design a notification system       | When/what to notify, user fatigue modeling, optimal send time                       |
| Design similar items               | Embedding space, ANN (approximate nearest neighbor), cold start                     |

Framework for Answering ML System Design

1. Clarify the Problem (2 min)
   - What are we optimizing? (clicks, revenue, engagement)
   - Scale: users, items, QPS
   - Latency requirements

2. Define Metrics (3 min)
   - Offline: AUC, NDCG, precision@K, recall@K
   - Online: CTR, conversion rate, session length, revenue
   - Guardrail: latency, diversity, fairness

3. High-Level Architecture (10 min)
   - Data pipeline -> Feature store -> Training -> Serving
   - Candidate generation -> Ranking -> Re-ranking
   - Online vs offline components

4. Deep Dive on Key Components (15 min)
   - Feature engineering (what signals matter?)
   - Model choice and training
   - Serving infrastructure (batch vs real-time)
   - Handling cold start

5. Monitoring & Iteration (5 min)
   - A/B testing setup
   - Model monitoring (drift, degradation)
   - Feedback loops

12. Key Trade-offs Discussion

| Decision         | Option A                                   | Option B                                                           |
|------------------|--------------------------------------------|--------------------------------------------------------------------|
| Serving          | Batch (simple, stale)                      | Real-time (complex, fresh)                                         |
| Model complexity | Simple (logistic regression, fast)         | Deep model (better accuracy, slower)                               |
| Features         | Few curated features (fast, interpretable) | Many auto-generated features (better performance, harder to debug) |
| Retraining       | Daily (stable, slightly stale)             | Continuous (freshest, complex infrastructure)                      |
| Embedding search | Exact (slow, perfect)                      | ANN/HNSW (fast, approximate)                                       |
| Personalization  | Global model (simpler)                     | Per-user model (better, expensive)                                 |

13. Interview Checklist

  • Clarified optimization objective and scale
  • Described the full pipeline (data -> features -> training -> serving -> monitoring)
  • Explained feature store (online + offline, training-serving consistency)
  • Discussed model serving strategy (batch vs real-time vs hybrid)
  • Covered A/B testing infrastructure
  • Addressed recommendation system architecture (retrieval + ranking)
  • Mentioned model monitoring (data drift, concept drift, business metrics)
  • Discussed cold-start problem and solutions
  • Outlined MLOps (CI/CD for ML, automated retraining)

14. Resources

  • Book: "Designing Machine Learning Systems" by Chip Huyen (O'Reilly)
  • Book: "Machine Learning System Design Interview" by Ali Aminian & Alex Xu
  • Stanford CS 329S -- Machine Learning Systems Design (course materials online)
  • Eugene Yan's blog -- eugeneyan.com (ML systems at Amazon)
  • Paper: "Wide & Deep Learning for Recommender Systems" (Google, 2016)
  • Paper: "Deep Neural Networks for YouTube Recommendations" (Google, 2016)
  • YouTube: Stanford MLSys Seminars
  • Feast documentation -- feast.dev (open-source feature store)
