46 - ML System Design Basics
Previous: 45 - Geo-Distributed Systems | Next: 47 - Interview Strategy & Frameworks
1. Why ML System Design in FAANG Interviews?
FAANG companies are ML-heavy. Many system design questions involve ML components:
| Company | ML-Centric Products |
|---|---|
| Google | Search ranking, Ads click prediction, YouTube recommendations |
| Meta | News Feed ranking, Ad targeting, content moderation |
| Amazon | Product recommendations, Alexa, fraud detection |
| Netflix | Content recommendations, thumbnail personalization |
| Apple | Siri, Photos search, App Store ranking |
Interviewers test whether you can design the infrastructure around ML models, not whether you can derive backpropagation. You need to understand the full pipeline from data to serving.
2. The ML System Design Pipeline
+-------------+ +----------------+ +-------------+
| Data | | Feature | | Model |
| Collection | --> | Engineering | --> | Training |
| & Storage | | & Feature Store| | |
+-------------+ +----------------+ +------+------+
|
+------v------+
| Model |
| Evaluation |
+------+------+
|
+------v------+
| Model |
| Registry |
+------+------+
|
+------v------+ +-------------+
| Model | --> | Monitoring |
| Serving | | & Feedback |
+-------------+ +-------------+
3. Feature Store
The feature store is a centralized system for managing, storing, and serving ML features.
+-------------------+
| Feature Store |
+-------------------+
| |
+---------v------+ +--------v--------+
| Offline Store | | Online Store |
| (training) | | (serving) |
| - Hive/S3/BQ | | - Redis/DynamoDB|
| - batch compute| | - low latency |
| - historical | | - point lookups |
+----------------+ +-----------------+
Feature pipeline:
Raw data (logs, events, DBs)
|
Feature computation (Spark, Flink)
|
+----+---- Offline store (for training batch jobs)
|
+--------- Online store (for real-time inference)
Why a Feature Store?
| Problem Without | Solution With Feature Store |
|---|---|
| Training/serving skew | Same feature definitions used in both |
| Duplicated feature logic | Centralized, reusable feature definitions |
| Stale features | Pipeline ensures freshness |
| Point-in-time correctness | Time-travel queries for training data (avoid label leakage) |
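To make point-in-time correctness concrete, the sketch below joins labels to the latest feature values known at each label's timestamp using pandas `merge_asof`. The data, column names, and the `purchase_count_30d` feature are all made up for illustration; a real feature store does this join for you.

```python
import pandas as pd

# Hypothetical training labels: each row is (user, label timestamp, label).
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [1, 0, 1],
})

# Feature snapshots computed daily; feature_time is when the value became known.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-08"]),
    "purchase_count_30d": [3, 7, 2],
})

# merge_asof picks, per label row, the latest feature value at or before the
# label's timestamp -- never a future value, which would leak the label.
training_set = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time", right_on="feature_time",
    by="user_id", direction="backward",
)
print(training_set[["user_id", "label", "purchase_count_30d"]])
```

A naive join on `user_id` alone would attach the 2024-01-15 feature value to the 2024-01-05 label, which is exactly the leakage this pattern prevents.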
Real-World Feature Stores
| System | Company | Type |
|---|---|---|
| Feast | Open source | Offline + online |
| Tecton | Startup | Managed feature platform |
| Michelangelo | Uber | Internal |
| Zipline | Airbnb | Internal |
4. Model Training Pipeline
+------------------+
| 1. Data Collection|
| - Event logs |
| - User actions |
| - Clickstream |
+--------+---------+
|
+--------v---------+
| 2. Data Processing|
| - Clean, dedupe |
| - Join tables |
| - Handle missing |
| - Train/val/test |
+--------+---------+
|
+--------v---------+
| 3. Feature Eng. |
| - Numerical: norm |
| - Categorical: enc|
| - Text: embed |
| - Time: windows |
+--------+---------+
|
+--------v---------+
| 4. Model Training |
| - Select algorithm|
| - Hyperparameter |
| tuning |
| - Cross-validation|
| - Distributed |
| training (GPU) |
+--------+---------+
|
+--------v---------+
| 5. Evaluation |
| - Offline metrics |
| (AUC, NDCG, F1)|
| - Bias/fairness |
| - Latency budget |
+--------+---------+
|
+--------v---------+
| 6. Model Registry |
| - Version control |
| - Metadata: params|
| metrics, lineage|
| - Approval gate |
+------------------+
Key Training Considerations
| Concern | Detail |
|---|---|
| Training data volume | TB-scale for FAANG; use distributed training (Spark, Ray, Horovod) |
| Freshness | How often to retrain? Daily, weekly, or continuous? |
| Label quality | Implicit labels (clicks) are noisy; explicit labels (ratings) are sparse |
| Data leakage | Never use future data to predict past events; time-based splits |
| Class imbalance | Click-through rate: 1-2% positive; use downsampling, SMOTE, or weighted loss |
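As a minimal sketch of the weighted-loss option from the class-imbalance row: plain NumPy, a toy ~2%-positive dataset mimicking CTR data, and a hand-rolled weighted binary cross-entropy (the function and data are illustrative, not from any specific library).

```python
import numpy as np

# Toy CTR-like data: ~2% positives, mirroring the imbalance described above.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.02).astype(float)
p = np.full_like(y, 0.02)  # a model predicting the base rate everywhere

def weighted_bce(y_true, y_pred, pos_weight):
    """Binary cross-entropy with the positive class up-weighted."""
    eps = 1e-12
    loss = -(pos_weight * y_true * np.log(y_pred + eps)
             + (1 - y_true) * np.log(1 - y_pred + eps))
    return loss.mean()

# Up-weighting positives by the inverse class ratio makes the rare clicks
# contribute on par with the abundant non-clicks.
pos_weight = (1 - y.mean()) / y.mean()
print(weighted_bce(y, p, pos_weight=1.0))        # unweighted
print(weighted_bce(y, p, pos_weight=pos_weight))  # positives dominate the loss
```

Most frameworks expose the same idea directly (e.g. a positive-class weight parameter on their BCE loss), so in an interview it is usually enough to name the technique.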
5. Model Serving: Batch vs Real-Time
Batch Inference: Real-Time Inference:
Run model on all items Run model per request
periodically (hourly/daily) on-demand
+----------+ +--------+ Request --> +--------+ --> Response
| All items| --> | Model | | Model |
| (batch) | | Server | User query | Server | Top-K results
+----------+ +---+----+ + features +--------+
|
+----v-----+
| Result |
| Store |
| (Redis) |
+----------+
| Aspect | Batch | Real-Time |
|---|---|---|
| Latency | Pre-computed, instant serving | 10-100ms inference per request |
| Freshness | Stale (hours old) | Live (current context) |
| Cost | Cheaper (run once) | Expensive (GPU per request) |
| Personalization | Limited (no session context) | Rich (current session, query) |
| Use case | Email recommendations, weekly digest | Search ranking, feed ranking |
Hybrid Approach (Common in Practice)
Stage 1: Batch - Generate candidate pool (1000s of items per user)
Run nightly, store in Redis/DynamoDB
Stage 2: Real-time - Re-rank candidates using live features
User opens app -> fetch candidates -> re-rank with current context
Return top 20
This avoids scoring all items in real-time (too slow)
while incorporating live signals (session behavior, time of day)
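The hybrid pattern above can be sketched in a few lines. Everything here is a stand-in: a plain dict plays the role of Redis, and the boost rule is an arbitrary toy for "re-rank with live context".

```python
# Batch-precomputed candidates per user, written nightly (dict standing in
# for Redis/DynamoDB). Each entry is (item_id, batch_model_score).
candidate_store = {
    "user_42": [("item_a", 0.61), ("item_b", 0.58), ("item_c", 0.55)],
}

def rerank(user_id, session_category, top_k=2):
    """Stage 2: re-rank the cached candidates using a live session signal."""
    candidates = candidate_store.get(user_id, [])

    def live_score(item_id, batch_score):
        # Toy live feature: boost items matching the current session's category.
        boost = 0.2 if item_id.endswith(session_category) else 0.0
        return batch_score + boost

    ranked = sorted(candidates, key=lambda c: live_score(*c), reverse=True)
    return [item for item, _ in ranked[:top_k]]

# The live boost lifts item_c above the batch-time favorites.
print(rerank("user_42", session_category="c"))
```

The key property to call out in an interview: the expensive scoring of millions of items happened offline, and the request path only touches a cache lookup plus a cheap re-rank over ~1000 items.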
6. A/B Testing Infrastructure
User Request
|
+----v-----+
| Feature |
| Flags / |
| Experiment|
| Service |
+----+-----+
|
+----------+----------+
| |
Control (50%) Treatment (50%)
| |
Model v1 Model v2
| |
+-----v------+ +-----v------+
| Track: | | Track: |
| - clicks | | - clicks |
| - revenue | | - revenue |
| - sessions | | - sessions |
+------------+ +------------+
\ /
           +---------------+
           | Stats Engine  |
           | (p-value, CI) |
           +---------------+
A/B Testing Key Concepts
| Concept | Detail |
|---|---|
| Metric hierarchy | Primary (e.g., revenue), secondary (e.g., clicks), guardrail (e.g., latency) |
| Statistical significance | p-value < 0.05, typically need 1-2 weeks of data |
| Sample size | Enough traffic per variant for statistical power |
| Novelty effect | New models may perform better initially; measure long-term impact |
| Network effects | Social features need cluster-based (not user-based) randomization |
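The stats engine's significance check for a CTR experiment is typically a two-proportion z-test. Below is a self-contained sketch using the standard pooled-variance formula; the traffic numbers are invented for illustration.

```python
import math

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in CTR between control and treatment."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Control: 2.0% CTR; treatment: 2.3% CTR; 100k users per arm.
z, p = two_proportion_ztest(2000, 100_000, 2300, 100_000)
print(f"z={z:.2f}, p={p:.4f}")
```

Note how much traffic this takes: a 0.3-percentage-point CTR lift needs on the order of 100k users per arm to clear p < 0.05, which is why the table above says experiments typically run for 1-2 weeks.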
7. Feature Engineering at Scale
Common Feature Types
| Type | Examples | Computation |
|---|---|---|
| User features | Age, country, account age, purchase history | Batch (daily update) |
| Item features | Category, price, rating, description embedding | Batch |
| Cross features | User-item interaction count, user-category affinity | Batch or streaming |
| Context features | Time of day, device type, current session length | Real-time |
| Embedding features | User embedding, item embedding (learned) | Training-time, served online |
Feature Computation Patterns
Batch features (Spark/BigQuery, daily):
user_purchase_count_30d = COUNT(purchases WHERE date > now - 30d)
item_avg_rating = AVG(ratings) GROUP BY item_id
Streaming features (Flink/Kafka Streams, real-time):
user_click_count_last_5min = COUNT(clicks WHERE timestamp > now - 5min)
trending_score = COUNT(views_last_1h) / COUNT(views_last_24h)
Near-real-time features (micro-batch, 5-15 min):
user_session_length = now - session_start
cart_total = SUM(item prices in current cart)
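The streaming pattern above (e.g. `user_click_count_last_5min`) can be illustrated with a toy in-memory sliding-window counter. A production system would keep this state in Flink or Kafka Streams as noted; the class and names here are made up.

```python
from collections import deque

WINDOW_SECONDS = 5 * 60  # the 5-minute window from the example above

class SlidingClickCounter:
    """Toy per-user sliding-window click counter (timestamps in seconds)."""

    def __init__(self):
        self.clicks = {}  # user_id -> deque of click timestamps

    def record(self, user_id, ts):
        self.clicks.setdefault(user_id, deque()).append(ts)

    def count(self, user_id, now):
        q = self.clicks.get(user_id, deque())
        # Evict timestamps that have fallen out of the window.
        while q and q[0] <= now - WINDOW_SECONDS:
            q.popleft()
        return len(q)

counter = SlidingClickCounter()
for ts in [0, 100, 250, 400]:
    counter.record("u1", ts)
print(counter.count("u1", now=420))  # only the clicks at 250 and 400 remain
```

Real stream processors make the same trade-off visible as window semantics (sliding vs tumbling) plus state expiry, which is worth naming explicitly in an interview.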
8. Recommendation Systems
The most common ML system design question in FAANG interviews.
Approaches
1. Collaborative Filtering:
"Users who liked X also liked Y"
User-Item Matrix:
Item1 Item2 Item3 Item4
User A [ 5 3 ? 1 ]
User B [ 4 ? 4 1 ]
User C [ ? 3 5 ? ]
Find similar users (cosine similarity) -> predict missing ratings
2. Content-Based:
"Since you liked sci-fi movie X, here's another sci-fi movie"
Item features -> similarity in feature space
User profile = aggregation of liked items' features
3. Hybrid (most practical):
Combine collaborative + content signals
Deep learning models (Two-Tower, DeepFM, Wide & Deep)
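Approach 1 can be sketched directly on the user-item matrix above. One simplification to flag: this treats unknown ratings as 0 inside the cosine, which is a common shortcut in toy examples but not what a production CF system would do.

```python
import numpy as np

# The user-item matrix from above, with 0 marking unknown ratings.
R = np.array([
    [5, 3, 0, 1],   # User A
    [4, 0, 4, 1],   # User B
    [0, 3, 5, 0],   # User C
])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Predict User A's missing rating for Item3 from the most similar other user.
sims = [cosine(R[0], R[1]), cosine(R[0], R[2])]
neighbor = 1 + int(np.argmax(sims))   # row index (B=1, C=2) of nearest neighbor
prediction = R[neighbor, 2]
print(f"most similar user: row {neighbor}, predicted rating: {prediction}")
```

User B turns out more similar to User A than User C is, so B's rating of 4 for Item3 becomes the prediction; a real system would average over many neighbors and weight by similarity.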
Two-Stage Recommendation Architecture
User opens app
|
+----v----------+
| Candidate | Stage 1: Retrieve ~1000 candidates from millions
| Generation | Methods: ANN search, collaborative filtering,
| (Recall) | popularity, category-based
+----+-----------+
| ~1000 candidates
+----v----------+
| Ranking | Stage 2: Score and rank 1000 candidates
| (Precision) | Model: Deep neural network with rich features
+----+-----------+ (user, item, context, cross features)
| ranked list
+----v----------+
| Re-ranking | Stage 3: Business rules, diversity, freshness
| (Policy) | Remove duplicates, enforce content policy
+----+-----------+ Inject sponsored content
|
+----v----------+
| Final Top-K | Return 20-50 items to user
+--------------+
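Stage 1 of the architecture above is worth sketching: in a two-tower setup, candidate generation is a nearest-neighbor search over precomputed item embeddings. This sketch does a brute-force top-k with NumPy (random vectors standing in for real tower outputs); production systems swap the brute-force search for ANN libraries such as FAISS or HNSW.

```python
import numpy as np

rng = np.random.default_rng(7)
item_embeddings = rng.normal(size=(1_000, 32))  # item tower output, precomputed
user_embedding = rng.normal(size=32)            # user tower output, per request

def retrieve_candidates(user_vec, item_matrix, k=100):
    """Stage 1: top-k items by dot product (brute force stand-in for ANN)."""
    scores = item_matrix @ user_vec
    top = np.argpartition(scores, -k)[-k:]       # O(n) unordered top-k
    return top[np.argsort(scores[top])[::-1]]    # sort only the k winners

candidates = retrieve_candidates(user_embedding, item_embeddings)
print(len(candidates), candidates[:5])
```

The design point: retrieval must be cheap enough to scan millions of items per request, so it uses a simple similarity in embedding space, leaving the expensive feature-rich model for the ~1000 survivors in Stage 2.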
9. Model Monitoring
Model in Production
|
+------------+------------+
| | |
+-----v-----+ +---v------+ +--v---------+
| Data | | Model | | Business |
| Quality | | Perf. | | Metrics |
| Monitoring| | Monitor | | Monitor |
+-----------+ +----------+ +------------+
Data Quality: missing features, schema violations, volume anomalies
Model Perf.: prediction distribution shifts, accuracy degradation
Business: CTR, revenue, user engagement
Key Monitoring Concepts
| Concept | What to Track | Action |
|---|---|---|
| Data drift | Feature distributions shift from training data | Retrain model |
| Concept drift | Relationship between features and labels changes | Retrain with recent data |
| Model degradation | Offline metrics (AUC) drop over time | Trigger retraining pipeline |
| Prediction distribution | Model outputs shift (e.g., predicting 0 for everything) | Alert, investigate |
| Latency | Inference time exceeds budget | Optimize model, scale infrastructure |
| Staleness | Features are too old | Fix feature pipeline |
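Data drift is often quantified with the Population Stability Index (PSI) between a training-time feature sample and a production sample. A minimal NumPy sketch is below; the 0.1/0.25 thresholds are an industry rule of thumb, not a law.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 investigate/retrain."""
    # Decile edges from the training-time distribution.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)  # avoid log(0) on empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(1)
train = rng.normal(0, 1, 50_000)
print(psi(train, rng.normal(0, 1, 50_000)))    # same distribution: low PSI
print(psi(train, rng.normal(0.5, 1, 50_000)))  # shifted mean: high PSI
```

In practice this runs per feature on a schedule, with a high PSI alerting the on-call and, at higher MLOps maturity, triggering the retraining pipeline automatically.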
Interview Tip
Always mention monitoring in ML system design. A model that isn't monitored will silently degrade. Mention that you'd track both model metrics (AUC, precision) AND business metrics (CTR, revenue) because a model can have great AUC but hurt business metrics due to feedback loops or population shift.
10. MLOps: CI/CD for ML
Traditional CI/CD:
Code change -> Build -> Test -> Deploy
ML CI/CD:
Code change ──┐
Data change ──┼──> Validate -> Train -> Evaluate -> Deploy -> Monitor
Config change ┘
Pipeline orchestration: Airflow, Kubeflow Pipelines, Metaflow
Model registry: MLflow, Vertex AI Model Registry
Experiment tracking: MLflow, Weights & Biases
MLOps Maturity Levels
| Level | Description | Automation |
|---|---|---|
| 0 | Manual | Scientists train locally, hand-off model file to eng |
| 1 | ML pipeline | Automated training pipeline, manual deployment |
| 2 | CI/CD for ML | Automated training, testing, and deployment |
| 3 | Full MLOps | Automated monitoring, retraining triggers, canary deployment |
11. Common ML System Design Interview Questions
| Question | Key Focus Areas |
|---|---|
| Design a recommendation engine | Two-stage (retrieval + ranking), feature store, A/B testing |
| Design ad click prediction | Real-time inference, feature engineering, calibration, auction integration |
| Design content ranking (News Feed) | Multi-objective ranking, freshness vs relevance, diversity |
| Design search ranking | Query understanding, document retrieval, learning-to-rank |
| Design content moderation | Multi-modal (text, image, video), human-in-the-loop, precision vs recall trade-off |
| Design fraud detection | Imbalanced data, real-time scoring, explainability for review |
| Design notification system | When/what to notify, user fatigue modeling, optimal send time |
| Design similar items | Embedding space, ANN (approximate nearest neighbor), cold start |
Framework for Answering ML System Design
1. Clarify the Problem (2 min)
- What are we optimizing? (clicks, revenue, engagement)
- Scale: users, items, QPS
- Latency requirements
2. Define Metrics (3 min)
- Offline: AUC, NDCG, precision@K, recall@K
- Online: CTR, conversion rate, session length, revenue
- Guardrail: latency, diversity, fairness
3. High-Level Architecture (10 min)
- Data pipeline -> Feature store -> Training -> Serving
- Candidate generation -> Ranking -> Re-ranking
- Online vs offline components
4. Deep Dive on Key Components (15 min)
- Feature engineering (what signals matter?)
- Model choice and training
- Serving infrastructure (batch vs real-time)
- Handling cold start
5. Monitoring & Iteration (5 min)
- A/B testing setup
- Model monitoring (drift, degradation)
- Feedback loops
12. Key Trade-offs Discussion
| Decision | Option A | Option B |
|---|---|---|
| Serving | Batch (simple, stale) | Real-time (complex, fresh) |
| Model complexity | Simple (logistic regression, fast) | Deep model (better accuracy, slower) |
| Features | Few curated features (fast, interpretable) | Many auto-features (better perf, harder to debug) |
| Retraining | Daily (stable, slightly stale) | Continuous (freshest, complex infra) |
| Embedding search | Exact (slow, perfect) | ANN/HNSW (fast, approximate) |
| Personalization | Global model (simpler) | Per-user model (better, expensive) |
13. Interview Checklist
14. Resources
- Book: "Designing Machine Learning Systems" by Chip Huyen (O'Reilly)
- Book: "Machine Learning System Design Interview" by Ali Aminian & Alex Xu
- Stanford CS 329S -- Machine Learning Systems Design (course materials online)
- Eugene Yan's Blog -- eugeneyan.com (ML systems at Amazon)
- Paper: "Wide & Deep Learning for Recommender Systems" (Google, 2016)
- Paper: "Deep Neural Networks for YouTube Recommendations" (Google, 2016)
- YouTube: Stanford MLSys Seminars
- Feast documentation -- feast.dev (open-source feature store)