46 - ML System Design Basics

Previous: 45 - Geo-Distributed Systems | Next: 47 - Interview Strategy & Frameworks


1. Why ML System Design in FAANG Interviews?

FAANG companies are ML-heavy. Many system design questions involve ML components:

| Company | ML-Centric Products                                            |
|---------|----------------------------------------------------------------|
| Google  | Search ranking, Ads click prediction, YouTube recommendations  |
| Meta    | News Feed ranking, ad targeting, content moderation            |
| Amazon  | Product recommendations, Alexa, fraud detection                |
| Netflix | Content recommendations, thumbnail personalization             |
| Apple   | Siri, Photos search, App Store ranking                         |

Interviewers test whether you can design the infrastructure around ML models, not whether you can derive backpropagation. You need to understand the full pipeline from data to serving.


2. The ML System Design Pipeline

+-------------+     +----------------+     +-------------+
| Data        |     | Feature        |     | Model       |
| Collection  | --> | Engineering    | --> | Training    |
| & Storage   |     | & Feature Store|     |             |
+-------------+     +----------------+     +------+------+
                                                  |
                                           +------v------+
                                           | Model       |
                                           | Evaluation  |
                                           +------+------+
                                                  |
                                           +------v------+
                                           | Model       |
                                           | Registry    |
                                           +------+------+
                                                  |
                                           +------v------+     +-------------+
                                           | Model       | --> | Monitoring  |
                                           | Serving     |     | & Feedback  |
                                           +-------------+     +-------------+

3. Feature Store

The feature store is a centralized system for managing, storing, and serving ML features.

                    +-------------------+
                    |  Feature Store    |
                    +-------------------+
                    |                   |
          +---------v------+   +--------v--------+
          | Offline Store  |   | Online Store    |
          | (training)     |   | (serving)       |
          | - Hive/S3/BQ   |   | - Redis/DynamoDB|
          | - batch compute|   | - low latency   |
          | - historical   |   | - point lookups |
          +----------------+   +-----------------+

Feature pipeline:
  Raw data (logs, events, DBs)
       |
  Feature computation (Spark, Flink)
       |
  +----+---- Offline store (for training batch jobs)
  |
  +--------- Online store (for real-time inference)

Why a Feature Store?

| Problem Without a Feature Store | Solution With a Feature Store                               |
|---------------------------------|-------------------------------------------------------------|
| Training/serving skew           | Same feature definitions used in both                       |
| Duplicated feature logic        | Centralized, reusable feature definitions                   |
| Stale features                  | Pipeline ensures freshness                                  |
| Point-in-time correctness       | Time-travel queries for training data (avoid label leakage) |
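To make the training/serving-skew point concrete, here is a minimal sketch (all names are hypothetical, not a real feature-store API) in which a single shared feature definition is materialized to both the offline and online stores, so training and serving compute the feature identically:

```python
from datetime import datetime, timedelta

def purchase_count_30d(purchases, as_of):
    """Shared feature logic: purchases in the 30 days before `as_of`."""
    cutoff = as_of - timedelta(days=30)
    return sum(1 for p in purchases if cutoff <= p["ts"] < as_of)

class TinyFeatureStore:
    def __init__(self):
        self.online = {}   # stand-in for Redis/DynamoDB point lookups
        self.offline = []  # stand-in for Hive/S3 historical, timestamped rows

    def materialize(self, user_id, purchases, as_of):
        value = purchase_count_30d(purchases, as_of)
        self.online[user_id] = value                  # latest value for serving
        self.offline.append((user_id, as_of, value))  # point-in-time row for training

    def get_online(self, user_id):
        return self.online.get(user_id, 0)

store = TinyFeatureStore()
now = datetime(2024, 1, 31)
purchases = [{"ts": datetime(2024, 1, 10)}, {"ts": datetime(2023, 11, 1)}]
store.materialize("u1", purchases, as_of=now)
print(store.get_online("u1"))  # 1: only the January purchase is within 30 days
```

Because the offline rows carry the `as_of` timestamp, a training job can reconstruct what the feature value was at label time, which is what point-in-time correctness means in practice.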

Real-World Feature Stores

| System       | Company     | Type                     |
|--------------|-------------|--------------------------|
| Feast        | Open source | Offline + online         |
| Tecton       | Startup     | Managed feature platform |
| Michelangelo | Uber        | Internal                 |
| Zipline      | Airbnb      | Internal                 |

4. Model Training Pipeline

  +------------------+
  | 1. Data Collection|
  | - Event logs      |
  | - User actions    |
  | - Clickstream     |
  +--------+---------+
           |
  +--------v---------+
  | 2. Data Processing|
  | - Clean, dedupe   |
  | - Join tables     |
  | - Handle missing  |
  | - Train/val/test  |
  +--------+---------+
           |
  +--------v---------+
  | 3. Feature Eng.   |
  | - Numerical: norm |
  | - Categorical: enc|
  | - Text: embed     |
  | - Time: windows   |
  +--------+---------+
           |
  +--------v---------+
  | 4. Model Training |
  | - Select algorithm|
  | - Hyperparameter  |
  |   tuning          |
  | - Cross-validation|
  | - Distributed     |
  |   training (GPU)  |
  +--------+---------+
           |
  +--------v---------+
  | 5. Evaluation     |
  | - Offline metrics |
  |   (AUC, NDCG, F1)|
  | - Bias/fairness   |
  | - Latency budget  |
  +--------+---------+
           |
  +--------v---------+
  | 6. Model Registry |
  | - Version control |
  | - Metadata: params|
  |   metrics, lineage|
  | - Approval gate   |
  +------------------+

Key Training Considerations

| Concern              | Detail                                                                           |
|----------------------|----------------------------------------------------------------------------------|
| Training data volume | TB-scale at FAANG; use distributed training (Spark, Ray, Horovod)                |
| Freshness            | How often to retrain? Daily, weekly, or continuously?                            |
| Label quality        | Implicit labels (clicks) are noisy; explicit labels (ratings) are sparse         |
| Data leakage         | Never use future data to predict past events; use time-based splits              |
| Class imbalance      | Click-through rate is 1-2% positive; use downsampling, SMOTE, or a weighted loss |
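Two of these concerns, time-based splits against leakage and weighting against class imbalance, can be sketched in a few lines (illustrative helpers, not a library API):

```python
def time_based_split(rows, split_ts):
    """Train on events strictly before split_ts, validate on the rest,
    so the model never sees the future at training time."""
    train = [r for r in rows if r["ts"] < split_ts]
    val = [r for r in rows if r["ts"] >= split_ts]
    return train, val

def positive_class_weight(labels):
    """Weight positives by the neg/pos ratio, e.g. for a weighted loss."""
    pos = sum(labels)
    neg = len(labels) - pos
    return neg / pos if pos else 1.0

rows = [
    {"ts": 1, "clicked": 0}, {"ts": 2, "clicked": 0},
    {"ts": 3, "clicked": 1}, {"ts": 4, "clicked": 0},
    {"ts": 5, "clicked": 0},
]
train, val = time_based_split(rows, split_ts=4)
print(len(train), len(val))  # 3 2
print(positive_class_weight([r["clicked"] for r in train]))  # 2.0
```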

5. Model Serving: Batch vs Real-Time

Batch Inference:                    Real-Time Inference:
  Run model on all items            Run model per request
  periodically (hourly/daily)       on-demand
                                    
  +----------+     +--------+       Request --> +--------+ --> Response
  | All items| --> | Model  |                   | Model  |
  | (batch)  |     | Server |       User query  | Server |  Top-K results
  +----------+     +---+----+       + features   +--------+
                       |
                  +----v-----+
                  | Result   |
                  | Store    |
                  | (Redis)  |
                  +----------+

| Aspect          | Batch                                | Real-Time                       |
|-----------------|--------------------------------------|---------------------------------|
| Latency         | Pre-computed, instant serving        | 10-100 ms inference per request |
| Freshness       | Stale (hours old)                    | Live (current context)          |
| Cost            | Cheaper (run once)                   | Expensive (GPU per request)     |
| Personalization | Limited (no session context)         | Rich (current session, query)   |
| Use case        | Email recommendations, weekly digest | Search ranking, feed ranking    |

Hybrid Approach (Common in Practice)

Stage 1: Batch - Generate candidate pool (1000s of items per user)
         Run nightly, store in Redis/DynamoDB

Stage 2: Real-time - Re-rank candidates using live features
         User opens app -> fetch candidates -> re-rank with current context
         Return top 20

This avoids scoring all items in real-time (too slow)
while incorporating live signals (session behavior, time of day)
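A minimal sketch of this hybrid flow, with hypothetical names and a toy scoring rule standing in for the batch model and the live re-ranker:

```python
CANDIDATE_STORE = {}  # stand-in for Redis/DynamoDB, keyed by user_id

def nightly_batch_job(all_items, user_id):
    # Stage 1: precompute a candidate pool (here, top items by a batch score).
    ranked = sorted(all_items, key=lambda i: i["batch_score"], reverse=True)
    CANDIDATE_STORE[user_id] = ranked[:1000]

def serve_request(user_id, session_context, k=20):
    # Stage 2: fetch precomputed candidates and re-rank with live signals.
    candidates = CANDIDATE_STORE.get(user_id, [])
    boost = session_context.get("category_affinity", {})
    rescored = sorted(
        candidates,
        key=lambda i: i["batch_score"] + boost.get(i["category"], 0.0),
        reverse=True,
    )
    return rescored[:k]

items = [
    {"id": 1, "category": "sports", "batch_score": 0.9},
    {"id": 2, "category": "news", "batch_score": 0.8},
    {"id": 3, "category": "news", "batch_score": 0.7},
]
nightly_batch_job(items, user_id="u1")
top = serve_request("u1", {"category_affinity": {"news": 0.3}}, k=2)
print([i["id"] for i in top])  # [2, 3]: live news affinity reorders the pool
```

The expensive scoring happens once per night; the request path only re-scores a small, precomputed pool.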

6. A/B Testing Infrastructure

                    User Request
                         |
                    +----v-----+
                    | Feature  |
                    | Flags /  |
                    | Experiment|
                    | Service  |
                    +----+-----+
                         |
              +----------+----------+
              |                     |
        Control (50%)         Treatment (50%)
              |                     |
        Model v1              Model v2
              |                     |
        +-----v------+       +-----v------+
        | Track:     |       | Track:     |
        | - clicks   |       | - clicks   |
        | - revenue  |       | - revenue  |
        | - sessions |       | - sessions |
        +------------+       +------------+
                     \         /
                  +---------------+
                  | Stats Engine  |
                  | (p-value, CI) |
                  +---------------+

A/B Testing Key Concepts

| Concept                  | Detail                                                                       |
|--------------------------|------------------------------------------------------------------------------|
| Metric hierarchy         | Primary (e.g., revenue), secondary (e.g., clicks), guardrail (e.g., latency) |
| Statistical significance | p-value < 0.05; typically needs 1-2 weeks of data                            |
| Sample size              | Enough traffic per variant for statistical power                             |
| Novelty effect           | New models may perform better initially; measure long-term impact            |
| Network effects          | Social features need cluster-based (not user-based) randomization            |
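As a sketch of what the stats engine computes, here is a two-sided two-proportion z-test on CTR (a common choice for binary metrics; the function name and the example numbers are illustrative):

```python
from statistics import NormalDist

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in click rates between variants."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)        # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Control: 1.0% CTR, Treatment: 1.2% CTR, 50k impressions each.
z, p = two_proportion_z_test(clicks_a=500, n_a=50_000, clicks_b=600, n_b=50_000)
print(round(z, 2), round(p, 4))  # significant at p < 0.05
```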

7. Feature Engineering at Scale

Common Feature Types

| Type               | Examples                                            | Computation                  |
|--------------------|-----------------------------------------------------|------------------------------|
| User features      | Age, country, account age, purchase history         | Batch (daily update)         |
| Item features      | Category, price, rating, description embedding      | Batch                        |
| Cross features     | User-item interaction count, user-category affinity | Batch or streaming           |
| Context features   | Time of day, device type, current session length    | Real-time                    |
| Embedding features | User embedding, item embedding (learned)            | Training-time, served online |

Feature Computation Patterns

Batch features (Spark/BigQuery, daily):
  user_purchase_count_30d = COUNT(purchases WHERE date > now - 30d)
  item_avg_rating = AVG(ratings) GROUP BY item_id

Streaming features (Flink/Kafka Streams, real-time):
  user_click_count_last_5min = COUNT(clicks WHERE timestamp > now - 5min)
  trending_score = COUNT(views_last_1h) / COUNT(views_last_24h)

Near-real-time features (micro-batch, 5-15 min):
  user_session_length = now - session_start
  cart_total = SUM(item prices in current cart)
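A streaming feature such as user_click_count_last_5min can be sketched as a per-user sliding window, a toy in-memory version of the state that Flink or Kafka Streams would maintain:

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events in the trailing window; evicts expired timestamps."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps of events still inside the window

    def record(self, ts):
        self.events.append(ts)
        self._evict(ts)

    def count(self, now):
        self._evict(now)
        return len(self.events)

    def _evict(self, now):
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()

clicks = SlidingWindowCounter(window_seconds=300)
for ts in [0, 100, 200, 400]:
    clicks.record(ts)
print(clicks.count(now=420))  # 2: clicks at t=0 and t=100 have expired
```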

8. Recommendation Systems

The most common ML system design question in FAANG interviews.

Approaches

1. Collaborative Filtering:
   "Users who liked X also liked Y"
   
   User-Item Matrix:
              Item1  Item2  Item3  Item4
   User A  [  5      3      ?      1   ]
   User B  [  4      ?      4      1   ]
   User C  [  ?      3      5      ?   ]
   
   Find similar users (cosine similarity) -> predict missing ratings

2. Content-Based:
   "Since you liked sci-fi movie X, here's another sci-fi movie"
   
   Item features -> similarity in feature space
   User profile = aggregation of liked items' features

3. Hybrid (most practical):
   Combine collaborative + content signals
   Deep learning models (Two-Tower, DeepFM, Wide & Deep)
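User-based collaborative filtering on the matrix above can be sketched as follows: cosine similarity over co-rated items, then a similarity-weighted average to fill in a missing rating (a toy illustration with `None` for unknown ratings, not production code):

```python
import math

ratings = {
    "A": [5, 3, None, 1],
    "B": [4, None, 4, 1],
    "C": [None, 3, 5, None],
}

def cosine(u, v):
    """Cosine similarity restricted to items both users have rated."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return 0.0
    dot = sum(a * b for a, b in pairs)
    nu = math.sqrt(sum(a * a for a, _ in pairs))
    nv = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (nu * nv)

def predict(user, item_idx):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, row in ratings.items():
        if other == user or row[item_idx] is None:
            continue
        sim = cosine(ratings[user], ratings[other])
        num += sim * row[item_idx]
        den += abs(sim)
    return num / den if den else None

print(round(predict("A", 2), 2))  # 4.5: User A's predicted rating for Item3
```

At FAANG scale this naive all-pairs approach is replaced by matrix factorization or learned embeddings with approximate nearest-neighbor search, but the intuition is the same.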

Two-Stage Recommendation Architecture

  User opens app
       |
  +----v----------+
  | Candidate      |  Stage 1: Retrieve ~1000 candidates from millions
  | Generation     |  Methods: ANN search, collaborative filtering,
  | (Recall)       |           popularity, category-based
  +----+-----------+
       |  ~1000 candidates
  +----v----------+
  | Ranking        |  Stage 2: Score and rank 1000 candidates
  | (Precision)    |  Model: Deep neural network with rich features
  +----+-----------+  (user, item, context, cross features)
       |  ranked list
  +----v----------+
  | Re-ranking     |  Stage 3: Business rules, diversity, freshness
  | (Policy)       |  Remove duplicates, enforce content policy
  +----+-----------+  Inject sponsored content
       |
   +----v-----------+
   | Final Top-K    |  Return 20-50 items to user
   +----------------+

9. Model Monitoring

                Model in Production
                       |
          +------------+------------+
          |            |            |
    +-----v-----+ +---v------+ +--v---------+
    | Data      | | Model    | | Business   |
    | Quality   | | Perf.    | | Metrics    |
    | Monitoring| | Monitor  | | Monitor    |
    +-----------+ +----------+ +------------+
    - Missing    - Prediction   - CTR
      features     distribution  - Revenue
    - Schema      shifts        - User
      violations  - Accuracy      engagement
    - Volume       degradation
      anomalies

Key Monitoring Concepts

| Concept                 | What to Track                                           | Action                                   |
|-------------------------|---------------------------------------------------------|------------------------------------------|
| Data drift              | Feature distributions shift from training data          | Retrain the model                        |
| Concept drift           | Relationship between features and labels changes        | Retrain with recent data                 |
| Model degradation       | Offline metrics (AUC) drop over time                    | Trigger the retraining pipeline          |
| Prediction distribution | Model outputs shift (e.g., predicting 0 for everything) | Alert and investigate                    |
| Latency                 | Inference time exceeds budget                           | Optimize the model, scale infrastructure |
| Staleness               | Features are too old                                    | Fix the feature pipeline                 |
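One common way to quantify data drift is the Population Stability Index (PSI) between a feature's training-time distribution and its live distribution; the sketch below assumes pre-binned histograms, and the 0.2 alert threshold is a widely used rule of thumb rather than a universal constant:

```python
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """PSI over pre-binned distributions (bin fractions summing to ~1)."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

train_dist = [0.25, 0.25, 0.25, 0.25]  # feature histogram at training time
live_dist = [0.10, 0.20, 0.30, 0.40]   # same histogram in production
score = psi(train_dist, live_dist)
print(round(score, 3), "drift!" if score > 0.2 else "ok")  # 0.228 drift!
```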

Interview Tip

Always mention monitoring in ML system design: a model that isn't monitored will silently degrade. Say that you would track both model metrics (AUC, precision) and business metrics (CTR, revenue), because a model can have a great AUC yet still hurt business metrics through feedback loops or population shift.


10. MLOps: CI/CD for ML

Traditional CI/CD:
  Code change -> Build -> Test -> Deploy

ML CI/CD:
  Code change ──┐
  Data change ──┼──> Validate -> Train -> Evaluate -> Deploy -> Monitor
  Config change ┘

Pipeline orchestration: Airflow, Kubeflow Pipelines, Metaflow
Model registry: MLflow, Vertex AI Model Registry
Experiment tracking: MLflow, Weights & Biases
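The "approval gate" step mentioned under the model registry can be sketched as a simple deployment check: a newly trained model ships only if it beats the registered baseline on the offline metric and stays within the latency guardrail (registry contents, thresholds, and names here are all hypothetical):

```python
# Stand-in for a model registry entry for the current production model.
MODEL_REGISTRY = {"prod": {"version": 3, "auc": 0.81, "p99_latency_ms": 45}}

def approve_deployment(candidate, min_auc_gain=0.002, max_p99_ms=50):
    """Gate: require a minimum AUC improvement and a latency budget."""
    baseline = MODEL_REGISTRY["prod"]
    if candidate["auc"] < baseline["auc"] + min_auc_gain:
        return False, "AUC did not improve enough over baseline"
    if candidate["p99_latency_ms"] > max_p99_ms:
        return False, "exceeds latency guardrail"
    return True, "approved"

ok, reason = approve_deployment({"version": 4, "auc": 0.82, "p99_latency_ms": 48})
print(ok, reason)  # True approved
```

In a real pipeline this gate would run automatically after evaluation, with the candidate then promoted via canary or shadow deployment rather than a direct swap.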

MLOps Maturity Levels

| Level | Description  | Automation                                                      |
|-------|--------------|-----------------------------------------------------------------|
| 0     | Manual       | Scientists train locally and hand the model file to engineering |
| 1     | ML pipeline  | Automated training pipeline, manual deployment                  |
| 2     | CI/CD for ML | Automated training, testing, and deployment                     |
| 3     | Full MLOps   | Automated monitoring, retraining triggers, canary deployment    |

11. Common ML System Design Interview Questions

| Question                           | Key Focus Areas                                                                     |
|------------------------------------|-------------------------------------------------------------------------------------|
| Design a recommendation engine     | Two-stage (retrieval + ranking), feature store, A/B testing                         |
| Design ad click prediction         | Real-time inference, feature engineering, calibration, auction integration         |
| Design content ranking (News Feed) | Multi-objective ranking, freshness vs. relevance, diversity                         |
| Design search ranking              | Query understanding, document retrieval, learning-to-rank                           |
| Design content moderation          | Multi-modal (text, image, video), human-in-the-loop, precision vs. recall trade-off |
| Design fraud detection             | Imbalanced data, real-time scoring, explainability for review                       |
| Design a notification system       | When/what to notify, user fatigue modeling, optimal send time                       |
| Design similar items               | Embedding space, ANN (approximate nearest neighbor), cold start                     |

Framework for Answering ML System Design

1. Clarify the Problem (2 min)
   - What are we optimizing? (clicks, revenue, engagement)
   - Scale: users, items, QPS
   - Latency requirements

2. Define Metrics (3 min)
   - Offline: AUC, NDCG, precision@K, recall@K
   - Online: CTR, conversion rate, session length, revenue
   - Guardrail: latency, diversity, fairness

3. High-Level Architecture (10 min)
   - Data pipeline -> Feature store -> Training -> Serving
   - Candidate generation -> Ranking -> Re-ranking
   - Online vs offline components

4. Deep Dive on Key Components (15 min)
   - Feature engineering (what signals matter?)
   - Model choice and training
   - Serving infrastructure (batch vs real-time)
   - Handling cold start

5. Monitoring & Iteration (5 min)
   - A/B testing setup
   - Model monitoring (drift, degradation)
   - Feedback loops

12. Key Trade-offs Discussion

| Decision         | Option A                                   | Option B                                                           |
|------------------|--------------------------------------------|--------------------------------------------------------------------|
| Serving          | Batch (simple, stale)                      | Real-time (complex, fresh)                                         |
| Model complexity | Simple (logistic regression, fast)         | Deep model (better accuracy, slower)                               |
| Features         | Few curated features (fast, interpretable) | Many auto-generated features (better performance, harder to debug) |
| Retraining       | Daily (stable, slightly stale)             | Continuous (freshest, complex infrastructure)                      |
| Embedding search | Exact (slow, perfect)                      | ANN/HNSW (fast, approximate)                                       |
| Personalization  | Global model (simpler)                     | Per-user model (better, expensive)                                 |

13. Interview Checklist

  • Clarified optimization objective and scale
  • Described the full pipeline (data -> features -> training -> serving -> monitoring)
  • Explained feature store (online + offline, training-serving consistency)
  • Discussed model serving strategy (batch vs real-time vs hybrid)
  • Covered A/B testing infrastructure
  • Addressed recommendation system architecture (retrieval + ranking)
  • Mentioned model monitoring (data drift, concept drift, business metrics)
  • Discussed cold-start problem and solutions
  • Outlined MLOps (CI/CD for ML, automated retraining)

14. Resources

  • Book: "Designing Machine Learning Systems" by Chip Huyen (O'Reilly)
  • Book: "Machine Learning System Design Interview" by Ali Aminian & Alex Xu
  • Stanford CS 329S -- Machine Learning Systems Design (course materials online)
  • Eugene Yan's blog -- eugeneyan.com (ML systems at Amazon)
  • Paper: "Wide & Deep Learning for Recommender Systems" (Google, 2016)
  • Paper: "Deep Neural Networks for YouTube Recommendations" (Google, 2016)
  • YouTube: Stanford MLSys Seminars
  • Feast documentation -- feast.dev (open-source feature store)
