26 - Security in Distributed Systems

Previous: 25 - Monitoring, Logging & Tracing | Next: 27 - Design URL Shortener

Why This Matters in Interviews

Security is a first-class concern in every FAANG system design interview. Interviewers probe: "How does Service A trust Service B?", "Where do you store secrets?", "How do you prevent token theft?" Demonstrating security thinking elevates your answer from good to great.

Authentication vs Authorization

Authentication (AuthN):  WHO are you?     --> Identity verification
Authorization  (AuthZ):  WHAT can you do? --> Permission checking

Flow:
  User --> [Authenticate: verify identity] --> [Authorize: check permissions] --> Resource

Aspect	Authentication	Authorization
Question	"Who is this?"	"Can they do this?"
Mechanism	Passwords, tokens, certificates	Roles, policies, ACLs
Failure	401 Unauthorized	403 Forbidden
Happens	First	After authentication
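A minimal sketch of how the two checks map to the status codes in the table. `TOKENS`, `PERMISSIONS`, and `handle_request` are hypothetical names over in-memory data, not part of any framework:

```python
from typing import Optional

# Hypothetical in-memory stores, for illustration only.
TOKENS = {"token-alice": "alice", "token-bob": "bob"}
PERMISSIONS = {
    "alice": {"read_report"},
    "bob": {"read_report", "delete_report"},
}


def handle_request(token: Optional[str], action: str) -> int:
    """Return the HTTP status a gateway might send for this request."""
    # Step 1, authentication: who is this?
    user = TOKENS.get(token) if token else None
    if user is None:
        return 401  # identity could not be established

    # Step 2, authorization: can they do this?
    if action not in PERMISSIONS.get(user, set()):
        return 403  # identity is known, but the action is not allowed

    return 200
```

An unknown token never reaches the permission check, which is why authentication always comes first.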

OAuth 2.0

OAuth 2.0 is a delegation framework -- it lets a third-party app access resources on behalf of a user without sharing credentials.

Authorization Code Flow (Most Common for Web Apps)

+--------+                               +---------------+
|        |---(1) Authorization Request-->|               |
|        |                               | Authorization |
| Client |<--(2) Authorization Code------|    Server     |
| (App)  |                               |               |
|        |---(3) Code + Client Secret--->|               |
|        |<--(4) Access Token + Refresh--|               |
+--------+                               +---------------+
     |                                          |
     |---(5) API Request with Access Token----->|
     |                                   +------+------+
     |                                   |  Resource   |
     |<--(6) Protected Resource----------|   Server    |
     |                                   +-------------+

Client Credentials Flow (Service-to-Service)

+----------+                          +---------------+
|          |---(1) Client ID +------->|               |
|  Service |     Client Secret        | Authorization |
|    A     |<--(2) Access Token ------|    Server     |
|          |                          +---------------+
|          |---(3) Call Service B with token---------->|
+----------+                                     +-----+-----+
                                                 | Service B |
                                                 +-----------+

No user involved. Used for backend-to-backend communication.

OAuth 2.0 Flow Comparison

Flow	Use Case	Involves User?	Secrets on Client?
Authorization Code	Web apps, mobile (with PKCE)	Yes	Server-side only
Client Credentials	Service-to-service	No	Yes (server-to-server)
Implicit (deprecated)	Legacy SPAs	Yes	No (tokens in URL)
Device Code	Smart TVs, CLI tools	Yes (on separate device)	No
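The table mentions PKCE for mobile clients. A rough sketch of the S256 challenge derivation, assuming the standard BASE64URL(SHA256(verifier)) construction; the function names are illustrative:

```python
import base64
import hashlib
import hmac
import secrets


def make_code_verifier() -> str:
    # 32 random bytes encode to 43 URL-safe characters once padding is removed.
    raw = secrets.token_bytes(32)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def make_code_challenge(verifier: str) -> str:
    # S256 method: BASE64URL(SHA256(verifier)) with padding removed.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def server_verifies(verifier: str, stored_challenge: str) -> bool:
    # At token exchange the authorization server recomputes the challenge
    # from the verifier and compares it with the one sent in step (1).
    return hmac.compare_digest(make_code_challenge(verifier), stored_challenge)
```

The client sends the challenge with the authorization request and the verifier with the code exchange, so an intercepted authorization code alone is not enough to obtain tokens.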

OpenID Connect (OIDC)

OIDC is an identity layer on top of OAuth 2.0. OAuth gives you an access token (authorization); OIDC also gives you an ID token (authentication).

OAuth 2.0:   "Here's a token to access the user's photos"  (AuthZ)
OIDC:        "Here's proof that this is user@example.com"   (AuthN + AuthZ)

ID Token = JWT containing claims about the user (sub, email, name, etc.)

JWT (JSON Web Tokens)

Structure

Header.Payload.Signature

eyJhbGciOiJSUzI1NiJ9.eyJzdWIiOiIxMjM0NSIsImVtYWlsIjoiam9obkBleC5jb20ifQ.signature
|_____________________||_____________________________________________||_________|
       Header                          Payload                        Signature

Part	Contains	Encoding
Header	Algorithm (e.g. RS256), token type (JWT)	Base64url
Payload	Claims: sub, exp, iat, iss, custom data	Base64url
Signature	HMAC, RSA, or ECDSA signature over the encoded header + payload	Base64url
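To make the three-part structure concrete, here is a small sketch that signs and verifies a token using only the standard library. It assumes HS256 (a shared secret) because the standard library has no RSA support; the example token above uses RS256, where verification would use a public key instead. In practice a maintained JWT library is the safer choice.

```python
import base64
import hashlib
import hmac
import json


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    # Restore the padding that JWTs strip.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url_encode(signature)}"


def verify_hs256(token: str, secret: bytes) -> dict:
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        signature = b64url_decode(signature_b64)
    except ValueError as exc:
        raise ValueError("malformed token") from exc
    # Pin the algorithm; never let the token header choose it.
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))
```

Note that the signature covers the encoded header and payload, so changing a single claim invalidates the token. Expiry and issuer checks are deliberately left out of this sketch.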

Pros and Cons

Pros	Cons
Stateless (no server-side session lookup)	Cannot be revoked individually before expiry without a server-side denylist
Contains claims (roles, user info)	Payload is readable (Base64url-encoded, not encrypted)
Works across services (shared secret or public key)	Larger than opaque tokens
Standard format (interoperable)	Clock skew issues with expiry
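Two of the cons above (revocation and clock skew) are usually handled at claim-validation time. A hedged sketch, assuming the signature has already been verified; `REVOKED_JTIS`, `check_claims`, and the 30-second leeway are illustrative choices:

```python
import time
from typing import Optional

# Hypothetical denylist of revoked token IDs (the "jti" claim).
# In practice this would be a shared cache whose entries expire with the token.
REVOKED_JTIS = set()


def check_claims(claims: dict, now: Optional[float] = None, leeway: float = 30.0) -> bool:
    """Validate time-based claims with a small allowance for clock skew."""
    now = time.time() if now is None else now

    exp = claims.get("exp")
    if exp is None or now > exp + leeway:
        return False  # expired, even after allowing `leeway` seconds of skew

    nbf = claims.get("nbf")
    if nbf is not None and now < nbf - leeway:
        return False  # not valid yet

    if claims.get("jti") in REVOKED_JTIS:
        return False  # individually revoked before expiry

    return True
```

The denylist lookup reintroduces a server-side check, which is the trade-off behind the "stateless" advantage: short expiry keeps the list small.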

Refresh Token Pattern

+--------+                        +----------+
| Client |---(1) Login----------->| Auth      |
|        |<--(2) Access (15min) + | Server    |
|        |       Refresh (7d)-----|           |
|        |                        +----------+
|        |---(3) API call with Access Token-->| API |
|        |                                    |     |
|        |  (Access token expires)            |     |
|        |                                    +-----+
|        |---(4) Refresh Token--------------->| Auth  |
|        |<--(5) New Access + New Refresh-----|       |
+--------+                                    +-------+

Security rules for refresh tokens:

Store securely (HttpOnly, Secure cookie; not localStorage)
Rotate on every use (detect theft if old one is reused)
Bind to device/IP when possible
Shorter lifetimes for sensitive applications
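The "rotate on every use" rule can be sketched as follows. This is an in-memory illustration with hypothetical names (`RefreshTokenStore`, token "families"); a real store would persist hashed tokens with expiry times:

```python
import secrets
from typing import Dict, Optional, Set


class RefreshTokenStore:
    """Rotation with reuse detection, kept in memory for illustration."""

    def __init__(self) -> None:
        self._family_of: Dict[str, str] = {}      # token -> family id (one per login)
        self._used: Set[str] = set()              # tokens already exchanged
        self._revoked_families: Set[str] = set()  # families killed after suspected theft

    def issue(self, family_id: Optional[str] = None) -> str:
        token = secrets.token_urlsafe(32)
        self._family_of[token] = family_id or secrets.token_hex(8)
        return token

    def rotate(self, token: str) -> Optional[str]:
        """Exchange a refresh token for a new one, or return None if rejected."""
        family_id = self._family_of.get(token)
        if family_id is None or family_id in self._revoked_families:
            return None
        if token in self._used:
            # An already-exchanged token came back: assume it was stolen
            # and revoke every token descended from the same login.
            self._revoked_families.add(family_id)
            return None
        self._used.add(token)
        return self.issue(family_id)
```

If an attacker and the legitimate client both hold the same refresh token, whichever uses it second triggers the revocation, forcing a fresh login.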

API Key Management

Concern	Best Practice
Storage	Hash API keys before storing (like passwords)
Transmission	Always over HTTPS, in headers not URLs
Rotation	Support multiple active keys for zero-downtime rotation
Scoping	Limit keys to specific endpoints/permissions
Rate limiting	Tie rate limits to API keys
Revocation	Immediate revocation capability
Logging	Log key usage but never log the key itself
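A small sketch of the storage rule: only a hash of the key is kept, and comparison is constant-time. `KEY_STORE` and the `key_id.secret` format are assumptions for illustration. Because the secret is long and random, a fast hash such as SHA-256 is assumed to be sufficient here, unlike low-entropy passwords, which need a slow password hash.

```python
import hashlib
import hmac
import secrets

# Hypothetical store: key_id -> SHA-256 hex digest of the secret part.
KEY_STORE = {}


def create_api_key() -> str:
    key_id = secrets.token_hex(4)       # public identifier, safe to log
    secret = secrets.token_urlsafe(32)  # high-entropy secret part
    KEY_STORE[key_id] = hashlib.sha256(secret.encode("utf-8")).hexdigest()
    return f"{key_id}.{secret}"         # shown to the caller once, never stored


def verify_api_key(presented: str) -> bool:
    key_id, _, secret = presented.partition(".")
    stored = KEY_STORE.get(key_id)
    if stored is None:
        return False
    candidate = hashlib.sha256(secret.encode("utf-8")).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(candidate, stored)
```

Logging `key_id` rather than the full key satisfies the "log usage, never the key" row, and revocation becomes a delete from the store.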

mTLS (Mutual TLS)

Standard TLS: client verifies server. mTLS: both sides verify each other.

Standard TLS:
  Client ---[verify server cert]---> Server
  Client <------[encrypted]-------> Server

Mutual TLS:
  Client ---[verify server cert]---> Server
  Client <--[verify client cert]---- Server
  Client <------[encrypted]-------> Server

Use case: Service-to-service communication in microservices (service mesh).

Service A (client cert) <--mTLS--> Service B (server cert)

Managed by service mesh (Istio, Linkerd):
  - Automatic certificate issuance and rotation
  - Zero application code changes
  - Every connection authenticated and encrypted

Benefits:

Strong identity for every service (not just "someone with a valid token")
Encryption in transit by default
No shared secrets to manage
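A service mesh configures this without application changes, but the server side of mTLS can be sketched with Python's `ssl` module to show what "verify client cert" means. The function name and file-path parameters are hypothetical:

```python
import ssl
from typing import Optional


def build_mtls_server_context(
    certfile: Optional[str] = None,
    keyfile: Optional[str] = None,
    client_ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    # Purpose.CLIENT_AUTH yields a server-side context.
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2

    # This is what makes the TLS "mutual": the handshake fails unless the
    # client presents a certificate that chains to a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED

    if certfile and keyfile:
        # The server's own identity (what the client verifies).
        ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    if client_ca_file:
        # The CA allowed to sign client (service) certificates.
        ctx.load_verify_locations(cafile=client_ca_file)
    return ctx
```

A plain TLS server would leave `verify_mode` at `ssl.CERT_NONE`; the single `CERT_REQUIRED` line is the difference between the two diagrams above.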

Zero Trust Architecture

Traditional (castle-and-moat):
  [Firewall] --> Trusted internal network --> Services trust each other

Zero Trust:
  Every request authenticated + authorized, regardless of network location
  "Never trust, always verify"

Principles

Principle	Implementation
Verify explicitly	AuthN + AuthZ on every request
Least privilege	Minimal permissions, scoped tokens
Assume breach	Encrypt everything, segment networks, monitor all

Zero Trust in Practice

Request --> [Identity Proxy/Gateway]
              |
              +--> Authenticate (JWT, mTLS, device cert)
              +--> Authorize (policy engine: OPA, Zanzibar-style service)
              +--> Check device health
              +--> Check network context
              +--> Allow/Deny
              |
              v
           Service

RBAC vs ABAC

RBAC (Role-Based Access Control)

User --> has Role --> Role has Permissions

Example:
  alice --> "editor" --> [create_post, edit_post, delete_own_post]
  bob   --> "admin"  --> [create_post, edit_post, delete_any_post, manage_users]
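The example above translates almost directly into two lookups. A minimal sketch, with the role and permission names taken from the example and the helper name `is_allowed` assumed:

```python
ROLE_PERMISSIONS = {
    "editor": {"create_post", "edit_post", "delete_own_post"},
    "admin": {"create_post", "edit_post", "delete_any_post", "manage_users"},
}
USER_ROLES = {"alice": {"editor"}, "bob": {"admin"}}


def is_allowed(user: str, permission: str) -> bool:
    # User -> roles -> permissions: allowed if any role grants the permission.
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )
```

Auditing is a matter of reading the two tables, which is why RBAC is easy to reason about.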

ABAC (Attribute-Based Access Control)

Policy: ALLOW if user.department == resource.department
                AND user.clearance >= resource.classification
                AND time.current BETWEEN 09:00 AND 18:00
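The policy above can be sketched as a single predicate over attributes. The `User` and `Resource` types and the integer clearance scale are assumptions for illustration:

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class User:
    department: str
    clearance: int


@dataclass(frozen=True)
class Resource:
    department: str
    classification: int


def abac_allow(user: User, resource: Resource, now: time) -> bool:
    # Every condition in the policy must hold; there are no roles involved.
    return (
        user.department == resource.department
        and user.clearance >= resource.classification
        and time(9, 0) <= now <= time(18, 0)
    )
```

The decision depends on the request context (here, the time of day), which a static role table cannot express.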

Comparison

Aspect	RBAC	ABAC
Complexity	Simple	Complex
Granularity	Coarse (role-level)	Fine (attribute-level)
Scalability	Role explosion with complex rules	Scales with policy engine
Auditability	Easy (who has what role?)	Harder (policy evaluation)
Best for	Most applications	Multi-tenant, regulatory, context-dependent

Google Zanzibar: A relationship-based access control (ReBAC) system used at Google scale. It inspired open-source projects such as SpiceDB and Ory Keto.

Rate Limiting for Security

Beyond traffic management, rate limiting is a security control:

Attack	Rate Limiting Defense
Brute force login	Max 5 attempts per account per 15 min
Credential stuffing	Max 100 login attempts per IP per hour
API abuse	Per-key request limits
Scraping	Per-IP + per-session limits
DDoS	Global rate limits + geographic filtering
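The "max 5 attempts per account per 15 min" row could be implemented with a sliding-window log. A single-process sketch with illustrative names; a distributed deployment would keep the counters in a shared store:

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    def __init__(self, max_attempts: int, window_seconds: float) -> None:
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = defaultdict(deque)  # key -> timestamps of recent attempts

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        attempts = self._attempts[key]
        # Drop attempts that have aged out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False
        attempts.append(now)
        return True


login_limiter = SlidingWindowLimiter(max_attempts=5, window_seconds=15 * 60)
```

Keying by account defends against brute force on one user; keying a second limiter by IP covers the credential-stuffing row. Rejected attempts are not recorded in this sketch, which is a design choice rather than a requirement.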

Encryption

Three Layers

At Rest:        Data on disk encrypted (AES-256)
                [Disk] --> [Encrypted blocks]

In Transit:     Data between services encrypted (TLS 1.3)
                Service A --[TLS]--> Service B

End-to-End:     Data encrypted by sender, decrypted only by recipient
                User A --[E2E encrypted]--> User B
                (Server cannot read the data)

Type	Protects Against	Implementation
At rest	Disk theft, unauthorized DB access	AES-256, KMS-managed keys
In transit	Network sniffing, MITM	TLS 1.3, mTLS
End-to-end	Server compromise, insider threat	Client-side encryption (Signal protocol)

Key Management

+--------+     wraps     +----------+     wraps      +---------+
|  Root  | ------------> |  Master  | -------------> |  Data   |
|  Key   |   (HSM/KMS)   |   Key    |   (envelope)   |  Key    |
| (never |               | (per     |                | (per    |
| leaves |               | service) |                | record) |
|  HSM)  |               +----------+                +---------+
+--------+

Envelope encryption: Encrypt data with a data key, encrypt the data key with a master key. Only the encrypted data key is stored alongside the data.
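The flow can be sketched as below. Important assumption: the standard library has no AES, so `toy_encrypt` and `toy_decrypt` are PLACEHOLDER ciphers (XOR with an HMAC-SHA256 keystream, no integrity protection) that only stand in for AES-GCM or a KMS call. The point of the sketch is the key hierarchy, not the cipher.

```python
import hashlib
import hmac
import secrets
from typing import NamedTuple


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    blocks = [
        hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        for counter in range((length + 31) // 32)
    ]
    return b"".join(blocks)[:length]


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    # PLACEHOLDER, not secure for real use: swap in AES-GCM or a KMS call.
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


class Envelope(NamedTuple):
    wrapped_data_key: bytes  # data key encrypted under the master key
    ciphertext: bytes        # record encrypted under the data key


def envelope_encrypt(master_key: bytes, plaintext: bytes) -> Envelope:
    data_key = secrets.token_bytes(32)              # fresh data key per record
    ciphertext = toy_encrypt(data_key, plaintext)
    wrapped = toy_encrypt(master_key, data_key)     # in practice: a KMS encrypt call
    return Envelope(wrapped, ciphertext)            # plaintext data key is discarded


def envelope_decrypt(master_key: bytes, envelope: Envelope) -> bytes:
    data_key = toy_decrypt(master_key, envelope.wrapped_data_key)
    return toy_decrypt(data_key, envelope.ciphertext)
```

Rotating the master key only requires re-wrapping the small data keys, not re-encrypting every record.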

Secrets Management

HashiCorp Vault

+----------+                          +-------+
| Service  |---(1) Authenticate------>| Vault |
|          |   (AppRole, K8s SA,      |       |
|          |    AWS IAM)              |       |
|          |<--(2) Lease + Secret-----|       |
|          |   (DB creds, API key,    |       |
|          |    TLS cert)             |       |
+----------+                          +-------+
                                         |
                                    [Auto-rotate]
                                    [Audit log]
                                    [Lease expiry]

Never do:

Secrets in source code or config files committed to git
Secrets in environment variables without a secrets manager
Shared secrets across environments (dev key = prod key)

Common Attack Vectors in Distributed Systems

Attack	Description	Mitigation
Man-in-the-Middle (MITM)	Attacker intercepts communication between services	mTLS, certificate pinning
Replay attack	Attacker captures and re-sends a valid request	Nonces, timestamps, idempotency keys
Token theft	Stolen JWT/session used to impersonate user	Short expiry, token binding, refresh rotation
Confused deputy	Service tricked into misusing its authority	Scoped tokens, request validation
Privilege escalation	User gains higher permissions than intended	Least privilege, input validation on role claims
Supply chain attack	Compromised dependency/container image	Image signing, SBOMs, dependency scanning
Insider threat	Malicious employee with access	Audit logging, least privilege, secrets rotation
DNS hijacking	Traffic redirected to malicious server	DNSSEC, certificate transparency
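As one worked example from the table, the replay-attack mitigation (nonce plus timestamp) can be sketched like this. `ReplayGuard`, the message layout, and the 300-second freshness window are assumptions; the nonce set would normally live in a shared cache with a TTL:

```python
import hashlib
import hmac
from typing import Set

MAX_SKEW_SECONDS = 300  # assumed freshness window


class ReplayGuard:
    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self._seen_nonces: Set[str] = set()

    def sign(self, body: bytes, nonce: str, timestamp: int) -> str:
        # The signature covers the timestamp and nonce as well as the body,
        # so an attacker cannot swap in fresh values.
        message = f"{timestamp}.{nonce}.".encode("utf-8") + body
        return hmac.new(self._secret, message, hashlib.sha256).hexdigest()

    def accept(self, body: bytes, nonce: str, timestamp: int, signature: str, now: int) -> bool:
        if not hmac.compare_digest(self.sign(body, nonce, timestamp), signature):
            return False  # tampered request or wrong key
        if abs(now - timestamp) > MAX_SKEW_SECONDS:
            return False  # too old, or too far in the future
        if nonce in self._seen_nonces:
            return False  # same request seen before: replay
        self._seen_nonces.add(nonce)
        return True
```

The timestamp bounds how long nonces must be remembered: anything older than the window is rejected before the nonce lookup matters.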

Interview Tips

JWT trade-offs are a classic question. Know the revocation problem and when to use opaque tokens instead.
mTLS for service-to-service is expected at FAANG. Mention service mesh (Istio) for automated cert management.
Zero trust > perimeter security. Show you think beyond firewalls.
"Where do you store secrets?" Never in code. Vault, KMS, or managed secrets service.
OAuth 2.0 flows come up in API design. Know which flow for which client type.
Encryption layers -- mention all three (rest, transit, E2E) and when each applies.

Common Interview Questions

"How do microservices authenticate with each other?"
"Design an authentication system for a multi-tenant SaaS"
"What happens when a JWT is stolen? How do you mitigate?"
"Explain the difference between OAuth 2.0 and OpenID Connect"
"How would you implement fine-grained authorization at scale?"

Resources

DDIA Chapter 4: Encoding and Evolution (touches on security of data formats)
OAuth 2.0 Simplified - Aaron Parecki
Zanzibar Paper: Google's global authorization system
Auth0 blog: "Which OAuth 2.0 Flow Should I Use?"
HashiCorp Vault documentation
NIST Zero Trust Architecture (SP 800-207)
OWASP Top 10 for APIs
