What Are LLMs (Large Language Models)
The Simple Explanation
An LLM is a next-token prediction machine. Given some text, it predicts what comes next — like autocomplete, but trained on a huge slice of the internet and absurdly good at it.
Input: "The capital of France is"
Model: P(Paris) = 95%, P(Lyon) = 1.5%, P(a) = 0.8%, ...
Output: "Paris"
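That prediction-then-pick step can be sketched in a few lines of Python. The probability table below is a toy stand-in for a real model's output, not actual model numbers:

```python
import random

# Toy next-token distribution for "The capital of France is".
# The numbers are illustrative, not from a real model.
next_token_probs = {"Paris": 0.95, "Lyon": 0.015, "a": 0.008, "Berlin": 0.002}

def greedy_decode(probs):
    """Pick the single most likely token (what 'temperature 0' does)."""
    return max(probs, key=probs.get)

def sample_decode(probs):
    """Sample a token in proportion to its probability (adds variety)."""
    tokens = list(probs)
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

print(greedy_decode(next_token_probs))  # → Paris
```

Chat models run a loop like this thousands of times, appending each chosen token to the input and predicting again.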
That's it. Every conversation, every code generation, every essay — it's all next-token prediction happening thousands of times in sequence.
How They're Built
Step 1: Architecture — The Transformer
The core innovation (Google, 2017 — "Attention Is All You Need" paper):
Input Text → Tokenize → Embeddings → [Transformer Layers × N] → Probabilities → Next Token
                                             ↓
                                     Self-Attention:
                                     "Which words should I pay attention to
                                      when predicting the next one?"
(N = number of stacked layers; GPT-3, for example, uses 96.)
Self-attention is the key mechanism: for each token, the model learns which OTHER tokens in the input are relevant. When processing "The cat sat on the ___", attention helps the model focus on "cat" and "sat" to predict "mat".
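That "focus on the relevant tokens" step is, at its core, a weighted average. Here is a minimal scaled dot-product attention in plain Python (vectors as lists, no ML library) — a sketch of the mechanism, not a production implementation:

```python
import math

def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query scores every key,
    the scores become weights via softmax, and the output is the
    weighted average of the value vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# A query that strongly matches the first key attends almost only to it:
out = attention([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]])
print(out)  # ≈ [[1.0, 0.0]]
```

In a real transformer, queries, keys, and values are all learned projections of the same token embeddings — which is what makes it *self*-attention.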
Step 2: Pre-Training
Feed the model trillions of tokens from the internet:
- Books, Wikipedia, code repositories, web pages, academic papers
- Training objective: predict the next token (self-supervised — no human labels needed)
- Cost: $10M-$100M+ in compute (thousands of GPUs for months)
- Result: a foundation model with broad knowledge
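"Self-supervised" means the labels fall out of the text itself: slide a window over the token stream and the next token is the target. A sketch of how raw text becomes training pairs:

```python
def make_training_pairs(tokens, context_len=4):
    """Each window of `context_len` tokens predicts the token that
    follows it. No human labels needed — the text supplies its own."""
    pairs = []
    for i in range(len(tokens) - context_len):
        context = tokens[i : i + context_len]
        target = tokens[i + context_len]
        pairs.append((context, target))
    return pairs

# Splitting on whitespace stands in for a real tokenizer here.
for context, target in make_training_pairs("the cat sat on the mat".split()):
    print(context, "->", target)
```

Real pre-training does exactly this, just with subword tokens, context windows of thousands of tokens, and trillions of examples.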
Step 3: Instruction Tuning (Fine-Tuning)
The raw pre-trained model just predicts text — it doesn't follow instructions well.
Fine-tune on instruction/response pairs:
Instruction: "Explain quantum computing in simple terms"
Response: "Quantum computing uses quantum bits (qubits) that can be..."
Now the model understands "when a human asks X, respond helpfully."
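Under the hood this is still next-token prediction — each pair is serialized into one string with special markers and trained on as usual. The marker tokens below are made up for illustration; real chat templates differ per model family:

```python
def format_example(instruction, response):
    """Serialize an instruction/response pair into one training string.
    <|user|>, <|assistant|>, <|end|> are hypothetical marker tokens."""
    return f"<|user|>\n{instruction}\n<|assistant|>\n{response}\n<|end|>"

print(format_example(
    "Explain quantum computing in simple terms",
    "Quantum computing uses quantum bits (qubits) that can be...",
))
```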
Step 4: RLHF (Reinforcement Learning from Human Feedback)
Humans rank model outputs from best to worst. Train a reward model from rankings. Use RL to make the LLM produce outputs the reward model scores highly.
This is how models become helpful, harmless, and honest — not just good at text prediction.
Pre-trained model → knows a lot but rambles
+ Instruction tuning → follows instructions
+ RLHF → actually helpful and safe
= Claude, GPT-4, etc.
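The reward model in the RLHF step is commonly trained with a pairwise ranking loss: for each human-ranked pair of outputs, push the preferred one's score above the other's. A sketch of that Bradley–Terry-style loss:

```python
import math

def reward_ranking_loss(score_preferred, score_rejected):
    """Pairwise ranking loss: -log(sigmoid(margin)).
    Near zero when the preferred output scores much higher;
    large when the ranking comes out backwards."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(reward_ranking_loss(2.0, -1.0), 3))  # small: ranking is right
print(round(reward_ranking_loss(-1.0, 2.0), 3))  # large: ranking is wrong
```

Once trained, the reward model scores fresh LLM outputs, and RL nudges the LLM toward outputs that score well.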
What Parameters Mean
| Model | Parameters | Analogy |
|---|---|---|
| GPT-2 | 1.5B | Bicycle |
| Llama 3 8B | 8B | Car |
| Claude Sonnet | ~100B+ (estimated) | Airplane |
| GPT-4 / Claude Opus | ~400B-1T+ (estimated) | Rocket |
Parameters = the learned weights in the neural network. More parameters = more capacity to store patterns and knowledge. But also: more compute, more memory, more cost.
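The cost side is easy to estimate: holding the weights alone takes parameters × bytes per parameter. A back-of-the-envelope helper, assuming 2-byte (fp16/bf16) weights and ignoring activations, KV cache, and optimizer state:

```python
def weights_memory_gb(num_params, bytes_per_param=2):
    """Memory for the raw weights alone at 2 bytes each (fp16/bf16)."""
    return num_params * bytes_per_param / 1e9

for name, params in [("GPT-2", 1.5e9), ("Llama 3 8B", 8e9), ("a ~400B model", 400e9)]:
    print(f"{name}: ~{weights_memory_gb(params):.0f} GB of weights")
```

This is why an 8B model fits on one consumer GPU while frontier-scale models need racks of them.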
What LLMs Can and Can't Do
Can Do Well
- Generate fluent, coherent text
- Understand and follow complex instructions
- Write, debug, and explain code
- Translate between languages
- Summarize documents
- Reason through problems (with chain-of-thought)
- Roleplay, brainstorm, creative writing
Can't Do (Fundamental Limits)
| Limitation | Why |
|---|---|
| Hallucination | The model generates "plausible", not "true" — and sounds equally confident either way |
| Knowledge cutoff | Training data has a date — model doesn't know what happened after |
| No real-time info | Can't browse the web (unless given tools) |
| No persistent memory | Each conversation starts fresh (unless explicitly given context) |
| Math errors | Tokens aren't numbers — model "predicts" math results rather than calculating |
| Can't learn from conversations | Talking to it doesn't change its weights |
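The math-errors row is why production systems often route arithmetic to a real calculator (a "tool") instead of letting the model predict digits. A toy version of that routing — the regex rule here is invented for illustration:

```python
import re

def answer(query):
    """Toy router: if the query is plain two-operand arithmetic,
    compute it exactly; otherwise hand it off to the LLM."""
    m = re.fullmatch(r"\s*(\d+)\s*([+*\-])\s*(\d+)\s*", query)
    if m:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        return {"+": a + b, "-": a - b, "*": a * b}[op]
    return "(send to LLM)"

print(answer("123456 * 789"))  # exact, every time
```

Real tool use works the same way in spirit: the LLM decides *when* to call the tool, and the tool supplies the exact answer.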
The Hallucination Problem
You: "Who won the 2028 Olympics?"
LLM: "The 2028 Olympics were held in Los Angeles..."
(confidently generates plausible-sounding but potentially wrong details)
The model doesn't "know" things — it generates text that looks like it should follow your input. Sometimes what looks right IS right. Sometimes it isn't.
The Current Landscape (2025-2026)
| Company | Models | Known For |
|---|---|---|
| Anthropic | Claude Opus 4, Sonnet 4, Haiku | Safety, coding, long context (1M tokens) |
| OpenAI | GPT-4o, o1, o3 | Pioneered the field, broad capabilities |
| Google | Gemini 2.0, 2.5 | Multimodal, huge context (2M tokens), integrated with Google products |
| Meta | Llama 3, 4 | Open-source leader |
| Mistral | Mistral Large, Codestral | European, efficient, strong at code |
| DeepSeek | DeepSeek-V3, R1 | Chinese, very cost-efficient, reasoning |
Key Insight for Using AI Well
LLMs are tools for augmenting your thinking, not replacing it.
- They're best when you can verify the output
- They're dangerous when you trust blindly
- The better YOUR prompt, the better THEIR output
- Treat them like a very fast, very knowledgeable junior dev — verify everything
Resources
- 🎥 3Blue1Brown — But What Is a Neural Network?
- 🎥 3Blue1Brown — Attention in Transformers
- 📖 Andrej Karpathy — Intro to LLMs (1hr talk)
- 🔗 The Illustrated Transformer
- 📖 "Attention Is All You Need" paper (Google, 2017)
Next: 02 - Tokens & Tokenization