Open Source Models
Why Open Source Matters
Closed-API models (Claude, GPT-4o) are powerful but come with trade-offs: your data leaves your machine, you pay per token, and you depend on a vendor. Open source models give you control.
┌─────────────────────────────────────────────────────────────┐
│ Reasons to use open source: │
│ │
│ 1. PRIVACY — Data never leaves your machine │
│ 2. COST — Free inference after hardware investment │
│ 3. CUSTOMIZATION — Fine-tune on your own data │
│ 4. NO VENDOR LOCK-IN — Switch models freely │
│ 5. OFFLINE — Works without internet │
│ 6. COMPLIANCE — Meet data residency requirements │
└─────────────────────────────────────────────────────────────┘
Major Open Source Models
Llama (Meta)
The most popular open source model family. Meta releases the weights under a community license that permits most research and commercial use.
| Model | Parameters | Context | Notes |
|---|---|---|---|
| Llama 3.3 70B | 70B | 128K | Best open-source general model |
| Llama 3.2 90B | 90B | 128K | Multimodal (vision) |
| Llama 3.1 405B | 405B | 128K | Largest open model, rivals GPT-4 class |
| Llama 3.1 8B | 8B | 128K | Great for local use, fast |
| Llama 3.2 1B/3B | 1B/3B | 128K | Mobile and edge devices |
Why it's popular: Best balance of size, capability, and license terms. Huge community with fine-tuned variants for every use case.
Mistral
European AI lab. Known for efficient, high-performing models.
| Model | Parameters | Notes |
|---|---|---|
| Mistral Large | 123B | Competitive with GPT-4o class |
| Mixtral 8x22B | 141B (MoE) | Mixture-of-Experts, only ~39B active per token |
| Mistral 7B | 7B | Punches above its weight |
| Codestral | 22B | Specialized for code generation |
| Mistral Small | 22B | Efficient for common tasks |
What's MoE (Mixture of Experts)? Each MoE layer contains several expert sub-networks (8 in Mixtral), and a learned router sends every token to the two most relevant experts. Mixtral 8x22B has 141B total parameters but activates only ~39B per token, giving near-large-model quality at medium-model cost.
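The routing step at the heart of an MoE layer can be sketched in a few lines of pure Python. This is a toy illustration with made-up router scores, not a real model: a router scores all experts for one token, keeps the top-k, and renormalizes their weights.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, k=2):
    """Pick the top-k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# One token's router scores over 8 experts: only 2 experts actually run.
chosen = route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
print(chosen)  # experts 1 and 4 carry this token
```

Because only k experts execute per token, compute scales with the active parameters, while total parameters (and memory) scale with all experts combined.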
DeepSeek
Chinese AI lab. Known for extremely cost-efficient training and strong reasoning.
| Model | Parameters | Notes |
|---|---|---|
| DeepSeek-V3 | 671B (MoE) | Strong general model, 37B active |
| DeepSeek-R1 | 671B (MoE) | Reasoning model (like o1), shows chain-of-thought |
| DeepSeek-Coder-V2 | 236B (MoE) | Specialized for code |
Why notable: DeepSeek-R1 was trained at a fraction of the cost of comparable models and performs competitively with o1 on reasoning benchmarks. Open weights.
Gemma (Google)
Google's open models. Small but well-optimized.
| Model | Parameters | Notes |
|---|---|---|
| Gemma 2 27B | 27B | Best in its size class |
| Gemma 2 9B | 9B | Good general purpose |
| Gemma 2 2B | 2B | Tiny, good for experimentation |
Best for: Developers who want a small, capable model from a trusted source.
Phi (Microsoft)
Microsoft's small language models. Surprisingly capable for their size.
| Model | Parameters | Notes |
|---|---|---|
| Phi-4 | 14B | Strong reasoning for its size |
| Phi-3.5 Mini | 3.8B | Runs on phones, beats many 7B models |
| Phi-3.5 MoE | 42B (MoE) | Mixture of experts, efficient |
Why notable: Phi models demonstrate that training data quality matters as much as raw size. On reasoning and coding benchmarks they often outperform models several times larger.
Qwen (Alibaba)
Alibaba's model family. Strong multilingual support and competitive performance.
| Model | Parameters | Notes |
|---|---|---|
| Qwen2.5 72B | 72B | Competitive with Llama 3.1 70B |
| Qwen2.5-Coder 32B | 32B | Very strong code model |
| Qwen2.5 7B | 7B | Good for local use |
| QwQ-32B | 32B | Reasoning model (like R1) |
Best for: Multilingual applications, especially Chinese-English. Qwen-Coder is one of the best open code models.
Running Models Locally
Ollama (Recommended for Beginners)
The easiest way to run open models on your machine:
```bash
# Install
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model (downloads automatically on first use)
ollama run llama3.2        # 3B model, ~2GB download
ollama run codestral       # 22B code model
ollama run deepseek-r1:8b  # 8B reasoning model

# Use as an API (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
Why Ollama: One command to download and run. Exposes an OpenAI-compatible API, so tools like Cline and Continue work with it out of the box.
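Because the API is OpenAI-compatible, any HTTP client works. A minimal stdlib-only sketch, assuming Ollama is running on its default port with `llama3.2` pulled (the actual `urlopen` call needs the server up; building the request does not):

```python
import json
import urllib.request

def build_chat_request(model, prompt, base_url="http://localhost:11434/v1"):
    """Build an OpenAI-style chat completion request for a local Ollama server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def chat(model, prompt):
    """Send the request; requires `ollama serve` running with the model pulled."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

req = build_chat_request("llama3.2", "Hello!")
print(req.full_url)  # http://localhost:11434/v1/chat/completions
```

Swapping `base_url` (and an API key header) is all it takes to point the same code at a hosted provider, which is the practical meaning of "no vendor lock-in."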
LM Studio
Desktop app with a GUI. Good for non-terminal users:
- Browse and download models from a catalog
- Chat interface built in
- Local API server (OpenAI-compatible)
- Adjust parameters (temperature, context length) visually
llama.cpp
The underlying engine that Ollama and many other tools use. For maximum control:
```bash
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Run a model
./main -m models/llama-3.1-8b-q4_K_M.gguf \
  -p "Write a Python function to sort a list" \
  -n 256
```
When to use: Custom deployments, embedding in applications, maximum performance tuning.
vLLM
High-performance serving for production deployments:
```bash
pip install vllm

# Serve a model with an OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000
```
When to use: Serving models to multiple users, production APIs, when you need high throughput.
Hardware Requirements
| Model Size | RAM Needed (FP16) | RAM Needed (Q4 quantized) | GPU |
|---|---|---|---|
| 1-3B | 4-8 GB | 2-4 GB | Optional (CPU works) |
| 7-8B | 16 GB | 6-8 GB | 8GB VRAM recommended |
| 13B | 32 GB | 10-12 GB | 12GB VRAM recommended |
| 34B | 68 GB | 20-24 GB | 24GB VRAM (RTX 4090) |
| 70B | 140 GB | 40-48 GB | 2x 24GB GPUs or 48GB+ |
| 405B | 810 GB | 200+ GB | Multi-GPU server |
Practical guidelines:
- MacBook with 16GB RAM: Run 7-8B models comfortably
- MacBook with 32GB RAM: Run 13B models, try 34B quantized
- MacBook with 64GB+ RAM: Run 70B models quantized
- Gaming PC with RTX 4090 (24GB): Run up to 34B models with GPU acceleration
- No GPU? CPU inference works for 7-8B models, just slower (~5-15 tokens/sec)
Quantization
Quantization reduces model size by using lower precision numbers. It trades a small amount of quality for significant memory and speed gains.
- Full precision (FP16): 16 bits per parameter. 70B model ≈ 140 GB. Quality: 100% (baseline).
- INT8 quantization: 8 bits per parameter. 70B model ≈ 70 GB. Quality: ~98% of original.
- INT4 quantization: 4 bits per parameter. 70B model ≈ 35 GB. Quality: ~93-95% of original.
Common Quantization Formats
| Format | Size | Quality | When to Use |
|---|---|---|---|
| FP16 | 2 bytes/param | 100% (baseline) | When you have enough VRAM |
| Q8_0 | 1 byte/param | ~98% | Good balance of quality and size |
| Q5_K_M | ~0.65 bytes/param | ~96% | Sweet spot for most users |
| Q4_K_M | ~0.5 bytes/param | ~93-95% | Best for limited hardware |
| Q3_K_M | ~0.4 bytes/param | ~90% | When you really need it to fit |
| Q2_K | ~0.3 bytes/param | ~85% | Quality drops noticeably |
Rule of thumb: Q4_K_M is the sweet spot. Below Q3, quality degrades noticeably. Prefer running a smaller model at higher quantization over a larger model at extreme quantization.
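The arithmetic behind these tables is simple enough to encode. A rough estimator using the bytes-per-parameter figures from the format table (weights only; the KV cache and runtime overhead in the hardware table come on top, which is why it lists more RAM than raw weight size):

```python
# Bytes per parameter for common quantization formats (from the table above).
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q5_K_M": 0.65,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.4,
    "Q2_K": 0.3,
}

def weight_gb(params_billions, fmt):
    """Approximate weight memory in GB: parameter count times bytes/param."""
    return params_billions * BYTES_PER_PARAM[fmt]

for fmt in ("FP16", "Q8_0", "Q4_K_M"):
    print(f"70B @ {fmt}: ~{weight_gb(70, fmt):.0f} GB")
# 70B @ FP16: ~140 GB
# 70B @ Q8_0: ~70 GB
# 70B @ Q4_K_M: ~35 GB
```

The same function shows why an 8B model at Q4_K_M (~4 GB of weights) fits comfortably on a 16 GB laptop while a 70B model does not.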
When to Use Open Source vs API
| Consideration | Open Source | API (Claude, GPT-4o) |
|---|---|---|
| Privacy | Data stays local | Data sent to provider |
| Cost at scale | Fixed hardware cost | Per-token pricing adds up |
| Cost at low volume | Hardware investment | Pay only for what you use |
| Quality (general) | Good, not best | State-of-the-art |
| Quality (after fine-tune) | Can exceed APIs for narrow tasks | No fine-tuning on frontier models |
| Speed to start | Need hardware + setup | API key and go |
| Offline use | Works anywhere | Needs internet |
| Customization | Full control | Limited to prompting |
| Maintenance | You manage updates, hardware | Provider handles everything |
| Compliance | Full data control | Depends on provider's DPA |
Decision Framework
Is privacy critical (healthcare, legal, defense)?
YES → Open source (local deployment)
Is the task narrow and high-volume (classification, extraction)?
YES → Fine-tuned open source model (cheaper at scale)
Do you need maximum quality on complex reasoning?
YES → API (Claude Opus, o3)
Are you prototyping or low volume?
YES → API (faster to start, pay per use)
Do you need to work offline?
YES → Open source
Budget for 1M+ API calls/month?
NO → Open source
YES → API (if quality matters more than cost)
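The checklist above can be encoded as a short function for teams that want the decision written down. The question order and answers are the chapter's; the function and parameter names are illustrative, not a standard API:

```python
def recommend_deployment(*, privacy_critical=False, narrow_high_volume=False,
                         needs_max_quality=False, prototyping=False,
                         needs_offline=False, can_fund_heavy_api_use=True):
    """Walk the decision framework above in order; first match wins."""
    if privacy_critical:
        return "open source (local deployment)"
    if narrow_high_volume:
        return "fine-tuned open source"      # cheaper at scale
    if needs_max_quality:
        return "api"                          # frontier models
    if prototyping:
        return "api"                          # faster to start, pay per use
    if needs_offline:
        return "open source"
    if not can_fund_heavy_api_use:            # 1M+ calls/month, no budget
        return "open source"
    return "api"
```

First-match ordering mirrors the framework: hard constraints (privacy) dominate, quality and convenience questions come next, and cost breaks the tie.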
The Practical Starting Point
If you're new to open source models:
- Install Ollama — one command, works on Mac/Linux/Windows
- Run `ollama run llama3.2` — download and chat with a 3B model
- Connect to your tools — Cline, Continue, or any OpenAI-compatible client can point to `localhost:11434`
- Try different models — `ollama run codestral` for code, `ollama run deepseek-r1:8b` for reasoning
- Scale up if needed — bigger models, GPU acceleration, vLLM for production
Resources
- 🔗 Ollama
- 🔗 LM Studio
- 🔗 llama.cpp
- 🔗 vLLM
- 🔗 Hugging Face Open LLM Leaderboard
- 🔗 Meta Llama
- 🔗 Mistral AI
- 🔗 DeepSeek
- 📄 GGUF Format Explanation
Previous: 12 - AI Coding Assistants | Next: 14 - AI APIs & SDKs