Open Source Models

Why Open Source Matters

Closed-API models (Claude, GPT-4o) are powerful but come with trade-offs: your data leaves your machine, you pay per token, and you depend on a vendor. Open source models give you control.

┌─────────────────────────────────────────────────────────────────┐
│ Reasons to use open source:                                     │
│                                                                 │
│ 1. PRIVACY           — Data never leaves your machine           │
│ 2. COST              — Free inference after hardware investment │
│ 3. CUSTOMIZATION     — Fine-tune on your own data               │
│ 4. NO VENDOR LOCK-IN — Switch models freely                     │
│ 5. OFFLINE           — Works without internet                   │
│ 6. COMPLIANCE        — Meet data residency requirements         │
└─────────────────────────────────────────────────────────────────┘

Major Open Source Models

Llama (Meta)

The most popular open source model family. Meta releases weights freely for research and commercial use.

| Model | Parameters | Context | Notes |
|-------|------------|---------|-------|
| Llama 3.3 70B | 70B | 128K | Best open-source general model |
| Llama 3.2 90B | 90B | 128K | Multimodal (vision) |
| Llama 3.1 405B | 405B | 128K | Largest open model, rivals GPT-4 class |
| Llama 3.1 8B | 8B | 128K | Great for local use, fast |
| Llama 3.2 1B/3B | 1B/3B | 128K | Mobile and edge devices |

Why it's popular: Best balance of size, capability, and license terms. Huge community with fine-tuned variants for every use case.

Mistral

European AI lab. Known for efficient, high-performing models.

| Model | Parameters | Notes |
|-------|------------|-------|
| Mistral Large | 123B | Competitive with GPT-4o class |
| Mixtral 8x22B | 141B (MoE) | Mixture-of-Experts, only ~39B active per token |
| Mistral 7B | 7B | Punches above its weight |
| Codestral | 22B | Specialized for code generation |
| Mistral Small | 22B | Efficient for common tasks |

What's MoE (Mixture of Experts)? Instead of one dense network, the model contains several specialist sub-networks ("experts") plus a small router that picks the most relevant few for each token. Mixtral 8x22B has roughly 141B total parameters (its eight 22B experts share attention layers, so the total is less than a literal 8 × 22B) but activates only about 39B per token. The result is near-large-model quality at medium-model speed.
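The routing idea can be sketched in a few lines. This is a toy illustration under simplifying assumptions (scalar tokens, experts as plain functions); real MoE layers use a learned linear router over hidden-state vectors, with feed-forward networks as the experts:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, experts, router, k=2):
    """Route one token to its top-k experts and mix their outputs.

    `router(token)` returns one score per expert. Only the k
    highest-scoring experts actually run, so compute per token
    scales with k rather than with the number of experts.
    """
    scores = softmax(router(token))
    top_k = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    norm = sum(scores[i] for i in top_k)
    # Weighted mix of the chosen experts, renormalized over the top-k.
    return sum(scores[i] / norm * experts[i](token) for i in top_k)
```

With 8 experts and k=2, six experts never run for a given token, which is where headline numbers like "671B total, 37B active" come from.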

DeepSeek

Chinese AI lab. Known for extremely cost-efficient training and strong reasoning.

| Model | Parameters | Notes |
|-------|------------|-------|
| DeepSeek-V3 | 671B (MoE) | Strong general model, 37B active |
| DeepSeek-R1 | 671B (MoE) | Reasoning model (like o1), shows chain-of-thought |
| DeepSeek-Coder-V2 | 236B (MoE) | Specialized for code |

Why notable: DeepSeek-R1 was trained at a fraction of the cost of comparable models and performs competitively with o1 on reasoning benchmarks. Open weights.

Gemma (Google)

Google's open models. Small but well-optimized.

| Model | Parameters | Notes |
|-------|------------|-------|
| Gemma 2 27B | 27B | Best in its size class |
| Gemma 2 9B | 9B | Good general purpose |
| Gemma 2 2B | 2B | Tiny, good for experimentation |

Best for: Developers who want a small, capable model from a trusted source.

Phi (Microsoft)

Microsoft's small language models. Surprisingly capable for their size.

| Model | Parameters | Notes |
|-------|------------|-------|
| Phi-4 | 14B | Strong reasoning for its size |
| Phi-3.5 Mini | 3.8B | Runs on phones, beats many 7B models |
| Phi-3.5 MoE | 42B (MoE) | Mixture of experts, efficient |

Why notable: Phi models demonstrate that training data quality matters as much as raw size. On many benchmarks they match or outperform models several times larger.

Qwen (Alibaba)

Alibaba's model family. Strong multilingual support and competitive performance.

| Model | Parameters | Notes |
|-------|------------|-------|
| Qwen2.5 72B | 72B | Competitive with Llama 3.1 70B |
| Qwen2.5-Coder 32B | 32B | Very strong code model |
| Qwen2.5 7B | 7B | Good for local use |
| QwQ-32B | 32B | Reasoning model (like R1) |

Best for: Multilingual applications, especially Chinese-English. Qwen-Coder is one of the best open code models.


Running Models Locally

Ollama (Recommended for Beginners)

The easiest way to run open models on your machine:

```bash
# Install
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model (downloads automatically on first use)
ollama run llama3.2        # 3B model, ~2GB download
ollama run codestral       # 22B code model
ollama run deepseek-r1:8b  # 8B reasoning model

# Use as an API (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Why Ollama: One command to download and run. Exposes an OpenAI-compatible API, so tools like Cline and Continue work with it out of the box.
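Because the endpoint is OpenAI-compatible, any HTTP client can talk to it. Here is a minimal Python sketch using only the standard library; the helper names (`build_request`, `chat`) are our own, and it assumes Ollama is running on its default port 11434 with `llama3.2` already pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's default port

def build_request(model: str, messages: list) -> bytes:
    """Encode the JSON body for an OpenAI-compatible chat completion call."""
    return json.dumps({"model": model, "messages": messages}).encode()

def chat(model: str, prompt: str) -> str:
    """Send one user message to the local server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, [{"role": "user", "content": prompt}]),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Example (requires a running Ollama server and a pulled model):
#   print(chat("llama3.2", "Hello!"))
```

Pointing `OLLAMA_URL` at an LM Studio or vLLM server works the same way; that shared API shape is what makes these tools interchangeable.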

LM Studio

Desktop app with a GUI. Good for non-terminal users:

  • Browse and download models from a catalog
  • Chat interface built in
  • Local API server (OpenAI-compatible)
  • Adjust parameters (temperature, context length) visually

llama.cpp

The underlying engine that Ollama and many other tools use. For maximum control:

```bash
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Run a model (the binary is llama-cli in recent builds, main in older ones)
./llama-cli -m models/llama-3.1-8b-q4_K_M.gguf \
  -p "Write a Python function to sort a list" \
  -n 256
```

When to use: Custom deployments, embedding in applications, maximum performance tuning.

vLLM

High-performance serving for production deployments:

```bash
pip install vllm

# Serve a model with an OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000
```

When to use: Serving models to multiple users, production APIs, when you need high throughput.


Hardware Requirements

| Model Size | RAM Needed (FP16) | RAM Needed (Q4 quantized) | GPU |
|------------|-------------------|---------------------------|-----|
| 1-3B | 4-8 GB | 2-4 GB | Optional (CPU works) |
| 7-8B | 16 GB | 6-8 GB | 8GB VRAM recommended |
| 13B | 32 GB | 10-12 GB | 12GB VRAM recommended |
| 34B | 68 GB | 20-24 GB | 24GB VRAM (RTX 4090) |
| 70B | 140 GB | 40-48 GB | 2x 24GB GPUs or 48GB+ |
| 405B | 810 GB | 200+ GB | Multi-GPU server |

Practical guidelines:

  • MacBook with 16GB RAM: Run 7-8B models comfortably
  • MacBook with 32GB RAM: Run 13B models, try 34B quantized
  • MacBook with 64GB+ RAM: Run 70B models quantized
  • Gaming PC with RTX 4090 (24GB): Run up to 34B models with GPU acceleration
  • No GPU? CPU inference works for 7-8B models, just slower (~5-15 tokens/sec)

Quantization

Quantization reduces model size by using lower precision numbers. It trades a small amount of quality for significant memory and speed gains.

Full precision (FP16):  Each parameter = 16 bits
                        70B model = ~140 GB

INT8 quantization:      Each parameter = 8 bits
                        70B model = ~70 GB
                        Quality: ~98% of original

INT4 quantization:      Each parameter = 4 bits
                        70B model = ~35 GB
                        Quality: ~93-95% of original
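The arithmetic behind these figures is just parameters × bits ÷ 8. A quick sketch (weight memory only; real quantized files add a few percent for scale factors, and inference needs additional room for activations and the KV cache):

```python
def model_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight-only memory in GB: params x bits / 8 bits-per-byte."""
    return params_billions * bits_per_param / 8

# Reproduce the 70B figures above:
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B @ {label}: ~{model_memory_gb(70, bits):.0f} GB")
```

The same formula explains the hardware table: an 8B model at Q4_K_M (about 4.5 bits per parameter) needs roughly 4.5 GB for weights, which is why it fits on a 16GB laptop with room to spare.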

Common Quantization Formats

| Format | Size | Quality | When to Use |
|--------|------|---------|-------------|
| FP16 | 2 bytes/param | 100% (baseline) | When you have enough VRAM |
| Q8_0 | 1 byte/param | ~98% | Good balance of quality and size |
| Q5_K_M | ~0.65 bytes/param | ~96% | Sweet spot for most users |
| Q4_K_M | ~0.5 bytes/param | ~93-95% | Best for limited hardware |
| Q3_K_M | ~0.4 bytes/param | ~90% | When you really need it to fit |
| Q2_K | ~0.3 bytes/param | ~85% | Quality drops noticeably |

Rule of thumb: Q4_K_M (or Q5_K_M if you have a little headroom) is the sweet spot. Below Q3, quality degrades noticeably. Prefer running a smaller model at moderate quantization over a larger model at extreme quantization.


When to Use Open Source vs API

| Consideration | Open Source | API (Claude, GPT-4o) |
|---------------|-------------|----------------------|
| Privacy | Data stays local | Data sent to provider |
| Cost at scale | Fixed hardware cost | Per-token pricing adds up |
| Cost at low volume | Hardware investment | Pay only for what you use |
| Quality (general) | Good, not best | State-of-the-art |
| Quality (after fine-tune) | Can exceed APIs for narrow tasks | No fine-tuning on frontier models |
| Speed to start | Need hardware + setup | API key and go |
| Offline use | Works anywhere | Needs internet |
| Customization | Full control | Limited to prompting |
| Maintenance | You manage updates, hardware | Provider handles everything |
| Compliance | Full data control | Depends on provider's DPA |

Decision Framework

Is privacy critical (healthcare, legal, defense)?
  YES → Open source (local deployment)

Is the task narrow and high-volume (classification, extraction)?
  YES → Fine-tuned open source model (cheaper at scale)

Do you need maximum quality on complex reasoning?
  YES → API (Claude Opus, o3)

Are you prototyping or low volume?
  YES → API (faster to start, pay per use)

Do you need to work offline?
  YES → Open source

Budget for 1M+ API calls/month?
  NO → Open source
  YES → API (if quality matters more than cost)
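The framework above maps directly to code: a sequence of checks applied in order, first match wins. The function name and flags here are illustrative:

```python
def choose_deployment(
    privacy_critical: bool,     # healthcare, legal, defense
    narrow_high_volume: bool,   # classification, extraction at scale
    needs_max_quality: bool,    # complex reasoning
    prototyping: bool,          # early-stage or low volume
    needs_offline: bool,
    high_api_budget: bool,      # can afford 1M+ API calls/month
) -> str:
    """Encode the decision framework; checks run in the order listed."""
    if privacy_critical:
        return "open source (local deployment)"
    if narrow_high_volume:
        return "fine-tuned open source model"
    if needs_max_quality:
        return "API"
    if prototyping:
        return "API"
    if needs_offline:
        return "open source"
    return "API" if high_api_budget else "open source"
```

Note that the ordering matters: privacy trumps everything else, and quality only decides the outcome once privacy and volume are off the table.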

The Practical Starting Point

If you're new to open source models:

  1. Install Ollama — one command, works on Mac/Linux/Windows
  2. Run ollama run llama3.2 — download and chat with a small 3B model
  3. Connect to your tools — Cline, Continue, or any OpenAI-compatible client can point to localhost:11434
  4. Try different models — ollama run codestral for code, ollama run deepseek-r1:8b for reasoning
  5. Scale up if needed — bigger models, GPU acceleration, vLLM for production


Previous: 12 - AI Coding Assistants | Next: 14 - AI APIs & SDKs