Open Source Models

Why Open Source Matters

Closed-API models (Claude, GPT-4o) are powerful but come with trade-offs: your data leaves your machine, you pay per token, and you depend on a vendor. Open source models give you control.

┌─────────────────────────────────────────────────────────────────┐
│ Reasons to use open source:                                     │
│                                                                 │
│ 1. PRIVACY           — Data never leaves your machine           │
│ 2. COST              — Free inference after hardware investment │
│ 3. CUSTOMIZATION     — Fine-tune on your own data               │
│ 4. NO VENDOR LOCK-IN — Switch models freely                     │
│ 5. OFFLINE           — Works without internet                   │
│ 6. COMPLIANCE        — Meet data residency requirements         │
└─────────────────────────────────────────────────────────────────┘

Major Open Source Models

Llama (Meta)

The most popular open source model family. Meta releases weights freely for research and commercial use.

| Model | Parameters | Context | Notes |
|-------|------------|---------|-------|
| Llama 3.3 70B | 70B | 128K | Best open-source general model |
| Llama 3.2 90B | 90B | 128K | Multimodal (vision) |
| Llama 3.1 405B | 405B | 128K | Largest open model, rivals GPT-4 class |
| Llama 3.1 8B | 8B | 128K | Great for local use, fast |
| Llama 3.2 1B/3B | 1B/3B | 128K | Mobile and edge devices |

Why it's popular: Best balance of size, capability, and license terms. Huge community with fine-tuned variants for every use case.

Mistral

European AI lab. Known for efficient, high-performing models.

| Model | Parameters | Notes |
|-------|------------|-------|
| Mistral Large | 123B | Competitive with GPT-4o class |
| Mixtral 8x22B | 141B (MoE) | Mixture-of-Experts, only ~39B active per token |
| Mistral 7B | 7B | Punches above its weight |
| Codestral | 22B | Specialized for code generation |
| Mistral Small | 22B | Efficient for common tasks |

What's MoE (Mixture of Experts)? Instead of one dense network, the model contains several specialist sub-networks ("experts") plus a small router that picks the most relevant few for each token. Mixtral 8x22B has roughly 141B total parameters (its eight 22B experts share attention layers, so the total is less than a literal 8 × 22B) but activates only about 39B per token. The result is near-large-model quality at medium-model speed.
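The routing idea can be sketched in a few lines. This is a toy illustration under simplifying assumptions (scalar tokens, experts as plain functions); real MoE layers use a learned linear router over hidden-state vectors, with feed-forward networks as the experts:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, experts, router, k=2):
    """Route one token to its top-k experts and mix their outputs.

    `router(token)` returns one score per expert. Only the k
    highest-scoring experts actually run, so compute per token
    scales with k rather than with the number of experts.
    """
    scores = softmax(router(token))
    top_k = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    norm = sum(scores[i] for i in top_k)
    # Weighted mix of the chosen experts, renormalized over the top-k.
    return sum(scores[i] / norm * experts[i](token) for i in top_k)
```

With 8 experts and k=2, six experts never run for a given token, which is where headline numbers like "671B total, 37B active" come from.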

DeepSeek

Chinese AI lab. Known for extremely cost-efficient training and strong reasoning.

| Model | Parameters | Notes |
|-------|------------|-------|
| DeepSeek-V3 | 671B (MoE) | Strong general model, 37B active |
| DeepSeek-R1 | 671B (MoE) | Reasoning model (like o1), shows chain-of-thought |
| DeepSeek-Coder-V2 | 236B (MoE) | Specialized for code |

Why notable: DeepSeek-R1 was trained at a fraction of the cost of comparable models and performs competitively with o1 on reasoning benchmarks. Open weights.

Gemma (Google)

Google's open models. Small but well-optimized.

| Model | Parameters | Notes |
|-------|------------|-------|
| Gemma 2 27B | 27B | Best in its size class |
| Gemma 2 9B | 9B | Good general purpose |
| Gemma 2 2B | 2B | Tiny, good for experimentation |

Best for: Developers who want a small, capable model from a trusted source.

Phi (Microsoft)

Microsoft's small language models. Surprisingly capable for their size.

| Model | Parameters | Notes |
|-------|------------|-------|
| Phi-4 | 14B | Strong reasoning for its size |
| Phi-3.5 Mini | 3.8B | Runs on phones, beats many 7B models |
| Phi-3.5 MoE | 42B (MoE) | Mixture of experts, efficient |

Why notable: Phi models demonstrate that training data quality matters as much as raw size. On many benchmarks they match or outperform models several times larger.

Qwen (Alibaba)

Alibaba's model family. Strong multilingual support and competitive performance.

| Model | Parameters | Notes |
|-------|------------|-------|
| Qwen2.5 72B | 72B | Competitive with Llama 3.1 70B |
| Qwen2.5-Coder 32B | 32B | Very strong code model |
| Qwen2.5 7B | 7B | Good for local use |
| QwQ-32B | 32B | Reasoning model (like R1) |

Best for: Multilingual applications, especially Chinese-English. Qwen-Coder is one of the best open code models.


Running Models Locally

Ollama (Recommended for Beginners)

The easiest way to run open models on your machine:

```bash
# Install
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model (downloads automatically on first use)
ollama run llama3.2        # 3B model, ~2GB download
ollama run codestral       # 22B code model
ollama run deepseek-r1:8b  # 8B reasoning model

# Use as an API (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Why Ollama: One command to download and run. Exposes an OpenAI-compatible API, so tools like Cline and Continue work with it out of the box.
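Because the endpoint is OpenAI-compatible, any HTTP client can talk to it. Here is a minimal Python sketch using only the standard library; the helper names (`build_request`, `chat`) are our own, and it assumes Ollama is running on its default port 11434 with `llama3.2` already pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's default port

def build_request(model: str, messages: list) -> bytes:
    """Encode the JSON body for an OpenAI-compatible chat completion call."""
    return json.dumps({"model": model, "messages": messages}).encode()

def chat(model: str, prompt: str) -> str:
    """Send one user message to the local server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, [{"role": "user", "content": prompt}]),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Example (requires a running Ollama server and a pulled model):
#   print(chat("llama3.2", "Hello!"))
```

Pointing `OLLAMA_URL` at an LM Studio or vLLM server works the same way; that shared API shape is what makes these tools interchangeable.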

LM Studio

Desktop app with a GUI. Good for non-terminal users:

  • Browse and download models from a catalog
  • Chat interface built in
  • Local API server (OpenAI-compatible)
  • Adjust parameters (temperature, context length) visually

llama.cpp

The underlying engine that Ollama and many other tools use. For maximum control:

```bash
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make

# Run a model (the binary is llama-cli in recent builds, main in older ones)
./llama-cli -m models/llama-3.1-8b-q4_K_M.gguf \
  -p "Write a Python function to sort a list" \
  -n 256
```

When to use: Custom deployments, embedding in applications, maximum performance tuning.

vLLM

High-performance serving for production deployments:

```bash
pip install vllm

# Serve a model with an OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000
```

When to use: Serving models to multiple users, production APIs, when you need high throughput.


Hardware Requirements

| Model Size | RAM Needed (FP16) | RAM Needed (Q4 quantized) | GPU |
|------------|-------------------|---------------------------|-----|
| 1-3B | 4-8 GB | 2-4 GB | Optional (CPU works) |
| 7-8B | 16 GB | 6-8 GB | 8GB VRAM recommended |
| 13B | 32 GB | 10-12 GB | 12GB VRAM recommended |
| 34B | 68 GB | 20-24 GB | 24GB VRAM (RTX 4090) |
| 70B | 140 GB | 40-48 GB | 2x 24GB GPUs or 48GB+ |
| 405B | 810 GB | 200+ GB | Multi-GPU server |

Practical guidelines:

  • MacBook with 16GB RAM: Run 7-8B models comfortably
  • MacBook with 32GB RAM: Run 13B models, try 34B quantized
  • MacBook with 64GB+ RAM: Run 70B models quantized
  • Gaming PC with RTX 4090 (24GB): Run up to 34B models with GPU acceleration
  • No GPU? CPU inference works for 7-8B models, just slower (~5-15 tokens/sec)

Quantization

Quantization reduces model size by using lower precision numbers. It trades a small amount of quality for significant memory and speed gains.

Full precision (FP16):  Each parameter = 16 bits
                        70B model = ~140 GB

INT8 quantization:      Each parameter = 8 bits
                        70B model = ~70 GB
                        Quality: ~98% of original

INT4 quantization:      Each parameter = 4 bits
                        70B model = ~35 GB
                        Quality: ~93-95% of original
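The arithmetic behind these figures is just parameters × bits ÷ 8. A quick sketch (weight memory only; real quantized files add a few percent for scale factors, and inference needs additional room for activations and the KV cache):

```python
def model_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight-only memory in GB: params x bits / 8 bits-per-byte."""
    return params_billions * bits_per_param / 8

# Reproduce the 70B figures above:
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B @ {label}: ~{model_memory_gb(70, bits):.0f} GB")
```

The same formula explains the hardware table: an 8B model at Q4_K_M (about 4.5 bits per parameter) needs roughly 4.5 GB for weights, which is why it fits on a 16GB laptop with room to spare.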

Common Quantization Formats

| Format | Size | Quality | When to Use |
|--------|------|---------|-------------|
| FP16 | 2 bytes/param | 100% (baseline) | When you have enough VRAM |
| Q8_0 | 1 byte/param | ~98% | Good balance of quality and size |
| Q5_K_M | ~0.65 bytes/param | ~96% | Sweet spot for most users |
| Q4_K_M | ~0.5 bytes/param | ~93-95% | Best for limited hardware |
| Q3_K_M | ~0.4 bytes/param | ~90% | When you really need it to fit |
| Q2_K | ~0.3 bytes/param | ~85% | Quality drops noticeably |

Rule of thumb: Q4_K_M (or Q5_K_M if you have a little headroom) is the sweet spot. Below Q3, quality degrades noticeably. Prefer running a smaller model at moderate quantization over a larger model at extreme quantization.


When to Use Open Source vs API

| Consideration | Open Source | API (Claude, GPT-4o) |
|---------------|-------------|----------------------|
| Privacy | Data stays local | Data sent to provider |
| Cost at scale | Fixed hardware cost | Per-token pricing adds up |
| Cost at low volume | Hardware investment | Pay only for what you use |
| Quality (general) | Good, not best | State-of-the-art |
| Quality (after fine-tune) | Can exceed APIs for narrow tasks | No fine-tuning on frontier models |
| Speed to start | Need hardware + setup | API key and go |
| Offline use | Works anywhere | Needs internet |
| Customization | Full control | Limited to prompting |
| Maintenance | You manage updates, hardware | Provider handles everything |
| Compliance | Full data control | Depends on provider's DPA |

Decision Framework

Is privacy critical (healthcare, legal, defense)?
  YES → Open source (local deployment)

Is the task narrow and high-volume (classification, extraction)?
  YES → Fine-tuned open source model (cheaper at scale)

Do you need maximum quality on complex reasoning?
  YES → API (Claude Opus, o3)

Are you prototyping or low volume?
  YES → API (faster to start, pay per use)

Do you need to work offline?
  YES → Open source

Budget for 1M+ API calls/month?
  NO → Open source
  YES → API (if quality matters more than cost)
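The framework above maps directly to code: a sequence of checks applied in order, first match wins. The function name and flags here are illustrative:

```python
def choose_deployment(
    privacy_critical: bool,     # healthcare, legal, defense
    narrow_high_volume: bool,   # classification, extraction at scale
    needs_max_quality: bool,    # complex reasoning
    prototyping: bool,          # early-stage or low volume
    needs_offline: bool,
    high_api_budget: bool,      # can afford 1M+ API calls/month
) -> str:
    """Encode the decision framework; checks run in the order listed."""
    if privacy_critical:
        return "open source (local deployment)"
    if narrow_high_volume:
        return "fine-tuned open source model"
    if needs_max_quality:
        return "API"
    if prototyping:
        return "API"
    if needs_offline:
        return "open source"
    return "API" if high_api_budget else "open source"
```

Note that the ordering matters: privacy trumps everything else, and quality only decides the outcome once privacy and volume are off the table.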

The Practical Starting Point

If you're new to open source models:

  1. Install Ollama — one command, works on Mac/Linux/Windows
  2. Run ollama run llama3.2 — download and chat with a small 3B model
  3. Connect to your tools — Cline, Continue, or any OpenAI-compatible client can point to localhost:11434
  4. Try different models — ollama run codestral for code, ollama run deepseek-r1:8b for reasoning
  5. Scale up if needed — bigger models, GPU acceleration, vLLM for production


Previous: 12 - AI Coding Assistants | Next: 14 - AI APIs & SDKs