Models & Quantization
This page explains what AI models are, how they're measured, and the techniques that make running them locally possible.
Understanding Model Sizes
When you see a model name like "Llama 3.2 3B" or "Qwen 2.5 14B," the number refers to parameters: the numerical weights that encode the model's "knowledge."
| Size | Parameters | What It Means |
|---|---|---|
| 1B-3B | 1-3 billion | Lightweight, fast, runs on almost anything |
| 7B-8B | 7-8 billion | The sweet spot for most local use |
| 13B-14B | 13-14 billion | More capable, needs decent hardware |
| 30B-70B | 30-70 billion | High capability, requires powerful hardware |
| 100B+ | 100+ billion | Frontier models (GPT-4 class), requires data center hardware |
More parameters generally means better reasoning, broader knowledge, and more nuanced responses, but also more memory and slower generation.
For perspective: GPT-3.5 Turbo, the model that helped spark widespread AI adoption, is estimated at around 20 billion parameters. With quantization, models of this size can run on a capable home computer or laptop with 16GB+ RAM.
Docket's pre-installed models range from 1B to 14B, covering most use cases while remaining accessible on typical hardware.
What is Quantization?
Here's the key to running large models on consumer hardware: quantization.
A full-precision model stores each parameter as a 16-bit or 32-bit floating-point number, so a 7B parameter model at full precision needs ~14-28GB of RAM just to load (7 billion parameters × 2-4 bytes each), more than most laptops have.
Quantization compresses these numbers into smaller representations (8-bit, 4-bit, even 2-bit), dramatically reducing memory requirements with minimal quality loss.
| Precision | Bits per Parameter | 7B Model Size | Quality Impact |
|---|---|---|---|
| Full (FP16) | 16 bits | ~14 GB | Original quality |
| 8-bit (Q8) | 8 bits | ~7 GB | Minimal loss |
| 4-bit (Q4) | 4 bits | ~4 GB | Slight loss, usually unnoticeable |
| 2-bit (Q2) | 2 bits | ~2 GB | Noticeable degradation |
This is why you can run a "7 billion parameter" model on a laptop with 8GB RAM. Quantization makes it possible.
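To make the arithmetic concrete, here is a minimal sketch in plain Python that estimates a model's in-memory size from its parameter count and bits per parameter. Real GGUF files land a little higher than this estimate, because K-quants average slightly more than their nominal bit width and the file carries metadata.

```python
def approx_size_gb(num_params: float, bits_per_param: float) -> float:
    """Rough size in gigabytes: parameters * bits, converted to bytes, then to GB."""
    return num_params * bits_per_param / 8 / 1e9

# A 7B model at the precisions from the table above.
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{label}: ~{approx_size_gb(7e9, bits):.1f} GB")
# FP16: ~14.0 GB, Q8: ~7.0 GB, Q4: ~3.5 GB, Q2: ~1.8 GB
```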
Why It Works
Neural networks are surprisingly resilient to reduced precision. The relationships between parameters matter more than their exact values. A well-quantized 4-bit model often performs nearly as well as the original, while using a quarter of the memory.
The trade-off is that very aggressive quantization (2-bit) starts to degrade quality noticeably, especially for complex reasoning tasks.
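For intuition, here is a toy round trip in Python with NumPy. It is not the actual GGUF quantization scheme, just the basic idea: store small integers plus one scale factor per block of weights, and multiply back when the weights are needed.

```python
import numpy as np

def quantize_block(weights: np.ndarray, bits: int = 4):
    """Map a block of float weights to small signed integers plus one scale factor."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = float(np.abs(weights).max()) / qmax      # one shared scale per block
    q = np.round(weights / scale).astype(np.int8)    # the compressed representation
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integers and the scale."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=256).astype(np.float32)   # a typical block of weights

q, scale = quantize_block(weights, bits=4)
restored = dequantize_block(q, scale)
print("mean absolute error:", float(np.abs(weights - restored).mean()))
```

The restored values are not identical to the originals, but they preserve the relative pattern within the block, which is what the surrounding network actually depends on.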
What is GGUF?
GGUF (GPT-Generated Unified Format) is the standard file format for running models locally. When you download a model for local use, you're usually downloading a .gguf file.
The filename tells you about the model:
```
Qwen2.5-7B-Instruct-Q4_K_M.gguf
```
- Qwen2.5 — The model family
- 7B — 7 billion parameters
- Instruct — Fine-tuned to follow instructions (vs. base models)
- Q4_K_M — Quantization level (4-bit, K-quant, Medium quality)
Common quantization suffixes you'll see:
- Q4_K_M — 4-bit, good balance of size and quality (most common)
- Q5_K_M — 5-bit, slightly better quality, larger files
- Q8_0 — 8-bit, near-original quality, much larger files
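To show how mechanical the naming convention is, here is a small Python sketch that splits a conventionally named file into its parts. The regular expression is only an illustration of the pattern described above, not an official parser, and unusually named files will not match.

```python
import re

def parse_gguf_name(filename: str) -> dict:
    """Split a conventionally named GGUF filename into family, size, variant and quantization."""
    pattern = r"(?P<family>.+?)-(?P<size>\d+(?:\.\d+)?B)-(?P<variant>.+?)-(?P<quant>Q\d\S*)\.gguf$"
    match = re.match(pattern, filename)
    return match.groupdict() if match else {}

print(parse_gguf_name("Qwen2.5-7B-Instruct-Q4_K_M.gguf"))
# {'family': 'Qwen2.5', 'size': '7B', 'variant': 'Instruct', 'quant': 'Q4_K_M'}
```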
What is Hugging Face?
Hugging Face is the largest open repository for AI models. It's where researchers and developers share their models with the world. Hundreds of thousands of models are available to download, from small efficient models to large capable ones.
When people create quantized versions of models for local use, they typically upload them to Hugging Face. Docket includes a built-in browser to search and download models directly.
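Docket's built-in browser handles this for you, but if you ever want to fetch a GGUF file yourself, the huggingface_hub Python package can download individual files from a repository. A minimal sketch follows; the repository and filename are illustrative placeholders, so browse Hugging Face for the exact names you want.

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Both names below are illustrative placeholders, not specific recommendations.
path = hf_hub_download(
    repo_id="someone/Some-Model-7B-Instruct-GGUF",    # the repository hosting the quantized files
    filename="some-model-7b-instruct-q4_k_m.gguf",    # pick the quantization that fits your RAM
)
print("Downloaded to:", path)
```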
Popular Model Families
| Family | Created By | Known For |
|---|---|---|
| Llama | Meta | Well-rounded, widely supported |
| Qwen | Alibaba | Strong multilingual support, good at coding |
| Gemma | Google | Efficient, punches above its weight |
| Mistral | Mistral AI | Fast, capable |
| Phi | Microsoft | Remarkably capable for their small size |
| DeepSeek | DeepSeek | Strong reasoning and coding |
New models are released regularly, and the community quickly creates quantized versions for local use.
Why This Matters
Understanding these concepts helps you:
- Choose the right model — Match model size to your hardware
- Read model names — Know what "Q4_K_M" means when browsing
- Set expectations — Understand why a 7B model might not match GPT-4
- Troubleshoot issues — If a model runs slowly, you'll know why
Next Steps
- Benefits & Trade-offs — Understand what local AI can and can't do
- Machine Specs — See what your hardware can run and get model recommendations
- Pre-installed Models — Explore what comes with Docket