
Models & Quantization

This page explains what AI models are, how they're measured, and the techniques that make running them locally possible.

Understanding Model Sizes

When you see a model name like "Llama 3.2 3B" or "Qwen 2.5 14B," the number refers to the parameter count: the model's "knowledge," encoded as numerical weights.

Size       Parameters       What It Means
1B-3B      1-3 billion      Lightweight, fast, runs on almost anything
7B-8B      7-8 billion      The sweet spot for most local use
13B-14B    13-14 billion    More capable, needs decent hardware
30B-70B    30-70 billion    High capability, requires powerful hardware
100B+      100+ billion     Frontier models (GPT-4 class), require data center hardware

More parameters generally mean better reasoning, broader knowledge, and more nuanced responses, but also higher memory use and slower generation.

For perspective: GPT-3.5 Turbo, the model that helped spark widespread AI adoption, is estimated at around 20 billion parameters. With quantization, models of this size can run on a capable home computer or laptop with 16GB+ RAM.

Docket's pre-installed models range from 1B to 14B, covering most use cases while remaining accessible on typical hardware.

What is Quantization?

Here's the key to running large models on consumer hardware: quantization.

A full-precision model stores each parameter as a 16-bit or 32-bit floating-point number, i.e. 2 or 4 bytes per parameter. A 7B-parameter model at full precision therefore needs roughly 14-28 GB of RAM just to load, more than most laptops have.

Quantization compresses these numbers into smaller representations (8-bit, 4-bit, even 2-bit), dramatically reducing memory requirements with minimal quality loss.

Precision      Bits per Parameter    7B Model Size    Quality Impact
Full (FP16)    16 bits               ~14 GB           Original quality
8-bit (Q8)     8 bits                ~7 GB            Minimal loss
4-bit (Q4)     4 bits                ~4 GB            Slight loss, usually unnoticeable
2-bit (Q2)     2 bits                ~2 GB            Noticeable degradation

This is why you can run a "7 billion parameter" model on a laptop with 8GB RAM. Quantization makes it possible.
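
As a rough back-of-the-envelope check, you can estimate a model's memory footprint from its parameter count and bits per weight. The sketch below is illustrative only: real GGUF files run slightly larger because some layers are kept at higher precision and the file carries metadata.

def estimated_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size estimate: parameters x bits per weight, converted to gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Approximate sizes for a 7B model at common quantization levels
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{label}: ~{estimated_size_gb(7, bits):.1f} GB")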

Why It Works

Neural networks are surprisingly resilient to reduced precision. The relationships between parameters matter more than their exact values. A well-quantized 4-bit model often performs nearly as well as the original, while using a quarter of the memory.

The trade-off is that very aggressive quantization (2-bit) starts to degrade quality noticeably, especially for complex reasoning tasks.
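
To make the idea concrete, here is a toy sketch of symmetric round-to-nearest quantization. It is not the exact K-quant scheme GGUF uses, just the basic principle: weights are scaled into a small integer range for storage and scaled back at inference time. The per-weight error stays small at 8-bit and 4-bit and grows sharply at 2-bit.

import numpy as np

def quantize_dequantize(weights: np.ndarray, bits: int) -> np.ndarray:
    """Toy symmetric round-to-nearest quantization over one block of weights."""
    levels = 2 ** (bits - 1) - 1           # e.g. 7 representable magnitudes for 4-bit
    scale = np.abs(weights).max() / levels
    q = np.round(weights / scale)          # the small integers that would be stored
    return q * scale                       # dequantized values used at inference

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # stand-in for one layer's weights

for bits in (8, 4, 2):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit: mean absolute error {err:.6f}")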

What is GGUF?

GGUF (GPT-Generated Unified Format) is the standard file format for running models locally. When you download a model for local use, you're usually downloading a .gguf file.

The filename tells you about the model:

Qwen2.5-7B-Instruct-Q4_K_M.gguf
  • Qwen2.5 — The model family
  • 7B — 7 billion parameters
  • Instruct — Fine-tuned to follow instructions (vs. base models)
  • Q4_K_M — Quantization level (4-bit, K-quant, Medium quality)

Common quantization suffixes you'll see:

  • Q4_K_M — 4-bit, good balance of size and quality (most common)
  • Q5_K_M — 5-bit, slightly better quality, larger files
  • Q8_0 — 8-bit, near-original quality, much larger files
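
If you want to run a GGUF file outside of Docket, the llama-cpp-python library is one common option (not necessarily what Docket uses internally). A minimal sketch, assuming the package is installed and the path points at a file you have actually downloaded:

from llama_cpp import Llama

# Path is a placeholder; point it at a GGUF file on your machine.
llm = Llama(
    model_path="./Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,        # context window size in tokens
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
)
print(result["choices"][0]["message"]["content"])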

What is Hugging Face?

Hugging Face is the largest open repository for AI models. It's where researchers and developers share their models with the world. Thousands of models are available to download, from small efficient models to large capable ones.

When people create quantized versions of models for local use, they typically upload them to Hugging Face. Docket includes a built-in browser to search and download models directly.

Family      Created By     Known For
Llama       Meta           Well-rounded, widely supported
Qwen        Alibaba        Strong multilingual support, good at coding
Gemma       Google         Efficient, punches above its weight
Mistral     Mistral AI     Fast, capable
Phi         Microsoft      Remarkably capable for their small size
DeepSeek    DeepSeek       Strong reasoning and coding

New models are released regularly, and the community quickly creates quantized versions for local use.
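
If you prefer to script a download instead of using Docket's built-in browser, the huggingface_hub library can fetch a single GGUF file from a repository. A minimal sketch; the repository and file names below are examples only, so substitute whichever model you actually want:

from huggingface_hub import hf_hub_download

# Example repo and filename; browse huggingface.co for the exact names you need.
path = hf_hub_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",
)
print(f"Model downloaded to: {path}")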

Why This Matters

Understanding these concepts helps you:

  • Choose the right model — Match model size to your hardware
  • Read model names — Know what "Q4_K_M" means when browsing
  • Set expectations — Understand why a 7B model might not match GPT-4
  • Troubleshoot issues — If a model runs slowly, you'll know why

Next Steps