
Models & Quantization

This page explains what AI models are, how they're measured, and the techniques that make running them locally possible.

Understanding Model Sizes

When you see a model name like "Llama 3.2 3B" or "Qwen 2.5 14B," the number refers to the parameter count: the model's "knowledge," encoded as numerical weights.

Size       Parameters       What It Means
1B-3B      1-3 billion      Lightweight, fast, runs on almost anything
7B-8B      7-8 billion      The sweet spot for most local use
13B-14B    13-14 billion    More capable, needs decent hardware
30B-70B    30-70 billion    High capability, requires powerful hardware
100B+      100+ billion     Frontier models (GPT-4 class), require data center hardware

More parameters generally mean better reasoning, broader knowledge, and more nuanced responses, but also higher memory use and slower generation.

For perspective: GPT-3.5 Turbo, the model that helped spark widespread AI adoption, is estimated at around 20 billion parameters. With quantization, models of this size can run on a capable home computer or laptop with 16GB+ RAM.

Docket's pre-installed models range from 1B to 14B, covering most use cases while remaining accessible on typical hardware.

What is Quantization?

Here's the key to running large models on consumer hardware: quantization.

A full-precision model stores each parameter as a 16-bit or 32-bit floating-point number, i.e. 2 or 4 bytes per parameter. A 7B-parameter model at full precision therefore needs roughly 14-28 GB of RAM just to load, more than most laptops have.

Quantization compresses these numbers into smaller representations (8-bit, 4-bit, even 2-bit), dramatically reducing memory requirements with minimal quality loss.

Precision      Bits per Parameter    7B Model Size    Quality Impact
Full (FP16)    16 bits               ~14 GB           Original quality
8-bit (Q8)     8 bits                ~7 GB            Minimal loss
4-bit (Q4)     4 bits                ~4 GB            Slight loss, usually unnoticeable
2-bit (Q2)     2 bits                ~2 GB            Noticeable degradation

This is why you can run a "7 billion parameter" model on a laptop with 8GB RAM. Quantization makes it possible.
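
As a rough back-of-the-envelope check, you can estimate a model's memory footprint from its parameter count and bits per weight. The sketch below is illustrative only: real GGUF files run slightly larger because some layers are kept at higher precision and the file carries metadata.

def estimated_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size estimate: parameters x bits per weight, converted to gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Approximate sizes for a 7B model at common quantization levels
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    print(f"{label}: ~{estimated_size_gb(7, bits):.1f} GB")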

Why It Works

Neural networks are surprisingly resilient to reduced precision. The relationships between parameters matter more than their exact values. A well-quantized 4-bit model often performs nearly as well as the original, while using a quarter of the memory.

The trade-off is that very aggressive quantization (2-bit) starts to degrade quality noticeably, especially for complex reasoning tasks.
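
To make the idea concrete, here is a toy sketch of symmetric round-to-nearest quantization. It is not the exact K-quant scheme GGUF uses, just the basic principle: weights are scaled into a small integer range for storage and scaled back at inference time. The per-weight error stays small at 8-bit and 4-bit and grows sharply at 2-bit.

import numpy as np

def quantize_dequantize(weights: np.ndarray, bits: int) -> np.ndarray:
    """Toy symmetric round-to-nearest quantization over one block of weights."""
    levels = 2 ** (bits - 1) - 1           # e.g. 7 representable magnitudes for 4-bit
    scale = np.abs(weights).max() / levels
    q = np.round(weights / scale)          # the small integers that would be stored
    return q * scale                       # dequantized values used at inference

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # stand-in for one layer's weights

for bits in (8, 4, 2):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit: mean absolute error {err:.6f}")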

What is GGUF?

GGUF (GPT-Generated Unified Format) is the standard file format for running models locally. When you download a model for local use, you're usually downloading a .gguf file.

The filename tells you about the model:

Qwen2.5-7B-Instruct-Q4_K_M.gguf
  • Qwen2.5 — The model family
  • 7B — 7 billion parameters
  • Instruct — Fine-tuned to follow instructions (vs. base models)
  • Q4_K_M — Quantization level (4-bit, K-quant, Medium quality)

Common quantization suffixes you'll see:

  • Q4_K_M — 4-bit, good balance of size and quality (most common)
  • Q5_K_M — 5-bit, slightly better quality, larger files
  • Q8_0 — 8-bit, near-original quality, much larger files
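
If you want to run a GGUF file outside of Docket, the llama-cpp-python library is one common option (not necessarily what Docket uses internally). A minimal sketch, assuming the package is installed and the path points at a file you have actually downloaded:

from llama_cpp import Llama

# Path is a placeholder; point it at a GGUF file on your machine.
llm = Llama(
    model_path="./Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,        # context window size in tokens
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
)
print(result["choices"][0]["message"]["content"])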

What is Hugging Face?

Hugging Face is the largest open repository for AI models. It's where researchers and developers share their models with the world. Thousands of models are available to download, from small efficient models to large capable ones.

When people create quantized versions of models for local use, they typically upload them to Hugging Face. Docket includes a built-in browser to search and download models directly.

Family      Created By     Known For
Llama       Meta           Well-rounded, widely supported
Qwen        Alibaba        Strong multilingual support, good at coding
Gemma       Google         Efficient, punches above its weight
Mistral     Mistral AI     Fast, capable
Phi         Microsoft      Remarkably capable for their small size
DeepSeek    DeepSeek       Strong reasoning and coding

New models are released regularly, and the community quickly creates quantized versions for local use.
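
If you prefer to script a download instead of using Docket's built-in browser, the huggingface_hub library can fetch a single GGUF file from a repository. A minimal sketch; the repository and file names below are examples only, so substitute whichever model you actually want:

from huggingface_hub import hf_hub_download

# Example repo and filename; browse huggingface.co for the exact names you need.
path = hf_hub_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",
)
print(f"Model downloaded to: {path}")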

Why This Matters

Understanding these concepts helps you:

  • Choose the right model — Match model size to your hardware
  • Read model names — Know what "Q4_K_M" means when browsing
  • Set expectations — Understand why a 7B model might not match GPT-4
  • Troubleshoot issues — If a model runs slowly, you'll know why

Next Steps