How to read this page: every category explains the approach (how that modality actually works on local hardware), lists the best free/open-weight models, and tags each with the loaders that can run it. Sizes are approximate on-disk weight sizes at common quantizations.

Chat & general assistants

The bread-and-butter category. These are instruction-tuned language models you'd use for writing, brainstorming, Q&A, summarisation and as the reasoning core of most agents.

Approach:  Pull a quantized GGUF (commonly Q4_K_M or Q5_K_M) and run it on any llama.cpp-based loader. On Apple Silicon, MLX variants are roughly 30% faster. On NVIDIA, AWQ with vLLM or EXL2 with ExLlamaV2 gives the best throughput.
Model | Size | License | Best at | Runs on
Llama 3.1 / 3.2 / 3.3 (Meta · 1B · 3B · 8B · 70B) | 1B–70B | Llama license (free for most use) | The default local assistant. Great balance of quality and speed. | Ollama, Open WebUI, llama.cpp, vLLM, MLX
Qwen 2.5 / Qwen 3 (Alibaba · 0.5B → 72B) | 0.5B–72B | Apache 2.0 (most sizes) | Strong multilingual; very strong at tool use & JSON output. | Ollama, Open WebUI, llama.cpp, vLLM, MLX
Mistral / Mixtral (Mistral AI · 7B · 8×7B · 8×22B) | 7B–141B | Apache 2.0 | MoE efficiency: fast responses with large-model quality. | Ollama, Open WebUI, llama.cpp, vLLM
Gemma 2 / Gemma 3 (Google · 2B · 9B · 27B) | 2B–27B | Gemma license (permissive) | Punches above its weight, especially the 9B on consumer GPUs. | Ollama, Open WebUI, llama.cpp, MLX
Phi-3.5 / Phi-4 (Microsoft · 3.8B · 14B) | 3.8B–14B | MIT | Tiny models that behave like big ones. Great on CPU. | Ollama, Open WebUI, llama.cpp
Hermes 3 / Nous Hermes (Nous Research · 8B · 70B · 405B) | 8B–405B | Llama license | Best-in-class tool calling, structured output & agent work. | Ollama, Open WebUI, llama.cpp, vLLM
Command R / R+ (Cohere · 32B · 104B) | 32B–104B | CC BY-NC (non-commercial) | Long-context RAG & citations. | Ollama, llama.cpp, vLLM
Yi 1.5 · GLM-4 · DeepSeek V3 (frontier open-weight releases) | 6B–671B | Mixed (mostly permissive) | For users with serious hardware; approaching GPT-4-class quality. | llama.cpp, vLLM (workstation class)
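
All of these loaders can speak the same OpenAI-compatible HTTP API, so one client works against any of them. A minimal sketch of building the request body; the model tag `llama3.1:8b` and Ollama's default port are assumptions, so substitute whatever your loader actually serves:

```python
import json

def chat_payload(model: str, user_msg: str,
                 system: str = "You are a helpful assistant.") -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.7,
        "stream": False,
    }

# Assumed model tag; POST this as JSON to e.g. http://localhost:11434/v1/chat/completions
payload = chat_payload("llama3.1:8b", "Summarise local inference in one sentence.")
print(json.dumps(payload, indent=2))
```

Because the payload shape is the same everywhere, swapping Ollama for vLLM or llama.cpp's server is just a change of base URL and model name.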

Coding models

Models fine-tuned on source code. Use them as local Copilot autocompletes, repo-aware refactorers, or the backend of tools like Continue, Aider and Open Interpreter.

Approach:  Most coding models ship as GGUF and run great under Ollama / Open WebUI. Pair the model with an IDE or CLI agent (Continue, Aider, Cline) that understands repo context. For fast autocomplete, pick a 1B–7B FIM-trained ("fill-in-the-middle") model.
Model | Size | License | Best at | Runs on
Qwen 2.5 Coder (Alibaba · 0.5B → 32B) | 0.5B–32B | Apache 2.0 | Current SOTA open code model. The 32B rivals closed Copilot quality. | Ollama, Open WebUI, llama.cpp, vLLM
DeepSeek Coder V2 (DeepSeek · 16B · 236B MoE) | 16B–236B | Custom (permissive) | Huge context, excellent at large-repo reasoning. | Ollama, llama.cpp, vLLM
Codestral (Mistral AI · 22B) | 22B | MNPL (non-prod free) | Multi-language code completion & generation. | Ollama, Open WebUI, llama.cpp
StarCoder 2 (BigCode · 3B · 7B · 15B) | 3B–15B | BigCode OpenRAIL-M | FIM autocomplete, 600+ languages. | Ollama, llama.cpp, vLLM
CodeLlama (Meta · 7B · 13B · 34B · 70B) | 7B–70B | Llama license | Mature, well supported, lots of specialist variants. | Ollama, Open WebUI, llama.cpp
Granite Code (IBM · 3B · 8B · 20B · 34B) | 3B–34B | Apache 2.0 | Enterprise-friendly license, solid quality. | Ollama, llama.cpp, vLLM
StableCode / Replit Code (Stability · Replit) | 1.3B–3B | Permissive | Tiny autocomplete models, perfect for laptops. | Ollama, llama.cpp
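
FIM models don't take chat prompts: the editor sends the code before and after the cursor wrapped in sentinel tokens, and the model generates the missing middle. A sketch using StarCoder-style sentinels; other families use different token names, so check the model card before relying on these:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a StarCoder-style fill-in-the-middle prompt.

    The model decodes the span that belongs between prefix and
    suffix, conditioned on the code on both sides of the cursor.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

before = "def mean(xs):\n    return "
after = " / len(xs)\n"
prompt = fim_prompt(before, after)
# A FIM-trained model would be expected to complete this with something like "sum(xs)"
print(prompt)
```

IDE agents like Continue assemble exactly this kind of prompt for you on every keystroke; the sketch just makes the wire format visible.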

Reasoning & math models

A newer class of "think-before-you-speak" models that produce an internal chain-of-thought before an answer. They're slower but crush math, logic, planning and agent decisions.

Approach:  Reasoning models generate 2×–10× more tokens than a normal chat model because they "think" first. Budget for longer inference time. They work on any llama.cpp loader but vLLM's speculative decoding helps a lot on GPU.
Model | Size | License | Best at | Runs on
DeepSeek-R1 / R1-Distill (DeepSeek · 1.5B → 671B) | 1.5B–671B | MIT | Open reasoning breakthrough; distills down to a 7B that beats GPT-4o on math. | Ollama, Open WebUI, llama.cpp, vLLM
QwQ / Qwen-Reasoning (Alibaba · 32B) | 32B | Apache 2.0 | Deep step-by-step analysis, excellent for agents. | Ollama, llama.cpp, vLLM
Marco-o1 / OpenThinker (community reasoning tunes) | 7B–32B | Apache 2.0 | Research-grade thinkers you can actually run on a consumer GPU. | Ollama, llama.cpp
Mathstral / DeepSeek-Math (specialist math models) | 7B | Apache / Custom | Symbolic maths, proofs, formal reasoning. | Ollama, llama.cpp
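
That 2×–10× multiplier translates directly into wall-clock time at a fixed decode speed, which is worth estimating before wiring a reasoning model into an interactive loop. A back-of-the-envelope helper; the numbers are illustrative, not benchmarks:

```python
def response_seconds(answer_tokens: int, thinking_multiplier: float,
                     tokens_per_sec: float) -> float:
    """Estimate decode time: total output is the visible answer
    plus the hidden chain-of-thought generated before it."""
    total_tokens = answer_tokens * thinking_multiplier
    return total_tokens / tokens_per_sec

# A 300-token answer at 40 tok/s: ~7.5 s for a normal chat model...
print(round(response_seconds(300, 1.0, 40.0), 1))  # → 7.5
# ...but ~37.5 s for a reasoning model with a 5x thinking budget
print(round(response_seconds(300, 5.0, 40.0), 1))  # → 37.5
```

This is why speculative decoding and a fast GPU matter more for this category than for plain chat.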

Vision & multimodal models

Models that accept images (and sometimes video) alongside text. Essential for computer-use agents like OpenClaw, document understanding, OCR and visual Q&A.

Approach:  Vision models bundle a text LLM with a vision encoder (usually CLIP or SigLIP). Ollama and Open WebUI both support them natively — just paste or attach an image. For agentic "see the screen" workflows, pair them with OpenClaw or Open Interpreter OS-mode.
Model | Size | License | Best at | Runs on
Llama 3.2 Vision (Meta · 11B · 90B) | 11B–90B | Llama license | General-purpose image understanding & VQA. | Ollama, Open WebUI, llama.cpp, vLLM
Qwen2-VL / Qwen2.5-VL (Alibaba · 2B · 7B · 72B) | 2B–72B | Apache 2.0 / Qwen license | UI screenshots, OCR, charts, video frames. Top pick for agents. | Ollama, Open WebUI, vLLM
LLaVA / LLaVA-NeXT (community · 7B · 13B · 34B) | 7B–34B | Apache 2.0 (weights vary) | The classic open vision-language model family. | Ollama, Open WebUI, llama.cpp
Pixtral (Mistral AI · 12B) | 12B | Apache 2.0 | High-quality image reasoning, strong at documents. | Ollama, vLLM
InternVL 2.5 (Shanghai AI Lab · 1B → 78B) | 1B–78B | MIT (weights) | Current open SOTA on most vision benchmarks. | vLLM, llama.cpp
MiniCPM-V / Florence-2 (edge vision models) | 0.2B–8B | Apache 2.0 | Tiny vision models for phones and IoT. | Ollama, llama.cpp
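
Under the OpenAI-compatible chat API these loaders expose, the image rides inside the user message as a base64 data URI. A sketch of assembling such a message; the field names follow the OpenAI vision format, and your loader's docs are the authority on what it accepts:

```python
import base64

def vision_message(image_bytes: bytes, question: str,
                   mime: str = "image/png") -> dict:
    """Build one user message carrying an image plus a text question."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# Placeholder bytes just to show the shape; read a real screenshot file in practice
msg = vision_message(b"\x89PNG placeholder", "Which button is highlighted?")
print(msg["content"][1]["image_url"]["url"][:22])  # → data:image/png;base64,
```

An agentic "see the screen" loop is just this message builder fed with fresh screenshots on every step.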

Image generation

Diffusion and flow-matching models that create images from text prompts. Completely different plumbing from LLMs — different loaders, different file formats.

Approach:  Ollama and Open WebUI do not run image models. You need a diffusion-specific loader: ComfyUI (node graphs, most powerful), Automatic1111 / Forge (classic webui), InvokeAI (polished desktop app), SwarmUI (modern UI on top of Comfy), or Draw Things (macOS/iOS native). Weights come as .safetensors checkpoints plus optional LoRAs, VAEs and ControlNets.
Model | VRAM needed | License | Best at | Runs on
FLUX.1 [dev] / [schnell] (Black Forest Labs · 12B) | 12–24 GB | FLUX.1 non-commercial / Apache | Current open SOTA. Stunning prompt adherence & text rendering. | ComfyUI, Forge, SwarmUI, Draw Things
Stable Diffusion 3.5 (Stability AI · Medium · Large) | 8–16 GB | Stability Community License | Best-supported ecosystem, huge LoRA catalog. | ComfyUI, A1111, InvokeAI, Forge
SDXL / SDXL Turbo (Stability AI · 3.5B) | 6–12 GB | OpenRAIL++ | The workhorse. Huge community, countless fine-tunes. | ComfyUI, A1111, InvokeAI, Draw Things
SD 1.5 & fine-tunes (Realistic Vision, DreamShaper…) | 4–6 GB | OpenRAIL / CreativeML | Still unbeaten for specific artistic styles via fine-tunes. | ComfyUI, A1111, Forge
Playground v3 / Kolors (community flagships) | 10–16 GB | Custom / Apache | Distinctive aesthetics, great for commercial art. | ComfyUI
ControlNet · IP-Adapter · LoRAs (not models, add-ons) | ~100 MB each | Mostly permissive | Pose control, style transfer, subject consistency, identity. | ComfyUI, A1111, InvokeAI

Speech: transcription & text-to-speech

Models that turn audio into text (STT / ASR) and text back into natural-sounding voices (TTS). Local speech is now at or above cloud quality.

Approach:  STT typically runs via whisper.cpp (C++ port of OpenAI Whisper) or faster-whisper (CTranslate2). TTS uses its own engines — Piper, Coqui XTTS, F5-TTS, StyleTTS2. LocalAI wraps many of these into one OpenAI-compatible server.
Model | Type | License | Best at | Runs on
Whisper v3 · distil-whisper (OpenAI · tiny → large-v3) | STT | MIT | 99-language transcription, the de facto standard. | whisper.cpp, faster-whisper, LocalAI
Parakeet / Canary (NVIDIA NeMo) | STT | CC-BY-4.0 | Ultra-fast English transcription. | NeMo, faster-whisper
Piper (rhasspy · dozens of voices) | TTS | MIT | Lightning-fast, CPU-friendly TTS. Great for assistants. | Piper, LocalAI, Home Assistant
Coqui XTTS v2 (Coqui · voice cloning) | TTS | CPML (non-commercial) | Clones any voice from 6 seconds of audio. | Coqui TTS, LocalAI
F5-TTS / StyleTTS 2 (natural prosody) | TTS | MIT / CC-BY-NC | Extremely natural, expressive synthesis. | native Python, ComfyUI nodes
Kokoro (compact all-in-one TTS) | TTS | Apache 2.0 | Tiny, fast, surprisingly good quality. | native Python, LocalAI
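
Whichever STT engine you choose, its output is usually a list of timed segments that still need packaging, for example into SubRip (.srt) subtitles. A small self-contained helper, assuming segments arrive as (start_seconds, end_seconds, text) tuples like the ones whisper.cpp and faster-whisper emit:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render transcription segments as a SubRip subtitle file."""
    blocks = [
        f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        for i, (start, end, text) in enumerate(segments, start=1)
    ]
    return "\n\n".join(blocks) + "\n"

print(to_srt([(0.0, 2.5, "Hello there."), (2.5, 5.0, "Local speech works.")]))
```

The same tuples feed equally well into VTT, plain transcripts, or an agent's memory; only the formatter changes.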

Music & sound generation

From background loops to full songs with vocals — local music models are maturing fast.

Approach:  Music models run via Python notebooks, dedicated apps, or ComfyUI audio nodes. Generation is usually slower than real-time on consumer hardware, so you "bake" tracks rather than stream them.
Model | Type | License | Best at | Runs on
Stable Audio Open (Stability AI) | Music / SFX | Stability Community | 47-second clips, sound effects, loops. | ComfyUI, native Python
MusicGen / AudioGen (Meta) | Music / SFX | CC-BY-NC | Text-to-music, melody-conditioned generation. | AudioCraft, ComfyUI
YuE / OpenMusic (open song generators) | Full songs | Apache 2.0 | Multi-minute songs with vocals and structure. | native Python
Bark (suno-ai) | Voice / SFX | MIT | Expressive voice acting, laughs, music snippets. | native Python, ComfyUI

Video generation

The newest — and most hardware-hungry — local modality. Expect seconds of video in exchange for minutes of GPU time.

Approach:  Every serious local video model runs through ComfyUI today. They demand heavy VRAM (16–48 GB) for high resolutions, but quantized GGUF versions are starting to land for consumer GPUs.
Model | VRAM | License | Best at | Runs on
HunyuanVideo (Tencent · 13B) | 24–80 GB | Custom (open) | Current open SOTA; highly cinematic results. | ComfyUI
Wan 2.1 / CogVideoX (community video models) | 12–24 GB | Apache 2.0 | Runs on a single 4090, great prompt adherence. | ComfyUI
LTX-Video (Lightricks) | 8–16 GB | OpenRAIL | Fast, near-real-time short video generation. | ComfyUI
Mochi 1 · AnimateDiff (image-to-video, animation) | 8–24 GB | Apache 2.0 | Animating stills, consistent motion, loops. | ComfyUI, A1111 extension

Embeddings & rerankers

Not chat models — they turn text into vectors for semantic search, RAG and memory. Essential for any agent that needs to "remember".

Approach:  Embedding models are small (often < 1 GB) and run fast on CPU. Ollama and Open WebUI expose them via the /v1/embeddings endpoint. LocalAI and AnythingLLM wire them in for you.
Model | Size | License | Best at | Runs on
nomic-embed-text (Nomic · 137M) | ~275 MB | Apache 2.0 | Great default: 8k context, multilingual variant. | Ollama, Open WebUI, LocalAI
BGE-M3 / BGE-large (BAAI · multi-function) | ~1.3 GB | MIT | Top of MTEB; dense + sparse + ColBERT in one model. | Ollama, llama.cpp, LocalAI
mxbai-embed-large (Mixedbread) | ~670 MB | Apache 2.0 | High quality for its size, great for English RAG. | Ollama, llama.cpp
Jina Embeddings v3 (Jina AI) | ~1.1 GB | CC-BY-NC | Long-context, task-LoRA-switchable. | native Python, Ollama (Q)
bge-reranker-v2-m3 (BAAI · cross-encoder) | ~570 MB | Apache 2.0 | Reranker: huge quality boost on top of any embedder. | LocalAI, native Python
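
Retrieval over those vectors is just nearest-neighbour search by cosine similarity; the reranker then rescores only the resulting short list. A dependency-free sketch of the first stage, with toy 3-d vectors standing in for real embedder output:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents closest to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

# Toy vectors standing in for real embeddings
docs = {"cats": [0.9, 0.1, 0.0], "gpus": [0.1, 0.9, 0.2], "baking": [0.0, 0.2, 0.9]}
print(top_k([0.2, 0.8, 0.1], docs, k=2))  # → ['gpus', 'cats']
```

In a real pipeline the vectors come from the /v1/embeddings endpoint and live in a vector store; the math is the same.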

Tiny & edge models

Sub-4B-parameter models that run well on CPUs, phones, Raspberry Pis, and modest laptops. Surprisingly capable — and perfect for always-on assistants.

Approach:  These models shine at short prompts and structured tasks. Run Q4 GGUFs on CPU through Ollama or llama.cpp. On iOS / Android use MLC-LLM, ExecuTorch or llama.cpp mobile builds.
Model | Size | License | Best at | Runs on
Llama 3.2 1B / 3B (Meta) | 1B–3B | Llama license | Best all-round tiny chat model. | Ollama, Open WebUI, llama.cpp, MLC-LLM
Qwen 2.5 0.5B / 1.5B / 3B (Alibaba) | 0.5B–3B | Apache 2.0 | Astonishing quality per parameter. | Ollama, Open WebUI, llama.cpp
Phi-3.5-mini (Microsoft · 3.8B) | 3.8B | MIT | Reasoning in a small package. | Ollama, llama.cpp
Gemma 2 2B (Google) | 2B | Gemma license | Punchy on CPU, strong safety tuning. | Ollama, llama.cpp
SmolLM 2 (Hugging Face · 135M · 360M · 1.7B) | 0.1B–1.7B | Apache 2.0 | Microscopic assistants, runs on anything. | Ollama, llama.cpp, ExecuTorch
TinyLlama / MobileLLM (mobile-first) | 125M–1.1B | Apache 2.0 | On-device draft models & smart reply. | llama.cpp, MLC-LLM, ExecuTorch

Quick sizing guide

What can your machine actually run?

A rough rule of thumb for Q4-quantized GGUF models. Real-world numbers vary by quantization, context length and loader.

💻 Entry: 8 GB RAM laptop (CPU only). Up to ~3B chat (Llama 3.2 3B, Qwen 2.5 3B), tiny code models, embeddings, Whisper small, Piper TTS, SD 1.5 on CPU (slow).

🖥️ Sweet spot: 16 GB RAM / 8 GB VRAM. 8B chat at good speed, 7B code, 7B vision, SDXL image gen, full Whisper large-v3, light ComfyUI workflows.

🎮 Prosumer: RTX 4090 / M3 Max 64 GB. 30B–70B chat at interactive speeds, FLUX.1, LTX-Video, DeepSeek-R1 Distill 32B, full agent stacks.

🏭 Workstation: multi-GPU workstation / server. Llama 3.1 405B, full DeepSeek V3/R1, HunyuanVideo, vLLM serving a whole team; anything the open ecosystem ships.
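
The rule of thumb above can be made concrete: a Q4_K_M GGUF lands at very roughly 0.6 bytes per parameter, with the KV cache on top of that at long contexts. A rough estimator; the 0.6 bytes/param figure is an approximation of common quant mixes, not a spec:

```python
def q4_gguf_gb(params_billion: float, bytes_per_param: float = 0.6) -> float:
    """Very rough size in GB of a Q4_K_M-quantized model.
    Real files vary with architecture and quant mix; KV cache is extra."""
    # params_billion * 1e9 params * bytes_per_param, divided by 1e9 bytes/GB
    return params_billion * bytes_per_param

for n in (3, 8, 70):
    print(f"{n}B params -> ~{q4_gguf_gb(n):.1f} GB")
```

The outputs line up with the tiers above: an 8B model (~4.8 GB) fits an 8 GB GPU with room for context, while a 70B (~42 GB) needs a prosumer card with offloading or a 64 GB Mac.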

Models change weekly. This catalog changes with them.

Bookmark this page and check the comparison table for the right loader to pair with each model.

