How to read this page: every category explains the approach (how that modality actually works on local hardware), lists the best free/open-weight models, and tags each with the loaders that can run it. Sizes are approximate on-disk weight sizes at common quantizations.

Chat & general assistants

The bread-and-butter category. These are instruction-tuned language models you'd use for writing, brainstorming, Q&A, summarisation and as the reasoning core of most agents.

Approach:  Pull a quantized GGUF (commonly Q4_K_M or Q5_K_M) and run it on any llama.cpp-based loader. On Apple Silicon, MLX variants are ~30% faster. On NVIDIA, EXL2/AWQ + vLLM or ExLlamaV2 give the best throughput.
ModelSizeLicenseBest atRuns on
DeepSeek V4 Pro / V4 FlashDeepSeek · MoE · 49B / 13B active · 1M ctx 284B–1.6T MIT Apr 2026 — open frontier-class quality, 1M context, ~10% of V3's KV cache.
Ollama llama.cpp vLLM Workstation
Llama 4 Scout & MaverickMeta · MoE · natively multimodal 17B–400B+ Llama 4 Community License Maverick still tops MMLU among open models (85.5%). Scout fits one H100.
Ollama Open WebUI llama.cpp vLLM MLX
Qwen 3.5 / Qwen 3.6Alibaba · 27B · 35B-A3B · 122B-A10B · 397B-A17B 3B–397B Apache 2.0 (most sizes) SOTA tool-use & JSON; A-series MoEs activate tiny slices for speed.
Ollama Open WebUI llama.cpp vLLM MLX
Gemma 4Google · 2B · 9B · 27B · multimodal 2B–27B Gemma license (permissive) Frontier-level quality at every size, native vision, friendly on consumer GPUs.
Ollama Open WebUI llama.cpp MLX
Mistral Large 3 / Small 4Mistral AI · 24B · 675B / 41B active 24B–675B Apache 2.0 Multimodal, 80+ languages, very strong function-calling.
Ollama Open WebUI llama.cpp vLLM
GLM-5.1Zhipu AI · open · agentic-tuned 9B–355B Open MIT-style Cleanest open tool-use schema; great for long agent loops.
Ollama vLLM SGLang
gpt-oss-120b / 20bOpenAI · MoE · 5.1B active 20B–120B Apache 2.0 OpenAI's first open release. 120B fits a single 80 GB H100; 20B runs on an M-series Mac.
Ollama vLLM llama.cpp MLX
Hermes 4 70B / 405BNous Research · agent-trace tuned 70B–405B Llama license Trained mostly on real agent traces — best-in-class tool-calling & long-horizon work.
Ollama Open WebUI llama.cpp vLLM
Llama 3.x (3.1 / 3.2 / 3.3)Meta · 1B · 3B · 8B · 70B 1B–70B Llama license Still the most-used local default in 2026 — proven, well-supported.
Ollama Open WebUI llama.cpp vLLM MLX
Phi-4 / Phi-4-miniMicrosoft · 3.8B · 14B 3.8B–14B MIT Tiny models that behave like big ones. Great on CPU and edge devices.
Ollama Open WebUI llama.cpp

Coding models

Models fine-tuned on source code. Use them as local Copilot autocompletes, repo-aware refactorers, or the backend of tools like Continue, Aider and Open Interpreter.

Approach:  Most coding models ship as GGUF and run great under Ollama / Open WebUI. Pair the model with an IDE or CLI agent (Continue, Aider, Cline) that understands repo context. For fast autocomplete, pick a 1B–7B FIM-trained ("fill-in-the-middle") model.
ModelSizeLicenseBest atRuns on
Qwen3 CoderAlibaba · 1.5B → 480B-A35B MoE 1.5B–480B Apache 2.0 2026 SOTA open code model. 480B-A35B MoE matches closed Copilot quality on real PRs.
OllamaOpen WebUIllama.cppvLLM
DeepSeek-Coder V3 / V4-CodeDeepSeek · MoE · open weights 16B–671B MIT Massive context, best-in-class at large-repo reasoning & refactors.
Ollamallama.cppvLLM
CodestralMistral AI · 22B 22B MNPL (non-prod free) Multi-language code completion & generation.
OllamaOpen WebUIllama.cpp
StarCoder 2BigCode · 3B · 7B · 15B 3B–15B BigCode OpenRAIL-M FIM autocomplete, 600+ languages.
Ollamallama.cppvLLM
CodeLlamaMeta · 7B · 13B · 34B · 70B 7B–70B Llama license Mature, well-supported, lots of specialist variants.
OllamaOpen WebUIllama.cpp
Granite CodeIBM · 3B · 8B · 20B · 34B 3B–34B Apache 2.0 Enterprise-friendly license, solid quality.
Ollamallama.cppvLLM
StableCode / Replit CodeStability · Replit 1.3B–3B Permissive Tiny autocomplete models — perfect for laptops.
Ollamallama.cpp

Reasoning & math models

A newer class of "think-before-you-speak" models that produce an internal chain-of-thought before an answer. They're slower but crush math, logic, planning and agent decisions.

Approach:  Reasoning models generate 2×–10× more tokens than a normal chat model because they "think" first. Budget for longer inference time. They work on any llama.cpp loader but vLLM's speculative decoding helps a lot on GPU.
ModelSizeLicenseBest atRuns on
DeepSeek V4 (reasoning)DeepSeek · 1.6T MoE / 49B active 284B–1.6T MIT 2026 SOTA — V4 carries the R1 reasoning lineage forward at frontier scale, with a 1M-token "think" budget.
Ollamallama.cppvLLM
DeepSeek-R1 / R1-DistillDeepSeek · 1.5B → 671B 1.5B–671B MIT The breakthrough that started the open reasoning wave — distills down to 7B that still beats GPT-4o on math.
OllamaOpen WebUIllama.cppvLLM
QwQ-Max / Qwen3-ReasoningAlibaba · 32B · 72B 32B–72B Apache 2.0 Deep step-by-step analysis, excellent backbone for agent loops.
Ollamallama.cppvLLM
gpt-oss-120b (reasoning mode)OpenAI · 120B MoE / 5.1B active 120B Apache 2.0 OpenAI's open release ships with a built-in reasoning mode — runs on a single 80 GB H100.
OllamavLLMllama.cpp
OpenThinker · DeepThinker · Marco-o1Community reasoning tunes 7B–32B Apache 2.0 Research-grade thinkers you can actually run on a consumer GPU.
Ollamallama.cpp
Mathstral / DeepSeek-MathSpecialist math models 7B Apache / Custom Symbolic maths, proofs, formal reasoning.
Ollamallama.cpp

Vision & multimodal models

Models that accept images (and sometimes video) alongside text. Essential for computer-use agents like OpenClaw, document understanding, OCR and visual Q&A.

Approach:  Vision models bundle a text LLM with a vision encoder (usually CLIP or SigLIP). Ollama and Open WebUI both support them natively — just paste or attach an image. For agentic "see the screen" workflows, pair them with OpenClaw or Open Interpreter OS-mode.
ModelSizeLicenseBest atRuns on
Llama 4 (native VL) · Llama 3.2 VisionMeta · MoE · 11B · 90B · 400B+ 11B–400B+ Llama 4 / Llama license Llama 4 is natively multimodal end-to-end; 3.2 Vision is the proven workhorse.
OllamaOpen WebUIllama.cppvLLM
Qwen3-VL / Qwen2.5-VLAlibaba · 2B · 7B · 72B · 235B 2B–235B Apache 2.0 / Qwen license UI screenshots, OCR, charts, video frames. Top pick for computer-use agents.
OllamaOpen WebUIvLLM
Gemma 4 (vision)Google · 9B · 27B 9B–27B Gemma license Compact, native vision, very fast on a single 24 GB GPU.
Ollamallama.cppMLX
LLaVA / LLaVA-NeXTCommunity · 7B · 13B · 34B 7B–34B Apache 2.0 (weights vary) The classic open vision-language model family.
OllamaOpen WebUIllama.cpp
PixtralMistral AI · 12B 12B Apache 2.0 High-quality image reasoning, strong at documents.
OllamavLLM
InternVL 3Shanghai AI Lab · 1B → 108B 1B–108B MIT (weights) 2026 update — leads most open vision benchmarks & long-video understanding.
vLLMllama.cpp
MiniCPM-V / Florence-2Edge vision models 0.2B–8B Apache 2.0 Tiny vision models for phones and IoT.
Ollamallama.cpp

Image generation

Diffusion and flow-matching models that create images from text prompts. Completely different plumbing from LLMs — different loaders, different file formats.

Approach:  Ollama and Open WebUI do not run image models. You need a diffusion-specific loader: ComfyUI (node graphs, most powerful), Automatic1111 / Forge (classic webui), InvokeAI (polished desktop app), SwarmUI (modern UI on top of Comfy), or Draw Things (macOS/iOS native). Weights come as .safetensors checkpoints plus optional LoRAs, VAEs and ControlNets.
ModelVRAM neededLicenseBest atRuns on
FLUX.2 [dev] / [pro]Black Forest Labs · production-grade 14–28 GB FLUX.2 community / commercial license 2026 SOTA — sharper prompt adherence, real text rendering, near-photoreal quality.
ComfyUIForgeSwarmUIDraw Things
FLUX.1 [dev] / [schnell]Black Forest Labs · 12B 12–24 GB FLUX.1 non-commercial / Apache Still the most-downloaded open checkpoint of 2025 — huge LoRA library.
ComfyUIForgeSwarmUIDraw Things
Stable Diffusion 3.5Stability AI · Medium · Large 8–16 GB Stability Community License Best supported ecosystem, huge LoRA catalog.
ComfyUIA1111InvokeAIForge
SDXL / SDXL TurboStability AI · 3.5B 6–12 GB OpenRAIL++ The workhorse. Huge community, countless fine-tunes.
ComfyUIA1111InvokeAIDraw Things
SD 1.5 & fine-tunesRealistic Vision, DreamShaper… 4–6 GB OpenRAIL / CreativeML Still unbeaten for specific artistic styles via fine-tunes.
ComfyUIA1111Forge
Playground v3 / KolorsCommunity flagships 10–16 GB Custom / Apache Distinctive aesthetics, great for commercial art.
ComfyUI
ControlNet · IP-Adapter · LoRAsNot models — add-ons ~100 MB each Mostly permissive Pose control, style transfer, subject consistency, identity.
ComfyUIA1111InvokeAI

Speech: transcription & text-to-speech

Models that turn audio into text (STT / ASR) and text back into natural-sounding voices (TTS). Local speech is now at or above cloud quality.

Approach:  STT typically runs via whisper.cpp (C++ port of OpenAI Whisper) or faster-whisper (CTranslate2). TTS uses its own engines — Piper, Coqui XTTS, F5-TTS, StyleTTS2. LocalAI wraps many of these into one OpenAI-compatible server.
ModelTypeLicenseBest atRuns on
Whisper v3 · distil-whisperOpenAI · tiny → large-v3 STT MIT 99-language transcription, the de-facto standard.
whisper.cppfaster-whisperLocalAI
Parakeet / CanaryNVIDIA NeMo STT CC-BY-4.0 Ultra-fast English transcription.
NeMofaster-whisper
Piperrhasspy · dozens of voices TTS MIT Lightning-fast, CPU-friendly TTS. Great for assistants.
PiperLocalAIHome Assistant
Coqui XTTS v2Coqui · voice cloning TTS CPML (non-commercial) Clones any voice from 6 seconds of audio.
Coqui TTSLocalAI
F5-TTS / StyleTTS 2Natural prosody TTS MIT / CC-BY-NC Extremely natural, expressive synthesis.
Native PythonComfyUI nodes
KokoroCompact all-in-one TTS TTS Apache 2.0 Tiny, fast, surprisingly good quality.
Native PythonLocalAI

Music & sound generation

From background loops to full songs with vocals — local music models are maturing fast.

Approach:  Music models run via Python notebooks, dedicated apps, or ComfyUI audio nodes. Generation is usually slower than real-time on consumer hardware, so you "bake" tracks rather than stream them.
ModelTypeLicenseBest atRuns on
Stable Audio OpenStability AIMusic / SFXStability Community47-second clips, sound effects, loops.
ComfyUINative Python
MusicGen / AudioGenMetaMusic / SFXCC-BY-NCText-to-music, melody-conditioned generation.
AudioCraftComfyUI
YuE / OpenMusicOpen song generatorsFull songsApache 2.0Multi-minute songs with vocals and structure.
Native Python
Barksuno-aiVoice / SFXMITExpressive voice acting, laughs, music snippets.
Native PythonComfyUI

Video generation

The newest — and most hardware-hungry — local modality. Expect seconds of video in exchange for minutes of GPU time.

Approach:  Every serious local video model runs through ComfyUI today. They demand heavy VRAM (16–48 GB) for high resolutions, but quantized GGUF versions are starting to land for consumer GPUs.
ModelVRAMLicenseBest atRuns on
HunyuanVideo 1.5Tencent · 13B+12–80 GBCustom (open)2026 update — quantized GGUFs now run on consumer 24 GB GPUs with cinematic quality.
ComfyUI
Wan 2.5 / CogVideoX-2Community video models12–24 GBApache 2.0Runs on a single 4090, great prompt adherence and motion consistency.
ComfyUI
LTX-VideoLightricks8–16 GBOpenRAILFast, near-real-time short video generation.
ComfyUI
Mochi 1 · AnimateDiffImage-to-video, animation8–24 GBApache 2.0Animating stills, consistent motion, loops.
ComfyUIA1111 ext.

Embeddings & rerankers

Not chat models — they turn text into vectors for semantic search, RAG and memory. Essential for any agent that needs to "remember".

Approach:  Embedding models are small (often < 1 GB) and run fast on CPU. Ollama and Open WebUI expose them via the /v1/embeddings endpoint. LocalAI and AnythingLLM wire them in for you.
ModelSizeLicenseBest atRuns on
nomic-embed-textNomic · 137M~275 MBApache 2.0Great default, 8k context, multilingual variant.
OllamaOpen WebUILocalAI
BGE-M3 / BGE-largeBAAI · multi-function~1.3 GBMITTop of MTEB; dense + sparse + ColBERT in one model.
Ollamallama.cppLocalAI
mxbai-embed-largeMixedbread~670 MBApache 2.0High quality for its size, great for English RAG.
Ollamallama.cpp
Jina Embeddings v3Jina AI~1.1 GBCC-BY-NCLong-context, task-LoRA-switchable.
Native PythonOllama (Q)
bge-reranker-v2-m3BAAI · cross-encoder~570 MBApache 2.0Reranker — huge quality boost on top of any embedder.
LocalAINative Python

Tiny & edge models

Sub-4B-parameter models that run well on CPUs, phones, Raspberry Pis, and modest laptops. Surprisingly capable — and perfect for always-on assistants.

Approach:  These models shine at short prompts and structured tasks. Run Q4 GGUFs on CPU through Ollama or llama.cpp. On iOS / Android use MLC-LLM, ExecuTorch or llama.cpp mobile builds.
ModelSizeLicenseBest atRuns on
Gemma 4 2BGoogle · vision-capable2BGemma license2026 — punches above 9B-class quality on a phone-sized footprint.
Ollamallama.cpp
Llama 3.2 1B / 3BMeta1B–3BLlama licenseBest all-round tiny chat model in the wild.
OllamaOpen WebUIllama.cppMLC-LLM
Qwen 3 0.5B / 1.5B / 3BAlibaba0.5B–3BApache 2.0Astonishing quality per parameter, full tool-use support.
OllamaOpen WebUIllama.cpp
Phi-4-miniMicrosoft · 3.8B3.8BMITReasoning in a small package — runs comfortably on CPU.
Ollamallama.cpp
SmolLM 2Hugging Face · 135M · 360M · 1.7B0.1B–1.7BApache 2.0Microscopic assistants, runs on anything.
Ollamallama.cppExecuTorch
TinyLlama / MobileLLMMobile-first125M–1.1BApache 2.0On-device draft-models & smart-reply.
llama.cppMLC-LLMExecuTorch
Quick sizing guide

What can your machine actually run?

A rough rule of thumb for Q4-quantized GGUF models. Real-world numbers vary by quantization, context length and loader.

💻

8 GB RAM laptop (CPU only)

Up to ~3B chat (Llama 3.2 3B, Qwen 3 3B, Gemma 4 2B), tiny code models, embeddings, Whisper small, Piper TTS, SD 1.5 on CPU (slow).

Entry
🖥️

16 GB RAM / 8–12 GB VRAM

9B chat at good speed (Gemma 4 9B, Qwen 3.5 7B), 7B code, 7B vision, FLUX.1 / SDXL image gen, full Whisper large-v3, light ComfyUI video.

Sweet spot
🎮

RTX 5090 / M4 Max 64 GB

27B–70B chat at interactive speeds, FLUX.2, HunyuanVideo 1.5 GGUF, DeepSeek-R1-Distill 32B, gpt-oss-20b, full agent stacks.

Prosumer
🏭

Multi-GPU workstation / server

DeepSeek V4 Pro 1.6T, Llama 4 Maverick 400B+, gpt-oss-120b, Mistral Large 3 675B, HunyuanVideo full — anything the open ecosystem ships in 2026.

Workstation

Models change weekly. This catalog changes with them.

Bookmark this page and check the comparison table for the right loader to pair with each model.

See all loaders compared → Back to runtimes

Join the global local-AI community

Live posts on X, 470K+ builders in r/LocalLLaMA, active Discord & Matrix rooms, and trending GitHub repos — all gathered in one hub.