How to read this page: every category explains the approach (how that modality actually works on local hardware), lists the best free/open-weight models, and tags each with the loaders that can run it. Sizes are approximate on-disk weight sizes at common quantizations.

Chat & general assistants

The bread-and-butter category. These are instruction-tuned language models you'd use for writing, brainstorming, Q&A, summarisation and as the reasoning core of most agents.

Approach:  Pull a quantized GGUF (commonly Q4_K_M or Q5_K_M) and run it on any llama.cpp-based loader. On Apple Silicon, MLX variants are roughly 30% faster. On NVIDIA, AWQ with vLLM or EXL2 with ExLlamaV2 gives the best throughput.
Model | Size | License | Best at | Runs on
Llama 3.1 / 3.2 / 3.3 (Meta · 1B · 3B · 8B · 70B) | 1B–70B | Llama license (free for most use) | The default local assistant. Great balance of quality and speed. | Ollama, Open WebUI, llama.cpp, vLLM, MLX
Qwen 2.5 / Qwen 3 (Alibaba · 0.5B → 72B) | 0.5B–72B | Apache 2.0 (most sizes) | Strong multilingual; very strong at tool use & JSON output. | Ollama, Open WebUI, llama.cpp, vLLM, MLX
Mistral / Mixtral (Mistral AI · 7B · 8×7B · 8×22B) | 7B–141B | Apache 2.0 | MoE efficiency: fast responses with large-model quality. | Ollama, Open WebUI, llama.cpp, vLLM
Gemma 2 / Gemma 3 (Google · 2B · 9B · 27B) | 2B–27B | Gemma license (permissive) | Punches above its weight, especially the 9B on consumer GPUs. | Ollama, Open WebUI, llama.cpp, MLX
Phi-3.5 / Phi-4 (Microsoft · 3.8B · 14B) | 3.8B–14B | MIT | Tiny models that behave like big ones. Great on CPU. | Ollama, Open WebUI, llama.cpp
Hermes 3 / Nous Hermes (Nous Research · 8B · 70B · 405B) | 8B–405B | Llama license | Best-in-class tool calling, structured output & agent work. | Ollama, Open WebUI, llama.cpp, vLLM
Command R / R+ (Cohere · 32B · 104B) | 32B–104B | CC BY-NC (non-commercial) | Long-context RAG & citations. | Ollama, llama.cpp, vLLM
Yi 1.5 · GLM-4 · DeepSeek V3 (frontier open-weight releases) | 6B–671B | Mixed (mostly permissive) | For users with serious hardware; approaching GPT-4-class quality. | llama.cpp, vLLM (workstation class)
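
All of these loaders can speak the same OpenAI-compatible HTTP API, so one client works against any of them. A minimal sketch of building the request body; the model tag `llama3.1:8b` and Ollama's default port are assumptions, so substitute whatever your loader actually serves:

```python
import json

def chat_payload(model: str, user_msg: str,
                 system: str = "You are a helpful assistant.") -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.7,
        "stream": False,
    }

# Assumed model tag; POST this as JSON to e.g. http://localhost:11434/v1/chat/completions
payload = chat_payload("llama3.1:8b", "Summarise local inference in one sentence.")
print(json.dumps(payload, indent=2))
```

Because the payload shape is the same everywhere, swapping Ollama for vLLM or llama.cpp's server is just a change of base URL and model name.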

Coding models

Models fine-tuned on source code. Use them as local Copilot autocompletes, repo-aware refactorers, or the backend of tools like Continue, Aider and Open Interpreter.

Approach:  Most coding models ship as GGUF and run great under Ollama / Open WebUI. Pair the model with an IDE or CLI agent (Continue, Aider, Cline) that understands repo context. For fast autocomplete, pick a 1B–7B FIM-trained ("fill-in-the-middle") model.
Model | Size | License | Best at | Runs on
Qwen 2.5 Coder (Alibaba · 0.5B → 32B) | 0.5B–32B | Apache 2.0 | Current SOTA open code model. The 32B rivals closed Copilot quality. | Ollama, Open WebUI, llama.cpp, vLLM
DeepSeek Coder V2 (DeepSeek · 16B · 236B MoE) | 16B–236B | Custom (permissive) | Huge context, excellent at large-repo reasoning. | Ollama, llama.cpp, vLLM
Codestral (Mistral AI · 22B) | 22B | MNPL (non-prod free) | Multi-language code completion & generation. | Ollama, Open WebUI, llama.cpp
StarCoder 2 (BigCode · 3B · 7B · 15B) | 3B–15B | BigCode OpenRAIL-M | FIM autocomplete, 600+ languages. | Ollama, llama.cpp, vLLM
CodeLlama (Meta · 7B · 13B · 34B · 70B) | 7B–70B | Llama license | Mature, well supported, lots of specialist variants. | Ollama, Open WebUI, llama.cpp
Granite Code (IBM · 3B · 8B · 20B · 34B) | 3B–34B | Apache 2.0 | Enterprise-friendly license, solid quality. | Ollama, llama.cpp, vLLM
StableCode / Replit Code (Stability · Replit) | 1.3B–3B | Permissive | Tiny autocomplete models, perfect for laptops. | Ollama, llama.cpp
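
FIM models don't take chat prompts: the editor sends the code before and after the cursor wrapped in sentinel tokens, and the model generates the missing middle. A sketch using StarCoder-style sentinels; other families use different token names, so check the model card before relying on these:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a StarCoder-style fill-in-the-middle prompt.

    The model decodes the span that belongs between prefix and
    suffix, conditioned on the code on both sides of the cursor.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

before = "def mean(xs):\n    return "
after = " / len(xs)\n"
prompt = fim_prompt(before, after)
# A FIM-trained model would be expected to complete this with something like "sum(xs)"
print(prompt)
```

IDE agents like Continue assemble exactly this kind of prompt for you on every keystroke; the sketch just makes the wire format visible.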

Reasoning & math models

A newer class of "think-before-you-speak" models that produce an internal chain-of-thought before an answer. They're slower but crush math, logic, planning and agent decisions.

Approach:  Reasoning models generate 2×–10× more tokens than a normal chat model because they "think" first. Budget for longer inference time. They work on any llama.cpp loader but vLLM's speculative decoding helps a lot on GPU.
Model | Size | License | Best at | Runs on
DeepSeek-R1 / R1-Distill (DeepSeek · 1.5B → 671B) | 1.5B–671B | MIT | Open reasoning breakthrough; distills down to a 7B that beats GPT-4o on math. | Ollama, Open WebUI, llama.cpp, vLLM
QwQ / Qwen-Reasoning (Alibaba · 32B) | 32B | Apache 2.0 | Deep step-by-step analysis, excellent for agents. | Ollama, llama.cpp, vLLM
Marco-o1 / OpenThinker (community reasoning tunes) | 7B–32B | Apache 2.0 | Research-grade thinkers you can actually run on a consumer GPU. | Ollama, llama.cpp
Mathstral / DeepSeek-Math (specialist math models) | 7B | Apache / Custom | Symbolic maths, proofs, formal reasoning. | Ollama, llama.cpp
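
That 2×–10× multiplier translates directly into wall-clock time at a fixed decode speed, which is worth estimating before wiring a reasoning model into an interactive loop. A back-of-the-envelope helper; the numbers are illustrative, not benchmarks:

```python
def response_seconds(answer_tokens: int, thinking_multiplier: float,
                     tokens_per_sec: float) -> float:
    """Estimate decode time: total output is the visible answer
    plus the hidden chain-of-thought generated before it."""
    total_tokens = answer_tokens * thinking_multiplier
    return total_tokens / tokens_per_sec

# A 300-token answer at 40 tok/s: ~7.5 s for a normal chat model...
print(round(response_seconds(300, 1.0, 40.0), 1))  # → 7.5
# ...but ~37.5 s for a reasoning model with a 5x thinking budget
print(round(response_seconds(300, 5.0, 40.0), 1))  # → 37.5
```

This is why speculative decoding and a fast GPU matter more for this category than for plain chat.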

Vision & multimodal models

Models that accept images (and sometimes video) alongside text. Essential for computer-use agents like OpenClaw, document understanding, OCR and visual Q&A.

Approach:  Vision models bundle a text LLM with a vision encoder (usually CLIP or SigLIP). Ollama and Open WebUI both support them natively — just paste or attach an image. For agentic "see the screen" workflows, pair them with OpenClaw or Open Interpreter OS-mode.
Model | Size | License | Best at | Runs on
Llama 3.2 Vision (Meta · 11B · 90B) | 11B–90B | Llama license | General-purpose image understanding & VQA. | Ollama, Open WebUI, llama.cpp, vLLM
Qwen2-VL / Qwen2.5-VL (Alibaba · 2B · 7B · 72B) | 2B–72B | Apache 2.0 / Qwen license | UI screenshots, OCR, charts, video frames. Top pick for agents. | Ollama, Open WebUI, vLLM
LLaVA / LLaVA-NeXT (community · 7B · 13B · 34B) | 7B–34B | Apache 2.0 (weights vary) | The classic open vision-language model family. | Ollama, Open WebUI, llama.cpp
Pixtral (Mistral AI · 12B) | 12B | Apache 2.0 | High-quality image reasoning, strong at documents. | Ollama, vLLM
InternVL 2.5 (Shanghai AI Lab · 1B → 78B) | 1B–78B | MIT (weights) | Current open SOTA on most vision benchmarks. | vLLM, llama.cpp
MiniCPM-V / Florence-2 (edge vision models) | 0.2B–8B | Apache 2.0 | Tiny vision models for phones and IoT. | Ollama, llama.cpp
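
Under the OpenAI-compatible chat API these loaders expose, the image rides inside the user message as a base64 data URI. A sketch of assembling such a message; the field names follow the OpenAI vision format, and your loader's docs are the authority on what it accepts:

```python
import base64

def vision_message(image_bytes: bytes, question: str,
                   mime: str = "image/png") -> dict:
    """Build one user message carrying an image plus a text question."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# Placeholder bytes just to show the shape; read a real screenshot file in practice
msg = vision_message(b"\x89PNG placeholder", "Which button is highlighted?")
print(msg["content"][1]["image_url"]["url"][:22])  # → data:image/png;base64,
```

An agentic "see the screen" loop is just this message builder fed with fresh screenshots on every step.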

Image generation

Diffusion and flow-matching models that create images from text prompts. Completely different plumbing from LLMs — different loaders, different file formats.

Approach:  Ollama and Open WebUI do not run image models. You need a diffusion-specific loader: ComfyUI (node graphs, most powerful), Automatic1111 / Forge (classic webui), InvokeAI (polished desktop app), SwarmUI (modern UI on top of Comfy), or Draw Things (macOS/iOS native). Weights come as .safetensors checkpoints plus optional LoRAs, VAEs and ControlNets.
Model | VRAM needed | License | Best at | Runs on
FLUX.1 [dev] / [schnell] (Black Forest Labs · 12B) | 12–24 GB | FLUX.1 non-commercial / Apache | Current open SOTA. Stunning prompt adherence & text rendering. | ComfyUI, Forge, SwarmUI, Draw Things
Stable Diffusion 3.5 (Stability AI · Medium · Large) | 8–16 GB | Stability Community License | Best-supported ecosystem, huge LoRA catalog. | ComfyUI, A1111, InvokeAI, Forge
SDXL / SDXL Turbo (Stability AI · 3.5B) | 6–12 GB | OpenRAIL++ | The workhorse. Huge community, countless fine-tunes. | ComfyUI, A1111, InvokeAI, Draw Things
SD 1.5 & fine-tunes (Realistic Vision, DreamShaper…) | 4–6 GB | OpenRAIL / CreativeML | Still unbeaten for specific artistic styles via fine-tunes. | ComfyUI, A1111, Forge
Playground v3 / Kolors (community flagships) | 10–16 GB | Custom / Apache | Distinctive aesthetics, great for commercial art. | ComfyUI
ControlNet · IP-Adapter · LoRAs (not models, add-ons) | ~100 MB each | Mostly permissive | Pose control, style transfer, subject consistency, identity. | ComfyUI, A1111, InvokeAI

Speech: transcription & text-to-speech

Models that turn audio into text (STT / ASR) and text back into natural-sounding voices (TTS). Local speech is now at or above cloud quality.

Approach:  STT typically runs via whisper.cpp (C++ port of OpenAI Whisper) or faster-whisper (CTranslate2). TTS uses its own engines — Piper, Coqui XTTS, F5-TTS, StyleTTS2. LocalAI wraps many of these into one OpenAI-compatible server.
Model | Type | License | Best at | Runs on
Whisper v3 · distil-whisper (OpenAI · tiny → large-v3) | STT | MIT | 99-language transcription, the de facto standard. | whisper.cpp, faster-whisper, LocalAI
Parakeet / Canary (NVIDIA NeMo) | STT | CC-BY-4.0 | Ultra-fast English transcription. | NeMo, faster-whisper
Piper (rhasspy · dozens of voices) | TTS | MIT | Lightning-fast, CPU-friendly TTS. Great for assistants. | Piper, LocalAI, Home Assistant
Coqui XTTS v2 (Coqui · voice cloning) | TTS | CPML (non-commercial) | Clones any voice from 6 seconds of audio. | Coqui TTS, LocalAI
F5-TTS / StyleTTS 2 (natural prosody) | TTS | MIT / CC-BY-NC | Extremely natural, expressive synthesis. | native Python, ComfyUI nodes
Kokoro (compact all-in-one TTS) | TTS | Apache 2.0 | Tiny, fast, surprisingly good quality. | native Python, LocalAI
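
Whichever STT engine you choose, its output is usually a list of timed segments that still need packaging, for example into SubRip (.srt) subtitles. A small self-contained helper, assuming segments arrive as (start_seconds, end_seconds, text) tuples like the ones whisper.cpp and faster-whisper emit:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render transcription segments as a SubRip subtitle file."""
    blocks = [
        f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        for i, (start, end, text) in enumerate(segments, start=1)
    ]
    return "\n\n".join(blocks) + "\n"

print(to_srt([(0.0, 2.5, "Hello there."), (2.5, 5.0, "Local speech works.")]))
```

The same tuples feed equally well into VTT, plain transcripts, or an agent's memory; only the formatter changes.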

Music & sound generation

From background loops to full songs with vocals — local music models are maturing fast.

Approach:  Music models run via Python notebooks, dedicated apps, or ComfyUI audio nodes. Generation is usually slower than real-time on consumer hardware, so you "bake" tracks rather than stream them.
Model | Type | License | Best at | Runs on
Stable Audio Open (Stability AI) | Music / SFX | Stability Community | 47-second clips, sound effects, loops. | ComfyUI, native Python
MusicGen / AudioGen (Meta) | Music / SFX | CC-BY-NC | Text-to-music, melody-conditioned generation. | AudioCraft, ComfyUI
YuE / OpenMusic (open song generators) | Full songs | Apache 2.0 | Multi-minute songs with vocals and structure. | native Python
Bark (suno-ai) | Voice / SFX | MIT | Expressive voice acting, laughs, music snippets. | native Python, ComfyUI

Video generation

The newest — and most hardware-hungry — local modality. Expect seconds of video in exchange for minutes of GPU time.

Approach:  Every serious local video model runs through ComfyUI today. They demand heavy VRAM (16–48 GB) for high resolutions, but quantized GGUF versions are starting to land for consumer GPUs.
Model | VRAM | License | Best at | Runs on
HunyuanVideo (Tencent · 13B) | 24–80 GB | Custom (open) | Current open SOTA; highly cinematic results. | ComfyUI
Wan 2.1 / CogVideoX (community video models) | 12–24 GB | Apache 2.0 | Runs on a single 4090, great prompt adherence. | ComfyUI
LTX-Video (Lightricks) | 8–16 GB | OpenRAIL | Fast, near-real-time short video generation. | ComfyUI
Mochi 1 · AnimateDiff (image-to-video, animation) | 8–24 GB | Apache 2.0 | Animating stills, consistent motion, loops. | ComfyUI, A1111 extension

Embeddings & rerankers

Not chat models — they turn text into vectors for semantic search, RAG and memory. Essential for any agent that needs to "remember".

Approach:  Embedding models are small (often < 1 GB) and run fast on CPU. Ollama and Open WebUI expose them via the /v1/embeddings endpoint. LocalAI and AnythingLLM wire them in for you.
Model | Size | License | Best at | Runs on
nomic-embed-text (Nomic · 137M) | ~275 MB | Apache 2.0 | Great default: 8k context, multilingual variant. | Ollama, Open WebUI, LocalAI
BGE-M3 / BGE-large (BAAI · multi-function) | ~1.3 GB | MIT | Top of MTEB; dense + sparse + ColBERT in one model. | Ollama, llama.cpp, LocalAI
mxbai-embed-large (Mixedbread) | ~670 MB | Apache 2.0 | High quality for its size, great for English RAG. | Ollama, llama.cpp
Jina Embeddings v3 (Jina AI) | ~1.1 GB | CC-BY-NC | Long-context, task-LoRA-switchable. | native Python, Ollama (Q)
bge-reranker-v2-m3 (BAAI · cross-encoder) | ~570 MB | Apache 2.0 | Reranker: huge quality boost on top of any embedder. | LocalAI, native Python
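
Retrieval over those vectors is just nearest-neighbour search by cosine similarity; the reranker then rescores only the resulting short list. A dependency-free sketch of the first stage, with toy 3-d vectors standing in for real embedder output:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents closest to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

# Toy vectors standing in for real embeddings
docs = {"cats": [0.9, 0.1, 0.0], "gpus": [0.1, 0.9, 0.2], "baking": [0.0, 0.2, 0.9]}
print(top_k([0.2, 0.8, 0.1], docs, k=2))  # → ['gpus', 'cats']
```

In a real pipeline the vectors come from the /v1/embeddings endpoint and live in a vector store; the math is the same.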

Tiny & edge models

Sub-4B-parameter models that run well on CPUs, phones, Raspberry Pis, and modest laptops. Surprisingly capable — and perfect for always-on assistants.

Approach:  These models shine at short prompts and structured tasks. Run Q4 GGUFs on CPU through Ollama or llama.cpp. On iOS / Android use MLC-LLM, ExecuTorch or llama.cpp mobile builds.
Model | Size | License | Best at | Runs on
Llama 3.2 1B / 3B (Meta) | 1B–3B | Llama license | Best all-round tiny chat model. | Ollama, Open WebUI, llama.cpp, MLC-LLM
Qwen 2.5 0.5B / 1.5B / 3B (Alibaba) | 0.5B–3B | Apache 2.0 | Astonishing quality per parameter. | Ollama, Open WebUI, llama.cpp
Phi-3.5-mini (Microsoft · 3.8B) | 3.8B | MIT | Reasoning in a small package. | Ollama, llama.cpp
Gemma 2 2B (Google) | 2B | Gemma license | Punchy on CPU, strong safety tuning. | Ollama, llama.cpp
SmolLM 2 (Hugging Face · 135M · 360M · 1.7B) | 0.1B–1.7B | Apache 2.0 | Microscopic assistants, runs on anything. | Ollama, llama.cpp, ExecuTorch
TinyLlama / MobileLLM (mobile-first) | 125M–1.1B | Apache 2.0 | On-device draft models & smart reply. | llama.cpp, MLC-LLM, ExecuTorch

Quick sizing guide

What can your machine actually run?

A rough rule of thumb for Q4-quantized GGUF models. Real-world numbers vary by quantization, context length and loader.

💻 Entry: 8 GB RAM laptop (CPU only). Up to ~3B chat (Llama 3.2 3B, Qwen 2.5 3B), tiny code models, embeddings, Whisper small, Piper TTS, SD 1.5 on CPU (slow).

🖥️ Sweet spot: 16 GB RAM / 8 GB VRAM. 8B chat at good speed, 7B code, 7B vision, SDXL image gen, full Whisper large-v3, light ComfyUI workflows.

🎮 Prosumer: RTX 4090 / M3 Max 64 GB. 30B–70B chat at interactive speeds, FLUX.1, LTX-Video, DeepSeek-R1 Distill 32B, full agent stacks.

🏭 Workstation: multi-GPU workstation / server. Llama 3.1 405B, full DeepSeek V3/R1, HunyuanVideo, vLLM serving a whole team; anything the open ecosystem ships.
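
The rule of thumb above can be made concrete: a Q4_K_M GGUF lands at very roughly 0.6 bytes per parameter, with the KV cache on top of that at long contexts. A rough estimator; the 0.6 bytes/param figure is an approximation of common quant mixes, not a spec:

```python
def q4_gguf_gb(params_billion: float, bytes_per_param: float = 0.6) -> float:
    """Very rough size in GB of a Q4_K_M-quantized model.
    Real files vary with architecture and quant mix; KV cache is extra."""
    # params_billion * 1e9 params * bytes_per_param, divided by 1e9 bytes/GB
    return params_billion * bytes_per_param

for n in (3, 8, 70):
    print(f"{n}B params -> ~{q4_gguf_gb(n):.1f} GB")
```

The outputs line up with the tiers above: an 8B model (~4.8 GB) fits an 8 GB GPU with room for context, while a 70B (~42 GB) needs a prosumer card with offloading or a 64 GB Mac.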

Models change weekly. This catalog changes with them.

Bookmark this page and check the comparison table for the right loader to pair with each model.

