Reality check: your phone is not a laptop. Expect 1B–3B models to feel instant, 7–8B models to work but warm up the device, and anything bigger than ~10 GB on disk simply not to fit. The speeds below are typical of a 2024–2025 flagship (iPhone 15 Pro / Pixel 9 Pro / Galaxy S25). Nearly all the apps and models on this page are free and open source; the exceptions are flagged.
📱 Mobile apps that run LLMs on-device

Install, download a model inside the app, put the phone in airplane mode — it still works. The apps listed here run inference with no server dependency and do not ship prompts off the device; the few exceptions (closed-source apps and remote-server clients such as Enchanted) are flagged in the table.

How they work: these apps ship a compact inference engine (usually llama.cpp, MLC-LLM, or ExecuTorch) compiled for ARM with Metal / Vulkan / NNAPI acceleration. You download quantized weights (GGUF or MLC-compiled) once; after that, every token is computed on the phone's CPU, GPU, and Neural Engine.
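The quantized weights are what make this possible: a rough rule of thumb lets you estimate a model's download size from its parameter count. The bits-per-weight figures below are approximate effective rates for llama.cpp quant types, not exact GGUF numbers, and real files add small metadata overhead:

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Approximate on-disk size of a quantized GGUF file.

    bits_per_weight: ~4.85 for Q4_K_M, ~4.55 for Q4_0, ~8.5 for Q8_0
    (approximate effective rates, including quantization scales).
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB, matching download-page sizes

# Llama 3.2 3B at Q4_K_M: roughly 1.8 GB, in line with the ~2 GB files you see
print(round(gguf_size_gb(3.0), 2))
```

This is why a 3B model is a ~2 GB download while an 8B model is closer to 5 GB — size scales linearly with parameter count at a fixed quantization.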
| App | Platform | License | Engine | Good for |
| --- | --- | --- | --- | --- |
| PocketPal AI (a-ghorbani/pocketpal-ai) | iOS + Android | MIT | llama.cpp | The most polished free option. Model picker, chat history, system prompts, benchmarks, side-by-side compare — all on-device. |
| MLC Chat (mlc-ai/mlc-llm) | iOS + Android | Apache 2.0 | MLC-LLM + TVM | Best raw speed on mobile GPUs (Metal / Vulkan). Runs Llama, Gemma, Phi, Mistral out of the box. |
| LLMFarm (guinmoon/LLMFarm) | iOS / macOS | MIT | llama.cpp + ggml | Veteran iOS app. Load any GGUF from Hugging Face, tune sampler params, use custom prompts. |
| Private LLM (numericcal; paid, source-available) | iOS / macOS | Proprietary | Custom MLX/GGML | Polished UX, Shortcuts integration. Not open source — listed for completeness only. |
| Enchanted (gluonfield/enchanted) | iOS / macOS | Apache 2.0 | Connects to Ollama | Not strictly on-device — a beautiful SwiftUI client for an Ollama server running on your desktop at home. Perfect pairing with a local runner. |
| Layla (layla-app.com, community build) | Android | Source-available | llama.cpp | Role-play / character-focused. Fully offline, large model library, lorebook support. |
| Maid (Mobile-Artificial-Intelligence/maid) | iOS + Android | MIT | llama.cpp (Flutter wrapper) | Cross-platform Flutter app — loads GGUF locally or talks to a remote Ollama / OpenAI-compatible server. |
| ChatterUI (Vali-98/ChatterUI) | Android | AGPL-3.0 | llama.cpp (React Native) | Rich chat UI with character cards, local llama.cpp backend, optional remote APIs. |
| Termux + llama.cpp (DIY · termux.dev) | Android | GPL / MIT | llama.cpp CLI | Compile llama-cli natively inside a Termux shell. Nerdy but gives 100% control; works on any ARM64 device. |
| a-Shell + llama.cpp (holzschu/a-shell) | iOS / iPadOS | BSD-3 | llama.cpp (WASM) | Terminal on iPhone. Slower (no Metal offload), but a fun "run an LLM in a shell on your phone" demo. |
🧠 Models that actually run on a phone

Not every open-weight model will fit on a phone. Below are the ones that are small enough in Q4 quantization to load on 6–8 GB RAM devices and still produce useful output. All are free and on Hugging Face.

Approach: pick a Q4_K_M or Q4_0 GGUF (the smallest quantizations with decent quality). On iPhone / Pixel flagships, 3B models run at 15–30 tokens/sec and 7–8B models at 6–12 tok/s. Below 1.5B, expect "instant" responses but simpler reasoning. Memory pressure matters more than raw model size — close other apps.
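Those speeds are not arbitrary: token generation on phones is mostly memory-bandwidth-bound, because each new token streams roughly the whole weight file from RAM. A back-of-the-envelope sketch — the ~50 GB/s figure is a ballpark LPDDR5 number and the efficiency factor is an assumption, not a measurement:

```python
def decode_tokens_per_sec(model_gb: float, mem_bw_gbps: float,
                          efficiency: float = 0.6) -> float:
    """Estimate generation speed for a memory-bandwidth-bound decoder.

    Each generated token reads ~all quantized weights once, so
    tok/s ~= effective bandwidth / model size. `efficiency` is an assumed
    discount for cache misses, KV-cache reads and scheduling overhead.
    """
    return mem_bw_gbps * efficiency / model_gb

# ~2 GB (3B Q4) model on a flagship with ~50 GB/s RAM: mid-teens tok/s,
# consistent with the 15-30 tok/s range quoted above
print(round(decode_tokens_per_sec(2.0, 50.0)))
```

The same formula explains why a 7B model (~4 GB) is roughly half as fast as a 3B on the same phone, and why NPU offload helps less for decoding than for prompt processing.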
| Model | Size (Q4) | License | Why it's great on mobile | Runs in |
| --- | --- | --- | --- | --- |
| Llama 3.2 1B & 3B (Meta, mobile-focused release) | ~0.8 / ~2.0 GB | Llama 3.2 license | Purpose-built by Meta for phones. The 3B is the current sweet spot — fluent chat, summarisation, tool-calling. | PocketPal, MLC Chat, LLMFarm, Maid |
| Qwen 2.5 0.5B / 1.5B / 3B (Alibaba) | ~0.4–2.0 GB | Apache 2.0 | Strongest small-model lineage right now. The 1.5B feels like a 3B from last year. | PocketPal, MLC Chat, LLMFarm |
| Phi-3.5 Mini (3.8B, Microsoft) | ~2.4 GB | MIT | Designed specifically for "small enough to run on a phone, smart enough to be useful". Tool-calling is solid. | PocketPal, MLC Chat |
| Gemma 2 2B (Google) | ~1.5 GB | Gemma license | Best "tiny" model for multilingual chat. Fast on Tensor G3/G4 NPUs. | PocketPal, MLC Chat, MediaPipe |
| SmolLM2 135M / 360M / 1.7B (HuggingFace) | ~0.1–1.0 GB | Apache 2.0 | The smallest usable LLMs ever shipped. The 360M runs on a watch-class CPU. | PocketPal, llama.cpp |
| TinyLlama 1.1B (StatNLP Research) | ~0.6 GB | Apache 2.0 | The classic tiny chat model. Fluent, fast, great for quick drafting and a fun first install. | PocketPal, LLMFarm |
| Mistral 7B Instruct v0.3 (Mistral AI) | ~4.1 GB | Apache 2.0 | The upper bound of "runs well on a flagship phone". Noticeably smarter than 3B but warms the device. | PocketPal, MLC Chat (8 GB+ RAM) |
| Llama 3.1 8B Instruct (Meta) | ~4.7 GB | Llama 3.1 license | The biggest "mainstream" model that fits. Best quality you'll get on a 2024+ flagship. | PocketPal, MLC Chat (iPhone 15 Pro+ / S24+) |
| Qwen 2.5 Coder 1.5B / 3B (Alibaba) | ~1.0 / ~2.0 GB | Apache 2.0 | On-device code autocomplete in a Termux-based dev setup. Surprisingly capable. | PocketPal, Termux |
| DeepSeek-R1 Distill Qwen 1.5B (DeepSeek) | ~1.1 GB | MIT | Yes — you can run a "reasoning" model on your phone. Slow (it thinks first) but smart. | PocketPal, MLC Chat |
🎨 Image & audio models on mobile

Beyond chat — phones can also run image generation, speech recognition, and text-to-speech fully offline. Expect slow generation but high privacy.

Approach: image diffusion is the slowest modality on mobile (10–60 seconds per image on flagships). Whisper transcription and Piper TTS are both near-realtime. CoreML (iOS) and NNAPI / QNN (Android) unlock large speedups when the app supports them.
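Realtime multiples translate directly into wait times, which is worth doing once to set expectations. A quick sketch, using the 2–5× realtime range quoted for whisper.cpp below (pure arithmetic, no model needed):

```python
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Time to transcribe a clip when the engine runs at
    `realtime_factor` x realtime (whisper.cpp on recent phones: ~2-5x)."""
    return audio_seconds / realtime_factor

# A 10-minute voice note at 4x realtime: about 2.5 minutes of processing
print(transcription_seconds(600, 4) / 60)
```

For comparison, a single diffusion image at 30 s costs about the same as transcribing two minutes of audio — chat and speech are cheap, pixels are not.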
| Model / App | Platform | License | What it does | Notes |
| --- | --- | --- | --- | --- |
| Draw Things (drawthings.ai) | iOS / macOS | Freeware | Stable Diffusion / SDXL / Flux on-device | The reference iOS image-gen app. CoreML-accelerated; runs SDXL on an iPhone 15 Pro in ~30 s. |
| Local Diffusion / SD AI (various OSS ports) | Android | GPL / MIT | Stable Diffusion 1.5 on-device | Qualcomm Snapdragon 8 Gen 3+ can run SD 1.5 in ~10 s using the QNN NPU path. |
| Whisper small / base / tiny (OpenAI · ggerganov/whisper.cpp) | iOS + Android | MIT | Speech-to-text transcription | whisper.cpp runs on-device transcription at 2–5× realtime. Foundation of many voice-note apps. |
| Whisper-Turbo (OpenAI, 2024 release) | iOS + Android | MIT | Fast multilingual STT | Roughly 8× faster than large-v3 with near-identical quality. Ideal for mobile dictation. |
| Piper TTS (rhasspy/piper) | iOS + Android | MIT | Neural text-to-speech | ~50 MB voices, realtime synthesis on any ARM phone. Great for a private screen-reader. |
| MeloTTS / XTTS-streaming (MyShell · Coqui) | Android (experimental) | MIT / MPL | Higher-quality TTS / voice cloning | Runs, but slowly; more practical paired with a home server over Tailscale. |
⚙️ What runs on which phone

A quick, honest sizing guide. Assumes Q4 GGUF unless noted. "Tok/s" figures are approximate generation (decode) speed, excluding prompt processing, at moderate sampling temperature.

🍏 Entry: iPhone 12–14 (6–8 GB RAM)

Smooth up to 1–3B models. Llama 3.2 1B at ~40 tok/s, Qwen 2.5 1.5B at ~25 tok/s. 7B models load but feel slow (<5 tok/s) and warm the phone fast.

🚀 Sweet spot: iPhone 15 Pro / 16 Pro (8 GB RAM)

Llama 3.2 3B at ~25 tok/s, Mistral 7B at 8–12 tok/s via the MLC-LLM Metal backend. The Apple Neural Engine is used heavily when the app supports CoreML.

🤖 Flagship: Pixel 9 Pro / Galaxy S25 (12 GB RAM)

Best Android tier. Runs Llama 3.1 8B at 6–10 tok/s, 3B models at 25–35 tok/s. The Snapdragon 8 Gen 3/4 QNN path unlocks NPU offload in supported apps.

📟 Budget: mid-range Android (4–6 GB RAM)

Stick to <2B models. TinyLlama, SmolLM2 1.7B, and Qwen 2.5 0.5B all work. Anything bigger will swap and crawl.
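The tiers above boil down to one constraint: the quantized weights plus runtime buffers have to fit in RAM beside the OS. A hedged sketch of that check — the ~2 GB OS reserve and the 1.2× working-set multiplier are assumptions for illustration, not measured values:

```python
def fits_on_phone(model_gb: float, ram_gb: float,
                  os_reserve_gb: float = 2.0, overhead: float = 1.2) -> bool:
    """Rough 'will it load?' check: quantized weights plus KV cache and
    runtime buffers (~20% assumed) must fit beside the OS and other apps."""
    return model_gb * overhead <= ram_gb - os_reserve_gb

# Matches the tier guide: 3B fits everywhere above budget phones,
# 8B needs a 12 GB flagship
for name, size_gb in [("Llama 3.2 3B (Q4)", 2.0), ("Llama 3.1 8B (Q4)", 4.7)]:
    print(name, "6 GB:", fits_on_phone(size_gb, 6),
          "| 12 GB:", fits_on_phone(size_gb, 12))
```

When the check fails only narrowly, a smaller context window or a tighter quantization (Q4_0, Q3_K) can sometimes rescue the load.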
🛠️ Build your own mobile AI app

Open-source SDKs that let you embed a local LLM into your iOS or Android app. All free, all runnable fully offline.

| SDK | Platform | Language | License | Why pick it |
| --- | --- | --- | --- | --- |
| llama.cpp (ggerganov/llama.cpp) | iOS + Android | C / C++ (bindings for Swift, Kotlin, JS, Python) | MIT | The universal local-LLM engine. Prebuilt Metal (iOS) and Vulkan (Android) backends; active weekly releases. |
| MLC-LLM (mlc-ai/mlc-llm) | iOS + Android | Swift / Kotlin / JS | Apache 2.0 | TVM-compiled models give the fastest GPU path on mobile. Great for apps that ship one fixed model. |
| ExecuTorch (pytorch/executorch) | iOS + Android | C++ / Swift / Kotlin | BSD-3 | The PyTorch team's on-device runtime. Designed for Llama 3.2 mobile deployment with hardware-partitioned compute. |
| MediaPipe LLM Inference (google-ai-edge/mediapipe) | iOS + Android | Swift / Kotlin / JS | Apache 2.0 | Google's drop-in API for running Gemma and friends. Simple, well documented, a great first SDK. |
| Apple CoreML + MLX (ml-explore/mlx-swift) | iOS / macOS | Swift | MIT / Apple | Tightest Apple Silicon integration. MLX Swift runs MLX-quantized models natively on Apple silicon. |
| ONNX Runtime Mobile (microsoft/onnxruntime) | iOS + Android | C++ / Swift / Kotlin | MIT | If your model is already in ONNX, this is the fastest path. Especially strong for small classification / embedding models. |
| Qualcomm AI Hub (aihub.qualcomm.com) | Android (Snapdragon) | Native | Free tier (models OSS) | Pre-optimised open-weight models targeting the Snapdragon NPU. Big speedups, but Snapdragon-only. |

Mobile AI FAQ

The questions we're asked most often about running models on a phone.

Does it really work offline?

Yes. Once you've downloaded the app and the model file, the whole inference loop runs on the phone's CPU / GPU / NPU. Airplane mode has zero impact on quality or speed — toggling it on is an easy way to prove this to yourself.

Will it drain my battery?

A 5-minute conversation with a 3B model uses roughly the same battery as 5 minutes of 4K video recording. Fine for occasional use; don't leave it generating in a loop.

Is it private?

If the app is open source and has no network permission requests, yes — prompts never leave the device. Check the app's privacy label / manifest. Every app listed on this page is auditable.

Why isn't it as smart as ChatGPT?

You're running a ~3B parameter model instead of a ~1T one. The gap is real. For private chat, summarisation, quick code snippets, tool-calling — on-device is already enough. For "teach me quantum physics from scratch" — pair your phone with a home server running Ollama via Tailscale.

Can I use my home PC from my phone?

Absolutely — and it's the best of both worlds. Run Ollama or Open WebUI on your desktop, install Tailscale, then use the Enchanted or Maid app on your phone. You get 70B-class quality with phone-class convenience, zero cloud dependency.
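Under the hood, clients like Enchanted and Maid just speak Ollama's HTTP API to your desktop's tailnet address. A minimal sketch of the request they send — the hostname is a hypothetical Tailscale MagicDNS name, and the actual network call is left commented out:

```python
import json
from urllib import request

# Hypothetical tailnet hostname for your desktop; Ollama listens on port 11434.
OLLAMA_URL = "http://my-desktop.tailnet-name.ts.net:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> request.Request:
    """Build the POST a mobile client sends to a remote Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("llama3.1:70b", "Explain attention in two sentences.")
print(req.get_full_url())
# To actually send it from the phone (or any machine on the tailnet):
#   reply = json.load(request.urlopen(req))["response"]
```

Because Tailscale encrypts the link end to end, nothing here ever touches a third-party server — the "cloud" is your own desktop.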

What about Apple Intelligence / Gemini Nano?

Those are closed-source, tied to specific OS versions, and limited in what they'll answer. The apps on this page are open-source alternatives that work on any phone, with any model you choose, with no usage limits.

Your phone is already a private AI device.

Install an app, download a model, put the phone in airplane mode — you're running production-grade AI on a device in your pocket, with zero dependency on anyone else's servers.

See the apps → Browse all models

Chat with other mobile-AI builders

Swap benchmarks, troubleshoot model loads, show off your on-device setups — with 40K+ local-AI enthusiasts.