Modern iPhones and Android flagships have enough compute to run 1B–8B parameter language models at conversational speed — fully offline, no account, no telemetry, and no bill. Here are the open-source apps that make it real, and the free open-weight models they run.
Install, download a model inside the app, put the phone in airplane mode — it still works. All the apps listed here are open source, have no server dependency, and do not ship prompts off the device.
Under the hood, every app here wraps an open-source inference engine (llama.cpp, MLC-LLM, or ExecuTorch) compiled for ARM with Metal / Vulkan / NNAPI acceleration. You download quantized weights (GGUF or MLC-compiled) once; after that, every token is computed on the phone's CPU, GPU, and Neural Engine.
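The arithmetic behind those download sizes is simple enough to sketch. Assuming roughly 4.5 effective bits per weight for a Q4_K_M-style quantization (a ballpark that folds in scales and zero-points, not an exact format spec):

```python
def q4_file_size_gb(n_params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Rough GGUF file size for a Q4_K_M-style quantization.

    bits_per_weight ~4.5 is an approximation: 4-bit weights plus
    per-block scale/zero-point metadata.
    """
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Llama 3.2 3B (3.2B params) at Q4 -> close to the ~2 GB quoted below
print(round(q4_file_size_gb(3.2), 1))  # → 1.8
```

The same back-of-envelope math explains why 7–8B models sit right at the edge of an 8 GB phone once you add KV-cache and app overhead.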
| App | Platform | License | Engine | Good for |
|---|---|---|---|---|
| PocketPal AI · a-ghorbani/pocketpal-ai | iOS + Android | MIT | llama.cpp | The most polished free option. Model picker, chat history, system prompts, benchmarks, side-by-side compare — all on-device. |
| MLC Chat · mlc-ai/mlc-llm | iOS + Android | Apache 2.0 | MLC-LLM + TVM | Best raw speed on mobile GPUs (Metal / Vulkan). Runs Llama, Gemma, Phi, Mistral out of the box. |
| LLMFarm · guinmoon/LLMFarm | iOS / macOS | MIT | llama.cpp + ggml | Veteran iOS app. Load any GGUF from Hugging Face, tune sampler params, use custom prompts. |
| Private LLM · numericcal (paid, source-available) | iOS / macOS | Proprietary | Custom MLX/GGML | Polished UX, Shortcuts integration. Not open source — listed for completeness only. |
| Enchanted · gluonfield/enchanted | iOS / macOS | Apache 2.0 | Connects to Ollama | Not strictly on-device — a beautiful SwiftUI client for an Ollama server running on your desktop at home. Perfect pairing with a local runner. |
| Layla · layla-app.com (community build) | Android | Source-available | llama.cpp | Role-play / character focused. Fully offline, large model library, lorebook support. |
| Maid · Mobile-Artificial-Intelligence/maid | iOS + Android | MIT | llama.cpp (Flutter wrapper) | Cross-platform Flutter app — loads GGUF locally or talks to a remote Ollama / OpenAI-compatible server. |
| ChatterUI · Vali-98/ChatterUI | Android | AGPL-3.0 | llama.cpp (React Native) | Rich chat UI with character cards, local llama.cpp backend, optional remote APIs. |
| Termux + llama.cpp · DIY, termux.dev | Android | GPL / MIT | llama.cpp CLI | Compile llama-cli natively inside a Termux shell. Nerdy, but gives you full control and works on any ARM64 device. |
| a-Shell + llama.cpp · holzschu/a-shell | iOS / iPadOS | BSD-3 | llama.cpp (WASM) | Terminal on iPhone. Slower (no Metal offload), but a fun "run an LLM in a shell on your phone" demo. |
Not every open-weight model will fit on a phone. Below are the ones that are small enough in Q4 quantization to load on 6–8 GB RAM devices and still produce useful output. All are free and on Hugging Face.
| Model | Size (Q4) | License | Why it's great on mobile | Runs in |
|---|---|---|---|---|
| Llama 3.2 1B & 3B · Meta (mobile-focused release) | ~0.8 / ~2.0 GB | Llama 3.2 license | Purpose-built by Meta for phones. The 3B is the current sweet spot — fluent chat, summarisation, tool-calling. | PocketPal, MLC Chat, LLMFarm, Maid |
| Qwen 2.5 0.5B / 1.5B / 3B · Alibaba | ~0.4–2.0 GB | Apache 2.0 | Strongest small-model lineage right now. The 1.5B feels like a 3B from last year. | PocketPal, MLC Chat, LLMFarm |
| Phi-3.5 Mini (3.8B) · Microsoft | ~2.4 GB | MIT | Designed to be small enough to run on a phone yet smart enough to be useful. Tool-calling is solid. | PocketPal, MLC Chat |
| Gemma 2 2B · Google | ~1.5 GB | Gemma license | Best "tiny" model for multilingual chat. Fast on Tensor G3/G4 NPU. | PocketPal, MLC Chat, MediaPipe |
| SmolLM2 135M / 360M / 1.7B · Hugging Face | ~0.1–1.0 GB | Apache 2.0 | The smallest usable LLMs ever shipped. The 360M runs on a watch-class CPU. | PocketPal, llama.cpp |
| TinyLlama 1.1B · StatNLP Research | ~0.6 GB | Apache 2.0 | The classic tiny chat model. Fluent, fast, great for quick drafting and a fun first install. | PocketPal, LLMFarm |
| Mistral 7B Instruct v0.3 · Mistral AI | ~4.1 GB | Apache 2.0 | The upper bound of "runs well on a flagship phone". Noticeably smarter than a 3B but warms the device. | PocketPal, MLC Chat (8 GB+ RAM) |
| Llama 3.1 8B Instruct · Meta | ~4.7 GB | Llama 3.1 license | The biggest "mainstream" model that fits. The best quality you'll get on a 2024+ flagship. | PocketPal, MLC Chat (iPhone 15 Pro+ / S24+) |
| Qwen 2.5 Coder 1.5B / 3B · Alibaba | ~1.0 / ~2.0 GB | Apache 2.0 | On-device code autocomplete in a Termux-based dev setup. Surprisingly capable. | PocketPal, Termux |
| DeepSeek-R1 Distill Qwen 1.5B · DeepSeek | ~1.1 GB | MIT | Yes — you can run a "reasoning" model on your phone. Slow (it thinks first) but smart. | PocketPal, MLC Chat |
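As a rule of thumb, the model file plus 1–2 GB of runtime headroom (KV cache, app overhead) has to fit in free RAM. A hypothetical picker using the Q4 sizes from the table above — not an API any of these apps expose:

```python
# Q4 file sizes (GB) copied from the table above.
MODELS = {
    "SmolLM2 1.7B": 1.0,
    "Gemma 2 2B": 1.5,
    "Llama 3.2 3B": 2.0,
    "Phi-3.5 Mini": 2.4,
    "Mistral 7B": 4.1,
    "Llama 3.1 8B": 4.7,
}

def largest_fit(free_ram_gb: float, headroom_gb: float = 1.5):
    """Biggest model whose weights plus KV-cache/app headroom fit in free RAM."""
    candidates = [(size, name) for name, size in MODELS.items()
                  if size + headroom_gb <= free_ram_gb]
    return max(candidates)[1] if candidates else None

print(largest_fit(4.0))  # ~4 GB free → Phi-3.5 Mini
print(largest_fit(6.5))  # flagship-class free RAM → Llama 3.1 8B
```

This is also why a "6 GB RAM" phone can't load an 8B model: the OS and other apps rarely leave more than 4–5 GB genuinely free.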
Beyond chat — phones can also run image generation, speech recognition, and text-to-speech fully offline. Expect slow generation but high privacy.
| Model / App | Platform | License | What it does | Notes |
|---|---|---|---|---|
| Draw Things · drawthings.ai | iOS / macOS | Freeware | Stable Diffusion / SDXL / Flux on-device | The reference iOS image-gen app. CoreML-accelerated; runs SDXL on an iPhone 15 Pro in ~30 s. |
| Local Diffusion / SD AI · various OSS ports | Android | GPL / MIT | Stable Diffusion 1.5 on-device | Snapdragon 8 Gen 3+ can run SD 1.5 in ~10 s using the QNN NPU path. |
| Whisper (small / base / tiny) · OpenAI, ggerganov/whisper.cpp | iOS + Android | MIT | Speech-to-text transcription | whisper.cpp runs on-device transcription at 2–5× realtime. The foundation of many voice-note apps. |
| Whisper large-v3-turbo · OpenAI (2024 release) | iOS + Android | MIT | Fast multilingual STT | A pruned decoder makes it several times faster than large-v3 at near-identical quality. Ideal for mobile dictation. |
| Piper TTS · rhasspy/piper | iOS + Android | MIT | Neural text-to-speech | ~50 MB voices, realtime synthesis on any ARM phone. Great for a private screen reader. |
| MeloTTS / XTTS-streaming · MyShell, Coqui | Android (experimental) | MIT / MPL | Higher-quality TTS / voice cloning | Runs, but slowly; more practical paired with a home server over Tailscale. |
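The "2–5× realtime" figure for whisper.cpp converts directly into wait time. A back-of-envelope sketch, with the factor as a rough mid-range assumption (it varies by model size and chip):

```python
def transcription_wait(audio_seconds: float, realtime_factor: float = 3.0) -> float:
    """Seconds whisper.cpp needs for a clip, assuming it chews through
    `realtime_factor` seconds of audio per wall-clock second."""
    return audio_seconds / realtime_factor

# A 10-minute voice note at ~3x realtime:
print(round(transcription_wait(600)))  # → 200 seconds
```

In other words, transcribing a long voice memo on-device is a "wait a few minutes" job, not an instant one — but it never touches a server.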
A quick, honest sizing guide. Assumes Q4 GGUF unless noted. Tok/s figures are approximate decode speed (prompt processing excluded) with typical sampling settings.
Smooth up to 1–3B models. Llama 3.2 1B at ~40 tok/s, Qwen 2.5 1.5B at ~25 tok/s. 7B models load but feel slow (<5 tok/s) and warm the phone fast.
The sweet spot. Llama 3.2 3B at ~25 tok/s, Mistral 7B at 8–12 tok/s via MLC-LLM Metal backend. Apple Neural Engine used heavily when the app supports CoreML.
Best Android tier. Runs Llama 3.1 8B at 6–10 tok/s, 3B at 25–35 tok/s. Snapdragon 8 Gen 3/4 QNN path unlocks NPU offload in supported apps.
Stick to <2B models. TinyLlama, SmolLM2 1.7B, Qwen 2.5 0.5B all work. Anything bigger will swap and crawl.
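To make the numbers above concrete, decode speed converts directly into wait time. A rough model that ignores prompt-processing time, which adds a few seconds up front:

```python
def wait_seconds(n_tokens: int, tok_per_s: float) -> float:
    """Seconds to generate n_tokens at a steady decode rate."""
    return n_tokens / tok_per_s

# A ~300-token answer (a few paragraphs):
print(round(wait_seconds(300, 25)))  # Llama 3.2 3B on a recent flagship → 12 s
print(round(wait_seconds(300, 8)))   # Mistral 7B at the slow end → 38 s
```

A 12-second paragraph feels conversational; a 40-second one feels like waiting — which is why the 3B class is called the sweet spot above.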
Open-source SDKs that let you embed a local LLM into your iOS or Android app. All free, all runnable fully offline.
| SDK | Platform | Language | License | Why pick it |
|---|---|---|---|---|
| llama.cpp · ggerganov/llama.cpp | iOS + Android | C / C++ (bindings for Swift, Kotlin, JS, Python) | MIT | The universal local-LLM engine. Prebuilt Metal (iOS) and Vulkan (Android) backends; active weekly releases. |
| MLC-LLM · mlc-ai/mlc-llm | iOS + Android | Swift / Kotlin / JS | Apache 2.0 | TVM-compiled models give the fastest GPU path on mobile. Great for apps that ship one fixed model. |
| ExecuTorch · pytorch/executorch | iOS + Android | C++ / Swift / Kotlin | BSD-3 | The PyTorch team's on-device runtime. Designed for Llama 3.2 mobile deployment with hardware-partitioned compute. |
| MediaPipe LLM Inference · google-ai-edge/mediapipe | iOS + Android | Swift / Kotlin / JS | Apache 2.0 | Google's drop-in API for running Gemma and friends. Simple, well documented, a great first SDK. |
| Apple CoreML + MLX · ml-explore/mlx-swift | iOS / macOS | Swift | MIT / Apple | Tightest Apple Silicon integration. MLX Swift runs MLX-quantized models on the GPU via Metal; CoreML adds Neural Engine offload. |
| ONNX Runtime Mobile · microsoft/onnxruntime | iOS + Android | C++ / Swift / Kotlin | MIT | If your model is already in ONNX, this is the fastest path. Especially strong for small classification / embedding models. |
| Qualcomm AI Hub · aihub.qualcomm.com | Android (Snapdragon) | Native | Free tier (models OSS) | Pre-optimised open-weight models targeting the Snapdragon NPU. Big speedups, but Snapdragon-only. |
The questions we're asked most often about running models on a phone.
**Does it really work offline?** Yes. Once you've downloaded the app and the model file, the whole inference loop runs on the phone's CPU / GPU / NPU. Airplane mode has zero impact on quality or speed — toggling it on is a common way to prove this to yourself.
**What does it do to battery life?** A 5-minute conversation with a 3B model uses roughly the same battery as 5 minutes of 4K video recording. Fine for occasional use; don't leave it generating in a loop.
**Is it actually private?** If the app is open source and requests no network permissions, yes — prompts never leave the device. Check the app's privacy label / manifest. Every app listed on this page is auditable.
**How much worse is it than ChatGPT?** You're running a ~3B-parameter model instead of a ~1T one, and the gap is real. For private chat, summarisation, quick code snippets, and tool-calling, on-device is already enough. For "teach me quantum physics from scratch", pair your phone with a home server running Ollama via Tailscale.
**Can I use my phone as a client for a bigger model at home?** Absolutely — and it's the best of both worlds. Run Ollama or Open WebUI on your desktop, install Tailscale, then use the Enchanted or Maid app on your phone. You get 70B-class quality with phone-class convenience and zero cloud dependency.
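A minimal sketch of what those client apps do under the hood — here as a Python stdlib client building a non-streaming request for Ollama's `/api/generate` endpoint. The hostname `my-desktop` is a hypothetical Tailscale MagicDNS name, and the model tag is just an example:

```python
import json
from urllib import request

# Hypothetical Tailscale MagicDNS name; Ollama listens on port 11434 by default.
OLLAMA_URL = "http://my-desktop:11434/api/generate"

def ask(prompt: str, model: str = "llama3.1:70b") -> request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return request.Request(OLLAMA_URL, data=body,
                           headers={"Content-Type": "application/json"})

req = ask("Explain quantum tunnelling simply.")
# answer = json.loads(request.urlopen(req).read())["response"]  # needs the server up
print(json.loads(req.data)["model"])  # → llama3.1:70b
```

Because Tailscale puts the phone and desktop on the same private network, nothing here is exposed to the public internet.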
**Why not just use Apple Intelligence or Gemini Nano?** Those are closed source, tied to specific OS versions, and limited in what they'll answer. The apps on this page are open-source alternatives that work on any phone, with any model you choose, and no usage limits.
Install an app, download a model, put the phone in airplane mode — you're running production-grade AI on a device in your pocket, with zero dependency on anyone else's servers.