Modern iPhones and Android flagships have enough compute to run 1B–8B parameter language models at conversational speed — fully offline, no account, no telemetry, and no bill. Here are the open-source apps that make it real, and the free open-weight models they run.
Install, download a model inside the app, put the phone in airplane mode — it still works. All the apps listed here are open source, have no server dependency, and do not ship prompts off the device.
Under the hood, every app here wraps an open-source inference engine (llama.cpp, MLC-LLM, or ExecuTorch) compiled for ARM with Metal / Vulkan / NNAPI acceleration. You download quantized weights (GGUF or MLC-compiled) once; after that, every token is computed on the phone's CPU, GPU, and Neural Engine.
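The arithmetic behind those download sizes is simple enough to sketch. Assuming roughly 4.5 effective bits per weight for a Q4_K_M-style quantization (a ballpark that folds in scales and zero-points, not an exact format spec):

```python
def q4_file_size_gb(n_params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Rough GGUF file size for a Q4_K_M-style quantization.

    bits_per_weight ~4.5 is an approximation: 4-bit weights plus
    per-block scale/zero-point metadata.
    """
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Llama 3.2 3B (3.2B params) at Q4 -> close to the ~2 GB quoted below
print(round(q4_file_size_gb(3.2), 1))  # → 1.8
```

The same back-of-envelope math explains why 7–8B models sit right at the edge of an 8 GB phone once you add KV-cache and app overhead.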
| App | Platform | License | Engine | Good for |
|---|---|---|---|---|
| PocketPal AI · a-ghorbani/pocketpal-ai | iOS + Android | MIT | llama.cpp | The most polished free option. Model picker, chat history, system prompts, benchmarks, side-by-side compare — all on-device. |
| MLC Chat · mlc-ai/mlc-llm | iOS + Android | Apache 2.0 | MLC-LLM + TVM | Best raw speed on mobile GPUs (Metal / Vulkan). Runs Llama, Gemma, Phi, Mistral out of the box. |
| LLMFarm · guinmoon/LLMFarm | iOS / macOS | MIT | llama.cpp + ggml | Veteran iOS app. Load any GGUF from Hugging Face, tune sampler params, use custom prompts. |
| Private LLM · numericcal (paid, source-available) | iOS / macOS | Proprietary | Custom MLX/GGML | Polished UX, Shortcuts integration. Not open source — listed for completeness only. |
| Enchanted · gluonfield/enchanted | iOS / macOS | Apache 2.0 | Connects to Ollama | Not strictly on-device — a beautiful SwiftUI client for an Ollama server running on your desktop at home. Perfect pairing with a local runner. |
| Layla · layla-app.com (community build) | Android | Source-available | llama.cpp | Role-play / character focused. Fully offline, large model library, lorebook support. |
| Maid · Mobile-Artificial-Intelligence/maid | iOS + Android | MIT | llama.cpp (Flutter wrapper) | Cross-platform Flutter app — loads GGUF locally or talks to a remote Ollama / OpenAI-compatible server. |
| ChatterUI · Vali-98/ChatterUI | Android | AGPL-3.0 | llama.cpp (React Native) | Rich chat UI with character cards, local llama.cpp backend, optional remote APIs. |
| Termux + llama.cpp · DIY, termux.dev | Android | GPL / MIT | llama.cpp CLI | Compile llama-cli natively inside a Termux shell. Nerdy, but gives you full control and works on any ARM64 device. |
| a-Shell + llama.cpp · holzschu/a-shell | iOS / iPadOS | BSD-3 | llama.cpp (WASM) | Terminal on iPhone. Slower (no Metal offload), but a fun "run an LLM in a shell on your phone" demo. |
Not every open-weight model will fit on a phone. Below are the ones that are small enough in Q4 quantization to load on 6–8 GB RAM devices and still produce useful output. All are free and on Hugging Face.
| Model | Size (Q4) | License | Why it's great on mobile | Runs in |
|---|---|---|---|---|
| Llama 3.2 1B & 3B · Meta (mobile-focused release) | ~0.8 / ~2.0 GB | Llama 3.2 license | Purpose-built by Meta for phones. The 3B is the current sweet spot — fluent chat, summarisation, tool-calling. | PocketPal, MLC Chat, LLMFarm, Maid |
| Qwen 2.5 0.5B / 1.5B / 3B · Alibaba | ~0.4–2.0 GB | Apache 2.0 | Strongest small-model lineage right now. The 1.5B feels like a 3B from last year. | PocketPal, MLC Chat, LLMFarm |
| Phi-3.5 Mini (3.8B) · Microsoft | ~2.4 GB | MIT | Designed to be small enough to run on a phone yet smart enough to be useful. Tool-calling is solid. | PocketPal, MLC Chat |
| Gemma 2 2B · Google | ~1.5 GB | Gemma license | Best "tiny" model for multilingual chat. Fast on Tensor G3/G4 NPU. | PocketPal, MLC Chat, MediaPipe |
| SmolLM2 135M / 360M / 1.7B · Hugging Face | ~0.1–1.0 GB | Apache 2.0 | The smallest usable LLMs ever shipped. The 360M runs on a watch-class CPU. | PocketPal, llama.cpp |
| TinyLlama 1.1B · StatNLP Research | ~0.6 GB | Apache 2.0 | The classic tiny chat model. Fluent, fast, great for quick drafting and a fun first install. | PocketPal, LLMFarm |
| Mistral 7B Instruct v0.3 · Mistral AI | ~4.1 GB | Apache 2.0 | The upper bound of "runs well on a flagship phone". Noticeably smarter than a 3B but warms the device. | PocketPal, MLC Chat (8 GB+ RAM) |
| Llama 3.1 8B Instruct · Meta | ~4.7 GB | Llama 3.1 license | The biggest "mainstream" model that fits. The best quality you'll get on a 2024+ flagship. | PocketPal, MLC Chat (iPhone 15 Pro+ / S24+) |
| Qwen 2.5 Coder 1.5B / 3B · Alibaba | ~1.0 / ~2.0 GB | Apache 2.0 | On-device code autocomplete in a Termux-based dev setup. Surprisingly capable. | PocketPal, Termux |
| DeepSeek-R1 Distill Qwen 1.5B · DeepSeek | ~1.1 GB | MIT | Yes — you can run a "reasoning" model on your phone. Slow (it thinks first) but smart. | PocketPal, MLC Chat |
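As a rule of thumb, the model file plus 1–2 GB of runtime headroom (KV cache, app overhead) has to fit in free RAM. A hypothetical picker using the Q4 sizes from the table above — not an API any of these apps expose:

```python
# Q4 file sizes (GB) copied from the table above.
MODELS = {
    "SmolLM2 1.7B": 1.0,
    "Gemma 2 2B": 1.5,
    "Llama 3.2 3B": 2.0,
    "Phi-3.5 Mini": 2.4,
    "Mistral 7B": 4.1,
    "Llama 3.1 8B": 4.7,
}

def largest_fit(free_ram_gb: float, headroom_gb: float = 1.5):
    """Biggest model whose weights plus KV-cache/app headroom fit in free RAM."""
    candidates = [(size, name) for name, size in MODELS.items()
                  if size + headroom_gb <= free_ram_gb]
    return max(candidates)[1] if candidates else None

print(largest_fit(4.0))  # ~4 GB free → Phi-3.5 Mini
print(largest_fit(6.5))  # flagship-class free RAM → Llama 3.1 8B
```

This is also why a "6 GB RAM" phone can't load an 8B model: the OS and other apps rarely leave more than 4–5 GB genuinely free.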
Beyond chat — phones can also run image generation, speech recognition, and text-to-speech fully offline. Expect slow generation but high privacy.
| Model / App | Platform | License | What it does | Notes |
|---|---|---|---|---|
| Draw Things · drawthings.ai | iOS / macOS | Freeware | Stable Diffusion / SDXL / Flux on-device | The reference iOS image-gen app. CoreML-accelerated; runs SDXL on an iPhone 15 Pro in ~30 s. |
| Local Diffusion / SD AI · various OSS ports | Android | GPL / MIT | Stable Diffusion 1.5 on-device | Snapdragon 8 Gen 3+ can run SD 1.5 in ~10 s using the QNN NPU path. |
| Whisper (small / base / tiny) · OpenAI, ggerganov/whisper.cpp | iOS + Android | MIT | Speech-to-text transcription | whisper.cpp runs on-device transcription at 2–5× realtime. The foundation of many voice-note apps. |
| Whisper large-v3-turbo · OpenAI (2024 release) | iOS + Android | MIT | Fast multilingual STT | A pruned decoder makes it several times faster than large-v3 at near-identical quality. Ideal for mobile dictation. |
| Piper TTS · rhasspy/piper | iOS + Android | MIT | Neural text-to-speech | ~50 MB voices, realtime synthesis on any ARM phone. Great for a private screen reader. |
| MeloTTS / XTTS-streaming · MyShell, Coqui | Android (experimental) | MIT / MPL | Higher-quality TTS / voice cloning | Runs, but slowly; more practical paired with a home server over Tailscale. |
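The "2–5× realtime" figure for whisper.cpp converts directly into wait time. A back-of-envelope sketch, with the factor as a rough mid-range assumption (it varies by model size and chip):

```python
def transcription_wait(audio_seconds: float, realtime_factor: float = 3.0) -> float:
    """Seconds whisper.cpp needs for a clip, assuming it chews through
    `realtime_factor` seconds of audio per wall-clock second."""
    return audio_seconds / realtime_factor

# A 10-minute voice note at ~3x realtime:
print(round(transcription_wait(600)))  # → 200 seconds
```

In other words, transcribing a long voice memo on-device is a "wait a few minutes" job, not an instant one — but it never touches a server.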
A quick, honest sizing guide. Assumes Q4 GGUF unless noted. Tok/s figures are approximate decode speed (prompt processing excluded) with typical sampling settings.
Smooth up to 1–3B models. Llama 3.2 1B at ~40 tok/s, Qwen 2.5 1.5B at ~25 tok/s. 7B models load but feel slow (<5 tok/s) and warm the phone fast.
The sweet spot. Llama 3.2 3B at ~25 tok/s, Mistral 7B at 8–12 tok/s via MLC-LLM Metal backend. Apple Neural Engine used heavily when the app supports CoreML.
Best Android tier. Runs Llama 3.1 8B at 6–10 tok/s, 3B at 25–35 tok/s. Snapdragon 8 Gen 3/4 QNN path unlocks NPU offload in supported apps.
Stick to <2B models. TinyLlama, SmolLM2 1.7B, Qwen 2.5 0.5B all work. Anything bigger will swap and crawl.
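To make the numbers above concrete, decode speed converts directly into wait time. A rough model that ignores prompt-processing time, which adds a few seconds up front:

```python
def wait_seconds(n_tokens: int, tok_per_s: float) -> float:
    """Seconds to generate n_tokens at a steady decode rate."""
    return n_tokens / tok_per_s

# A ~300-token answer (a few paragraphs):
print(round(wait_seconds(300, 25)))  # Llama 3.2 3B on a recent flagship → 12 s
print(round(wait_seconds(300, 8)))   # Mistral 7B at the slow end → 38 s
```

A 12-second paragraph feels conversational; a 40-second one feels like waiting — which is why the 3B class is called the sweet spot above.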
Open-source SDKs that let you embed a local LLM into your iOS or Android app. All free, all runnable fully offline.
| SDK | Platform | Language | License | Why pick it |
|---|---|---|---|---|
| llama.cpp · ggerganov/llama.cpp | iOS + Android | C / C++ (bindings for Swift, Kotlin, JS, Python) | MIT | The universal local-LLM engine. Prebuilt Metal (iOS) and Vulkan (Android) backends; active weekly releases. |
| MLC-LLM · mlc-ai/mlc-llm | iOS + Android | Swift / Kotlin / JS | Apache 2.0 | TVM-compiled models give the fastest GPU path on mobile. Great for apps that ship one fixed model. |
| ExecuTorch · pytorch/executorch | iOS + Android | C++ / Swift / Kotlin | BSD-3 | The PyTorch team's on-device runtime. Designed for Llama 3.2 mobile deployment with hardware-partitioned compute. |
| MediaPipe LLM Inference · google-ai-edge/mediapipe | iOS + Android | Swift / Kotlin / JS | Apache 2.0 | Google's drop-in API for running Gemma and friends. Simple, well documented, a great first SDK. |
| Apple CoreML + MLX · ml-explore/mlx-swift | iOS / macOS | Swift | MIT / Apple | Tightest Apple Silicon integration. MLX Swift runs MLX-quantized models on the GPU via Metal; CoreML adds Neural Engine offload. |
| ONNX Runtime Mobile · microsoft/onnxruntime | iOS + Android | C++ / Swift / Kotlin | MIT | If your model is already in ONNX, this is the fastest path. Especially strong for small classification / embedding models. |
| Qualcomm AI Hub · aihub.qualcomm.com | Android (Snapdragon) | Native | Free tier (models OSS) | Pre-optimised open-weight models targeting the Snapdragon NPU. Big speedups, but Snapdragon-only. |
The questions we're asked most often about running models on a phone.
**Does it really work offline?** Yes. Once you've downloaded the app and the model file, the whole inference loop runs on the phone's CPU / GPU / NPU. Airplane mode has zero impact on quality or speed — toggling it on is a common way to prove this to yourself.
**What does it do to battery life?** A 5-minute conversation with a 3B model uses roughly the same battery as 5 minutes of 4K video recording. Fine for occasional use; don't leave it generating in a loop.
**Is it actually private?** If the app is open source and requests no network permissions, yes — prompts never leave the device. Check the app's privacy label / manifest. Every app listed on this page is auditable.
**How much worse is it than ChatGPT?** You're running a ~3B-parameter model instead of a ~1T one, and the gap is real. For private chat, summarisation, quick code snippets, and tool-calling, on-device is already enough. For "teach me quantum physics from scratch", pair your phone with a home server running Ollama via Tailscale.
**Can I use my phone as a client for a bigger model at home?** Absolutely — and it's the best of both worlds. Run Ollama or Open WebUI on your desktop, install Tailscale, then use the Enchanted or Maid app on your phone. You get 70B-class quality with phone-class convenience and zero cloud dependency.
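A minimal sketch of what those client apps do under the hood — here as a Python stdlib client building a non-streaming request for Ollama's `/api/generate` endpoint. The hostname `my-desktop` is a hypothetical Tailscale MagicDNS name, and the model tag is just an example:

```python
import json
from urllib import request

# Hypothetical Tailscale MagicDNS name; Ollama listens on port 11434 by default.
OLLAMA_URL = "http://my-desktop:11434/api/generate"

def ask(prompt: str, model: str = "llama3.1:70b") -> request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return request.Request(OLLAMA_URL, data=body,
                           headers={"Content-Type": "application/json"})

req = ask("Explain quantum tunnelling simply.")
# answer = json.loads(request.urlopen(req).read())["response"]  # needs the server up
print(json.loads(req.data)["model"])  # → llama3.1:70b
```

Because Tailscale puts the phone and desktop on the same private network, nothing here is exposed to the public internet.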
**Why not just use Apple Intelligence or Gemini Nano?** Those are closed source, tied to specific OS versions, and limited in what they'll answer. The apps on this page are open-source alternatives that work on any phone, with any model you choose, and no usage limits.
Install an app, download a model, put the phone in airplane mode — you're running production-grade AI on a device in your pocket, with zero dependency on anyone else's servers.