The Case for Local AI Models
NativeLLM Dev Meetup, Buenos Aires, July 2025
For the last couple of years, I’ve been quietly obsessed with small, local AI models. Not the trillion-parameter behemoths we access through the cloud—but lean, fast, efficient models that run on devices you already own. At the recent NativeLLM Dev Meetup in Buenos Aires, I finally had the chance to share that obsession.
This post is a recap and expansion of that quick 15-minute talk—why I believe local models matter, how far we’ve come, and what this means for the future of AI development.
Cloud AI Is Overkill for Most Use Cases
We’ve grown too comfortable reaching for curl to interact with remote LLM APIs. And sure, the results are impressive—but for many tasks, it’s like firing up a rocket to cross the street.
Summarizing a paragraph, parsing a JSON payload, classifying an email—these don’t need GPT-4-level compute. Yet every API call routes through cloud infrastructure powered by 1T+ parameter models.
That’s massive overkill.
A Simple Premise: You Don’t Need the Cloud
Let’s start with the basic challenge that shaped this talk:
The assumption that great AI experiences require remote infrastructure is outdated. Local models can deliver excellent performance, and they come with serious advantages:
- Latency: No network hop. Local inference is instant.
- Cost: Once downloaded, inference is free.
- Privacy: Your data never leaves the device.
- Simplicity: No infra, no API keys, no vendor dependencies.
- Capability: For many real-world reasoning tasks, local models are already “good enough.”
You can’t beat the speed of light—and you shouldn’t have to pay per token for tasks your phone can handle in milliseconds.
Distillation: Small Models, Big Brains
So how can small models compete? Part of the answer lies in distillation—a process that transfers the “knowledge” of a large model (like GPT-4) into a smaller, more efficient one.
DeepSeek-R1, for example, has been distilled into variants as small as 7 billion parameters that retain much of its o1-class reasoning. And the results are surprising: solid performance on tasks like reasoning, summarization, and entity extraction. It’s not about matching the biggest models—it’s about doing enough, fast and locally.
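If “distillation” sounds abstract, here is a toy numeric sketch of the soft-target objective most distillation recipes build on: the student is trained to match the teacher’s temperature-softened output distribution. Everything below (the four-token vocabulary, the logits, the temperature) is made up purely for illustration.

import Foundation

// Softmax with a temperature: higher T makes the distribution "softer",
// exposing more of the teacher's relative preferences between tokens.
func softmax(_ logits: [Double], temperature: Double) -> [Double] {
    let scaled = logits.map { $0 / temperature }
    let maxVal = scaled.max() ?? 0
    let exps = scaled.map { exp($0 - maxVal) }
    let sum = exps.reduce(0.0, +)
    return exps.map { $0 / sum }
}

// KL(teacher || student): how far the student's distribution is from the teacher's.
// Minimizing this over lots of text is the classic knowledge-distillation loss.
func klDivergence(teacher: [Double], student: [Double]) -> Double {
    zip(teacher, student).reduce(0.0) { $0 + $1.0 * log($1.0 / $1.1) }
}

let teacherLogits = [4.1, 1.2, 0.3, -2.0]   // big model's raw scores for 4 tokens
let studentLogits = [3.0, 1.5, 0.1, -1.0]   // small model's scores for the same tokens
let T = 2.0                                  // temperature used to soften both sides

let loss = klDivergence(
    teacher: softmax(teacherLogits, temperature: T),
    student: softmax(studentLogits, temperature: T)
)
print("distillation loss:", loss)            // the gradient of this loss updates the student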
Demonstrating Capability
To illustrate this, I showed prompts around:
- JSON parsing
- Summarization
- Multi-step reasoning
- Entity recognition
The output quality from local models like DeepSeek-7B was close enough to the cloud that, in most contexts, the difference was negligible—especially once you factor in the zero latency, zero cost, and full control you get in return.
Hardware Is Already There
This shift to local AI isn't just about model architecture. It’s about the hardware evolution that’s made it possible.
Take Apple Silicon as an example:
- The M4 Max puts a 40-core GPU in a laptop, and M4-class silicon now ships in iPads too.
- iPhone 16 Pro includes a 6-core CPU and a 16-core Neural Engine capable of 35 trillion operations per second.
- Unified memory eliminates unnecessary data copies between the CPU, GPU, and Neural Engine.
- 546 GB/s of memory bandwidth on the M4 Max, an enormous amount for a laptop chip.
- Apple’s on-device Foundation Model (~3B parameters) runs natively with no setup, no downloads, and is shared across apps.
These aren’t theoretical claims—these are shipping devices. The compute is already in your pocket or on your desk.
The Local AI Stack
The software landscape has evolved rapidly to match this new wave of hardware:
Ollama
The simplest way to get started. Run LLMs like gemma3n or deepseek-7b locally with a single command. Ollama provides a local OpenAI-compatible API, handles model downloads and warmup, and runs completely offline.
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma3n",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello!" }
]
}'
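And if you are calling it from an app instead of a shell, the same OpenAI-compatible endpoint works with nothing more than URLSession. A minimal sketch, assuming Ollama is already running and the model tag below has been pulled:

import Foundation

// Minimal chat call against Ollama's local OpenAI-compatible endpoint.
// Assumes `ollama serve` is running and the model tag has been pulled locally.
struct ChatMessage: Codable { let role: String; let content: String }
struct ChatRequest: Codable { let model: String; let messages: [ChatMessage] }
struct ChatResponse: Codable {
    struct Choice: Codable { let message: ChatMessage }
    let choices: [Choice]
}

func chat(_ prompt: String, model: String = "gemma3n") async throws -> String {
    var request = URLRequest(url: URL(string: "http://localhost:11434/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(ChatRequest(model: model, messages: [
        ChatMessage(role: "system", content: "You are a helpful assistant."),
        ChatMessage(role: "user", content: prompt)
    ]))
    let (data, _) = try await URLSession.shared.data(for: request)
    let decoded = try JSONDecoder().decode(ChatResponse.self, from: data)
    return decoded.choices.first?.message.content ?? ""
}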
I’ve run these models on planes—no internet, no problem.
Apple Foundation Models
At WWDC, Apple dropped a surprise: native, system-level 3B language models baked into iOS and macOS. No downloads, shared across apps, instant inference.
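The Swift entry point, as I understand the new FoundationModels framework, is a LanguageModelSession. A minimal sketch—the names below are from the WWDC sessions as I recall them, so double-check against the current documentation:

import FoundationModels

// Ask the built-in ~3B model for a summary. No download, no API key.
// API names are approximate; verify against the shipping FoundationModels docs.
func summarize(_ text: String) async throws -> String {
    // The model can be unavailable (older hardware, Apple Intelligence disabled, etc.).
    guard case .available = SystemLanguageModel.default.availability else {
        return "On-device model not available on this device."
    }
    let session = LanguageModelSession(instructions: "Summarize the user's text in two sentences.")
    let response = try await session.respond(to: text)
    return response.content
}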
They also shipped SpeechAnalyzer, a built-in speech-to-text framework comparable to Whisper—again, fully local.
MLX
For those who want low-level access and speed, MLX is Apple’s open-source, Metal-accelerated machine-learning framework for Apple silicon. It supports quantized weights, fine-tuned models, embeddings, and more. I built a CLI tool using MLX that does full inference and retrieval on-device using Qwen3-4B-4bit.
Local Models Enable New User Experiences
With zero marginal cost, we’re entering a new era of UX. Here are a few projects I’ve been working on:
Sleep Coach
On-device coaching app that analyzes your sleep patterns and speaks to you like a real assistant. No backend, no cloud—just your device, HealthKit and the Foundation Model.
Rebound Browser
A browser that turns your web browsing into AI-embedded memory. It indexes, embeds, and stores data locally—so you can ask later, “What did I read about vector databases last week?” via voice commands. Works offline, and keeps your data private.
Rebound Assistant
A local speech-driven companion:
- Speech-to-text via SpeechAnalyzer
- Contextual embeddings and inference using Foundation Model
- Slack and clipboard integration
- All local, all private
Why Local Wins
There’s a deeper shift happening here. When inference is free, fast, and local:
- You design differently.
- You build differently.
- You think beyond server costs and API quotas.
And with open licensing (e.g., DeepSeek under MIT), you’re free to build, ship, and iterate without constraints.
This Has All Happened Before
We’ve seen this before.
- Mainframes gave way to personal computers.
- The desktop web gave way to mobile-first.
- Now: Cloud AI is giving way to edge-first intelligence.
History repeats. Local models aren’t just a technical curiosity—they’re the next frontier in building software.
Final Thoughts
Local models offer comparable performance for most real-world use cases. The hardware is here. The tooling is here. And the opportunity to design new, unconstrained experiences is wide open.
If you start exploring now, you’ll be ahead of the curve.
Thanks for reading.