Gemma 4 + Ollama: Run Google's #3 Open Model Free on Your Machine

Set up Gemma 4 locally with Ollama in under 10 minutes. Free, open-source, runs on 8GB+ RAM. Includes model size guide and OpenClaw integration.

Google just dropped Gemma 4 — and the timing couldn’t be better. Two days after launch, it’s already ranked #3 among open models globally on the Arena AI leaderboard. It runs locally on your own hardware. It costs nothing. And it arrived the exact same week Anthropic cut OpenClaw users off from Claude subscriptions.

The developer reaction has been intense. One post with 2,400+ likes told everyone to “DROP EVERYTHING” and run ollama run gemma4. The creator of llama.cpp showed it hitting 300 tokens per second on a three-year-old Mac Studio. And multiple developers are already running OpenClaw with Gemma 4 on MacBook Airs — completely free.

Coincidence? Sure. But a useful one.

Here’s how to get Gemma 4 running on your machine with Ollama in under 10 minutes — plus which model size to pick for your hardware, how to connect it to OpenClaw, and where it actually falls short compared to cloud APIs.


What Is Gemma 4?

Gemma 4 is Google DeepMind’s open-source AI model family, built from the same research that powers Gemini 3. It launched April 2-3, 2026 under the Apache 2.0 license — which means you can use it for anything, including commercial projects, with zero licensing fees.

Four sizes are available:

| Model | Parameters | Best For | Min RAM |
|---|---|---|---|
| Gemma 4 E2B | ~2B | Mobile, IoT, edge devices | 4GB |
| Gemma 4 E4B | ~4B (default) | Laptops, quick tasks | 8GB |
| Gemma 4 26B MoE | 26B (3.8B active) | Sweet spot — quality of a 13B model, speed of a 4B | 20GB |
| Gemma 4 31B Dense | 31B | Maximum quality, serious hardware | 24GB+ |

The “MoE” in the 26B model stands for Mixture of Experts — it has 26 billion parameters total, but only activates 3.8 billion per response. So you get near-13B quality at 4B speed. That’s the one most people should try first.
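A quick back-of-envelope check on those numbers (illustrative arithmetic, not a benchmark): weight memory scales with the *total* parameter count, while per-token compute scales with the *active* one. The ~4.5 bits-per-weight figure for Q4_K_M is an assumption, not an official spec.

```python
# Back-of-envelope math for the MoE tradeoff described above.
# Memory scales with TOTAL parameters; per-token compute with ACTIVE ones.

def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB at a given quantization level."""
    return params_billions * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB

total_b, active_b = 26.0, 3.8

# Assuming Q4_K_M averages roughly 4.5 bits per weight:
print(f"26B MoE at ~4.5 bpw: ~{approx_size_gb(total_b, 4.5):.1f} GB of weights")
print(f"Per-token compute vs a 26B dense model: {active_b / total_b:.0%}")
```

That 15%-ish compute ratio is why the MoE decodes at roughly 4B-class speed while still paying 26B-class memory costs.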

The 31B Dense model is the powerhouse. It outperforms Qwen 3.5 27B on MMLU Pro (85.2%) and hits a Codeforces ELO of 2150 — competitive with models twice its size. But it needs a beefy GPU.

All models support structured tool use out of the box and work with agent frameworks like OpenClaw; the E2B and E4B variants can also process images and audio.
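To make "structured tool use" concrete, here's a hedged sketch of what a tool-calling request to Ollama's `/api/chat` endpoint looks like. The `get_weather` tool and its schema are invented for illustration; whether a given Gemma 4 tag emits tool calls is up to the model itself.

```python
import json

# Sketch of a tool-use request for Ollama's /api/chat endpoint.
# The get_weather tool and its schema are made-up examples.
payload = {
    "model": "gemma4:26b",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "stream": False,
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

body = json.dumps(payload)  # POST this to http://127.0.0.1:11434/api/chat
```

If the model decides to call the tool, the response carries a `tool_calls` entry instead of plain text; your code runs the function and feeds the result back as a follow-up message.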


What You Need (Hardware Guide)

Before you install anything, check if your machine can handle it. Here’s the honest breakdown:

Gemma 4 E4B (Default — 8GB+)

This runs on basically any modern laptop. MacBook Air with 8GB? Fine. A gaming PC from 2020? Fine. It won’t blow you away on complex coding tasks, but it handles quick questions, code review, and simple generation surprisingly well.

Gemma 4 26B MoE (The Sweet Spot — 20GB+)

If you have a MacBook Pro with 16GB of unified memory, you can run the quantized version (Q4_K_M cuts memory by ~55-60%). Apple Silicon is genuinely underrated for running models in this size range.

On the GPU side, an NVIDIA RTX 3070/4070 with 12GB VRAM can handle the quantized version. 16GB VRAM (RTX 4080, A4000) runs it comfortably.

Gemma 4 31B Dense (Maximum Quality — 24GB+)

You need serious hardware here. NVIDIA RTX 3090/4090 with 24GB VRAM, or an Apple Silicon Mac with 32GB+ unified memory. Quantized (Q4) brings it down to around 18-20GB, which squeezes onto a 24GB card.

Real-world results from developers who’ve been testing: the Hugging Face CEO shared that 24GB gets you the 26B MoE (Q4_K_M quantization), while 16GB handles the E4B at full quality (Q8). One developer ran the 26B model on an A17 Pro chip with just 8GB at ~7 tokens/second. Not fast, but functional.

Rule of thumb: Your available memory should exceed the quantized model size. When in doubt, start with E4B and work up.
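That rule of thumb can be written down directly. A toy picker in Python, using the approximate quantized download sizes this guide quotes and an assumed ~4GB of headroom for the KV cache and OS (both numbers are rough, not official requirements):

```python
# Rough model picker following the rule of thumb above: available memory
# should exceed the quantized model size, with headroom for the KV cache.
# Sizes are the approximate quantized downloads quoted in this guide (GB).
QUANTIZED_GB = {"gemma4": 3.0, "gemma4:26b": 16.0, "gemma4:31b": 20.0}

def pick_model(free_memory_gb: float, headroom_gb: float = 4.0) -> str:
    """Return the largest tag whose weights plus headroom fit in memory."""
    fits = [tag for tag, size in QUANTIZED_GB.items()
            if size + headroom_gb <= free_memory_gb]
    # Fall back to the smallest variant when nothing fits comfortably
    return max(fits, key=QUANTIZED_GB.get) if fits else "gemma4:e2b"

print(pick_model(8))    # gemma4
print(pick_model(22))   # gemma4:26b
print(pick_model(32))   # gemma4:31b
```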


Step 1: Install Ollama

Ollama is the easiest way to run local AI models. One install, one command to pull a model, and you’re running.

Mac:

# Download from ollama.com, or:
brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com and run the .exe.

After installing, start the Ollama service:

ollama serve

It runs in the background on http://127.0.0.1:11434.
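Under the hood, the CLI talks to this same HTTP endpoint. If you'd rather script against it, here's a minimal stdlib-only Python sketch of a non-streaming generate call (the helper names are mine; it assumes the server is running and a Gemma 4 model has been pulled):

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "gemma4",
                  host: str = "http://127.0.0.1:11434") -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        f"{host}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )

def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the model's reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]

# Live call (requires `ollama serve` running):
# print(generate("Explain Mixture of Experts in one sentence."))
```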


Step 2: Pull Gemma 4

One command. That’s it.

# Default (E4B — good for most people)
ollama pull gemma4

# 26B MoE (better quality, needs 20GB+)
ollama pull gemma4:26b

# 31B Dense (best quality, needs 24GB+)
ollama pull gemma4:31b

# E2B (smallest, for very limited hardware)
ollama pull gemma4:e2b

The download takes a few minutes depending on your connection. The E4B model is about 3GB quantized. The 26B MoE is around 16GB. The 31B Dense is roughly 20GB.


Step 3: Run It

ollama run gemma4

That’s it. You’re now chatting with Gemma 4 in your terminal. Type a question, get an answer. No API key. No billing. No signup.

Try something practical:

> Review this Python function for bugs and suggest improvements:
> def calculate_total(items):
>     total = 0
>     for item in items:
>         total += item.price * item.quantity
>     return total

Gemma 4 handles code review, debugging, refactoring suggestions, and general coding questions well — especially the 26B and 31B variants.


Step 4: Connect It to OpenClaw (Optional)

This is where it gets interesting — especially if you just lost your Claude subscription access.

OpenClaw has a bundled Ollama provider plugin. It detects your local Ollama instance automatically at http://127.0.0.1:11434.

To set it up:

  1. Make sure Ollama is running (ollama serve)
  2. Make sure you’ve pulled a Gemma 4 model
  3. In OpenClaw, set your model: /model ollama/gemma4:26b

Or configure it in your OpenClaw settings:

{
  "provider": "ollama",
  "model": "gemma4:26b",
  "baseUrl": "http://127.0.0.1:11434"
}

Now OpenClaw uses your local Gemma 4 instead of a cloud API. Zero cost per token. Complete privacy — nothing leaves your machine.
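If OpenClaw can't find your model, it's worth checking what the Ollama server has actually pulled. A small sketch against Ollama's `/api/tags` endpoint, which lists installed models (the helper names are mine):

```python
import json
import urllib.request

def installed_models(host: str = "http://127.0.0.1:11434") -> list:
    """Return the model tags the local Ollama server has pulled."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]

def has_gemma4(tags: list) -> bool:
    """True if any pulled model looks like a Gemma 4 variant."""
    return any(t.startswith("gemma4") for t in tags)

# Live check (requires `ollama serve` running):
# print(has_gemma4(installed_models()))
```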

Developers are already doing this. Multiple posts show OpenClaw running with Gemma 4 on a MacBook Air M4 with 16GB — a $1,199 laptop running a free AI coding agent. One video titled “Google Just Made OpenClaw Free” walked through the entire setup and already has over 1,100 likes.

The tradeoff: Gemma 4 is good. It’s not Claude Sonnet or GPT-5 good. For complex multi-file refactoring, architectural decisions, or debugging subtle concurrency issues, you’ll notice the quality gap. But for straightforward coding tasks — generating boilerplate, writing tests, reviewing code, explaining errors — it gets the job done.


How Good Is It, Actually?

Let’s be honest about where Gemma 4 excels and where it doesn’t.

Where Gemma 4 shines:

  • Code review and refactoring — The 26B/31B models catch bugs, suggest improvements, and explain code clearly
  • Structured output — Native function calling with JSON schema support, no prompt tricks needed
  • Speed — The 26B MoE runs nearly as fast as a 4B model because only 3.8B parameters activate per token
  • Privacy — Everything stays on your machine. No data sent anywhere.
  • Cost — Free. Forever. Apache 2.0 license means no usage limits, no rate limiting, no surprise bills
  • Multimodal — E2B and E4B can process images and audio alongside text

Where it falls short:

  • Complex reasoning — Claude Opus and GPT-5 still win on multi-step logic and nuanced architectural decisions
  • Long context — Cloud models handle 100K+ token contexts. Local models on consumer hardware struggle past 8-16K
  • Speed on large tasks — Generating 500+ lines of code is noticeably slower than a cloud API
  • Latest knowledge — Training data cutoff means it won’t know about the newest frameworks or APIs

Quick benchmark comparison:

| Metric | Gemma 4 31B | Claude Sonnet 4.6 | GPT-5.4 |
|---|---|---|---|
| MMLU Pro | 85.2% | ~90%+ | ~88%+ |
| Codeforces ELO | 2150 | Higher | Higher |
| Cost/month | $0 | $9-30 (API) | $15-25 (API) |
| Privacy | Complete | Cloud-based | Cloud-based |
| Speed (local) | Depends on hardware | Fast (cloud) | Fast (cloud) |
| Context window | 8-32K (hardware limited) | 200K | 128K |

Early community benchmarks are interesting. One developer testing on dual 3090s found that “Gemma 26b > Qwen 35b” but “Qwen 27b > Gemma 31b” — meaning the MoE architecture punches well above its weight class. Another threw 4 UI screenshots at all three Gemma 4 sizes and asked them to rebuild the designs from scratch with no hand-holding — the results impressed enough to get 700+ likes.

The 31B model is roughly 85-90% as good as cloud models for everyday coding tasks. That last 10-15% matters for complex work, but for most developers most of the time, it’s more than enough.


Pro Tips for Better Performance

1. Use the right quantization. Q4_K_M is the sweet spot for most people — good quality, reasonable memory use. Q8 is higher quality but needs more RAM. Q2 is tiny but noticeably worse.

2. Pre-load the model. Cold starts take 5-10 seconds. Keep the model loaded:

# Keep model loaded for 30 minutes of inactivity
ollama run gemma4 --keepalive 30m
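If you drive Ollama through its HTTP API instead of the CLI, the same behavior is available per request via the `keep_alive` field: a generate request with no prompt loads the model into memory and keeps it resident. A sketch (the function name is mine):

```python
import json
import urllib.request

OLLAMA = "http://127.0.0.1:11434"

def preload_request(model: str = "gemma4",
                    keep_alive: str = "30m") -> urllib.request.Request:
    """Build an /api/generate request with no prompt: Ollama loads the
    model into memory and keeps it resident for `keep_alive`."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode()
    return urllib.request.Request(
        f"{OLLAMA}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )

# urllib.request.urlopen(preload_request("gemma4:26b"))  # needs a running server
```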

3. Set it to auto-start. On Mac, use a LaunchAgent. On Linux, a systemd service. You want Ollama running whenever your machine is on.

4. Use the 26B MoE, not the 31B Dense. Unless you have 32GB+ RAM and need maximum quality, the 26B gives you 90% of the performance at half the memory. The Mixture of Experts architecture is genuinely clever.

5. Pair with a cloud model. The smart setup: use Gemma 4 locally for quick tasks (code review, simple generation, explanations) and switch to a cloud API for heavy lifting (complex debugging, full-project refactoring). OpenClaw makes switching models easy — just /model to toggle.


The Bigger Picture: Why This Matters

Two things happened this week that, combined, change how developers can work with AI:

  1. Anthropic cut OpenClaw users off from subsidized Claude access
  2. Google released a genuinely capable open model you can run for free

The message is clear: the era of cheap cloud AI subsidies is ending. But the era of capable local AI is beginning.

A year ago, running an AI model locally meant dealing with mediocre quality, complicated setup, and constant compatibility issues. Today, it’s ollama pull gemma4 and you’re done. The model ranks #3 globally among open models. It runs on a MacBook Air. And it does honest, useful work.

You don’t have to choose one or the other. The developers who’ll get the most from AI in 2026 are the ones who mix local and cloud models based on the task — using free local inference for 80% of their work and paying for cloud APIs only when they need that extra quality boost.

Gemma 4 makes the “local” part of that equation actually viable.

