Gemma 4

Google DeepMind · Released April 2, 2026 (12B Unified: June 3, 2026)

8.1/10 Overall Rating

What It Actually Is

Most AI model launches give you one model and one decision: use it or don’t. Gemma 4 gives you five models and asks a different question: what hardware do you have?

That might sound like marketing, but it’s actually the most useful thing about this family. Each member is architecturally different, not just a scaled-up copy of the same thing. The edge models use Per-Layer Embeddings. The 12B drops separate vision and audio encoders entirely. The 26B routes each token through 2 of its 128 experts. The 31B just throws all 30.7 billion parameters at every token. Same family, different engineering philosophies, different trade-offs.

Let’s walk through them.

E2B — The Pocket AI (~1 GB RAM)

The smallest Gemma 4. Two billion parameters, quantized down to fit in about 1 GB of RAM. It handles text, images, and live audio, all on-device and offline. It scores 37.5% on AIME 2026, a competition-math benchmark, on a model that could run on a Raspberry Pi. The secret sauce is Per-Layer Embeddings (PLE), which gives each decoder layer its own dedicated embedding to maximize intelligence without bloating parameter count. You won’t mistake it for a desktop model, but for quick translations, photo questions, or voice queries on a budget phone, it’s genuinely useful.
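To make the PLE idea concrete, here is a toy sketch of "each decoder layer gets its own embedding". It is an illustration under assumptions, not Gemma 4's implementation: the function name, the table shapes, and the choice to simply add the per-layer vector to the hidden state are all hypothetical, and the attention and MLP work of a real decoder layer is left out.

```python
def forward_with_ple(token_ids, shared_table, per_layer_tables):
    """Toy forward pass in which every layer has its own embedding table.

    shared_table: one embedding vector per vocabulary entry (input embedding).
    per_layer_tables: one table per layer, each with one vector per vocabulary entry.
    """
    # Start from the ordinary input embedding of each token.
    hidden = [list(shared_table[t]) for t in token_ids]
    for table in per_layer_tables:
        # Assumption: the layer looks up its own vector for each token and adds it.
        hidden = [
            [h + e for h, e in zip(vec, table[t])]
            for vec, t in zip(hidden, token_ids)
        ]
        # A real decoder layer would run attention and an MLP here.
    return hidden
```

The point of the sketch is only that the per-layer tables are plain lookups indexed by token id, so they add capacity without adding to the matrix multiplications each layer performs.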

E4B — The Phone Powerhouse (4–6 GB RAM)

The E4B is what happens when you give a phone-optimized model enough parameters to actually think. It scores 42.5% on AIME 2026, more than double the score of Gemma 3’s much larger 27B model. It handles text, images, and audio natively, has a 128K context window, and includes a configurable thinking mode for multi-step reasoning. If you have a modern flagship phone with 8+ GB of RAM, this is the model that makes “I’ll just ask my phone, offline” a serious option instead of a party trick.

12B Unified — The Laptop Game-Changer (~7 GB VRAM with QAT)

This is where Gemma 4 gets exciting for most people. Released June 3, 2026, the 12B Unified does something no other model its size does: it handles text, images, and audio in a single decoder-only transformer with no separate encoders. Raw image patches and audio waveforms go straight into the embedding space through lightweight linear layers. Simpler architecture, lower latency, easier fine-tuning.
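A rough sketch of what "raw patches through a lightweight linear layer" can look like, assuming a grayscale image stored as a list of rows. The patch size, the weight matrix, and the function names are hypothetical placeholders; the real model's patching scheme and projection sizes are not given in this review.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return tiles


def linear(vec, weight, bias):
    """Apply one linear layer: weight has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def embed_image(image, patch, weight, bias):
    """Map raw pixels to one embedding per patch, with no vision encoder in between."""
    return [linear(tile, weight, bias) for tile in patchify(image, patch)]
```

Each patch becomes one vector in the same space as the text token embeddings. An audio waveform could be handled the same way by slicing it into fixed-length frames instead of square tiles.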

The numbers: 77.2% MMLU Pro, 77.5% AIME 2026, 72.0% LiveCodeBench, 78.8% GPQA Diamond. Google says it approaches the 26B MoE “at less than half the total memory footprint.” With the official QAT (Quantization-Aware Training) variant released June 5, the Q4_0 version needs roughly 6.7 GB of VRAM. Pair that with Multi-Token Prediction for speculative decoding, and community benchmarks show 100–130+ tokens per second on a 12 GB GPU like the RTX 4070 Super. It even runs on laptops with 16 GB unified memory — no dedicated GPU required.
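For readers wondering why speculative decoding speeds things up, here is a generic greedy draft-and-verify loop. It is a sketch, not Gemma 4's MTP code: `target_next` and `draft_next` are hypothetical stand-ins for the full model and for a cheap drafter (the role the MTP heads would play), and each takes a token list and returns the next token.

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy speculative decoding with k drafted tokens per round.

    Returns (tokens, verify_passes). verify_passes counts verification rounds;
    a real system would run each round as one batched pass of the target model.
    """
    tokens = list(prompt)
    verify_passes = 0
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft k tokens with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Verify the draft against the target model, left to right.
        verify_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
        else:
            # Every drafted token matched, so the pass also yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new], verify_passes
```

Because every accepted token is checked against the target, the output matches plain greedy decoding with the target alone. The speedup comes from needing fewer target passes whenever the drafter guesses right.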

If you want one model from this family and you have a laptop with a decent GPU, this is it.

26B MoE — The Efficiency Expert (15–18 GB VRAM quantized)

The 26B contains 26 billion total parameters, but here’s the trick: only 3.8 billion activate per token. A learned router selects 2 of 128 expert sub-networks for each token, delivering near-31B quality at dramatically lower compute cost. Think of it as having a building full of specialists and only calling the two you need for each question.
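The routing step can be sketched in a few lines. This is a minimal top-2 mixture-of-experts forward pass under assumptions: experts are plain Python callables, and the two selected scores are renormalized with a softmax over just those two. Whether Gemma 4 normalizes this way is not stated here, so treat that detail as hypothetical.

```python
import math


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected scores only (assumed normalization).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, router_logits, experts, k=2):
    """Run only the selected experts on x and blend their outputs by router weight."""
    out = [0.0] * len(x)
    for index, weight in top_k_route(router_logits, k):
        y = experts[index](x)  # the other experts are never called
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out
```

With 128 experts and k=2, only two expert networks run per token, which is how 26 billion total parameters can cost about 3.8 billion per token (roughly 15% of the total).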

It supports text, images, and video (the smaller models do not handle video), has a 256K context window, and ranks #6 among open models on Arena AI. The trade-off is VRAM: you need 15–18 GB quantized, which means an RTX 4090, an RTX 5060 Ti 16GB, or a Mac with 32 GB+ unified memory. If you have the hardware and want the best intelligence-per-watt ratio, this is your model.

31B Dense — The Uncompromising Giant (16–20 GB VRAM quantized)

No routing, no mixture of experts, no shortcuts. The 31B Dense fires all 30.7 billion parameters on every single token. It’s the quality ceiling of the Gemma 4 family — ranking #3 among all open models on Arena AI and scoring 89.2% on AIME 2026. Same modalities as the 26B (text, images, video), same 256K context window, but with maximum reasoning depth on every response.

The cost is compute. BF16 needs ~71 GB of VRAM (enterprise GPU territory). But quantized to INT4, it fits in 16–20 GB — manageable on a high-end consumer GPU. If you have the hardware and precision matters more than speed, this is the open model that gets closest to frontier cloud performance.
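The VRAM figures are easy to sanity-check with weights-only arithmetic. The sketch below assumes decimal gigabytes and ignores the KV cache, activations, and runtime overhead, so it gives a lower bound. The 4.5 bits per weight for Q4_0 comes from llama.cpp's block layout (32 four-bit weights plus one 16-bit scale per block); the function and table names are made up for this example.

```python
# Average storage cost per weight, in bits.
BITS_PER_WEIGHT = {
    "bf16": 16.0,
    "int4": 4.0,
    "q4_0": 4.5,  # (32 * 4 + 16) bits per block of 32 weights
}


def weight_memory_gb(params_billion, fmt):
    """Weights-only footprint in decimal GB for a model of the given size."""
    # billions of params * bits per weight / 8 bits per byte = billions of bytes
    return params_billion * BITS_PER_WEIGHT[fmt] / 8
```

For the 31B Dense (30.7 billion parameters) this gives about 61.4 GB in BF16 and about 15.4 GB in INT4. The quoted ~71 GB is higher than the weights alone, presumably because it includes runtime overhead, and the INT4 result sits just under the 16–20 GB range. Treating the 12B Unified as exactly 12 billion parameters (an assumption), Q4_0 works out to about 6.75 GB, close to the quoted 6.7 GB.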

Which one should you pick?

Here’s the honest cheat sheet:

Phone, offline, quick tasks → E4B (or E2B for very constrained devices)
Laptop, 8–12 GB GPU → 12B Unified with QAT
Laptop, 16 GB unified memory, no GPU → 12B Unified with QAT (slower but works)
Workstation, RTX 4090 / 32 GB Mac → 26B MoE (best quality-per-watt)
Server or high-end workstation → 31B Dense (maximum quality)
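If you would rather have the cheat sheet as code, here is one way to encode it. The thresholds are read off the figures quoted in this review (8+ GB phone RAM for E4B, 8 GB VRAM or 16 GB unified memory for the 12B, 16 GB VRAM or 32 GB unified memory for the 26B). The function name, the device labels, and the fallback to E4B for underpowered laptops are assumptions made for the sketch.

```python
def pick_gemma4(device, gpu_vram_gb=0, unified_memory_gb=0, phone_ram_gb=0):
    """Suggest a Gemma 4 variant from a coarse hardware description.

    device: one of "phone", "laptop", "workstation", "server".
    """
    if device not in ("phone", "laptop", "workstation", "server"):
        raise ValueError(f"unknown device type: {device!r}")
    if device == "phone":
        # E2B is the option for very constrained devices.
        return "E4B" if phone_ram_gb >= 8 else "E2B"
    if device == "server":
        return "31B Dense"
    if device == "workstation" and (gpu_vram_gb >= 16 or unified_memory_gb >= 32):
        return "26B MoE"
    if gpu_vram_gb >= 8 or unified_memory_gb >= 16:
        return "12B Unified (QAT)"
    # Assumed fallback: not enough memory for the 12B, so drop to an edge model.
    return "E4B"
```

For example, `pick_gemma4("laptop", unified_memory_gb=16)` returns the 12B Unified with QAT, matching the "slower but works" row above.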

All five share the Apache 2.0 license, support 140+ languages, and work with Ollama, llama.cpp, LM Studio, vLLM, and Google’s AI Edge toolkit. The family disagrees on architecture — but agrees on philosophy: serious AI that runs on your hardware.

Key Strengths

E2B: AI on a budget phone (1 GB RAM). The smallest family member fits quantized in ~1 GB of RAM. Text, images, and audio, all on-device, all offline. Scores 37.5% on AIME 2026, which would’ve been impressive for a desktop model two years ago. Uses Per-Layer Embeddings (PLE) to squeeze maximum intelligence from minimal parameters. Ideal for IoT, Raspberry Pi, and budget Android devices.
E4B: flagship phone AI (4–6 GB RAM). The sweet spot for mobile. Scores 42.5% on AIME 2026, more than double the score of Gemma 3’s 27B model. Handles text, images, and audio natively. 128K context window. Built-in thinking mode for complex reasoning. This is a genuinely capable AI assistant running entirely on your phone without internet. If you have a modern flagship, this is your model.
12B Unified: the laptop game-changer (~7 GB VRAM with QAT). The star of the family. Encoder-free architecture with no separate vision or audio encoders. One transformer handles text, images, and audio natively. QAT variant runs at ~6.7 GB VRAM (Q4_0), fitting a 12 GB RTX 4070 or a laptop with 16 GB unified memory. MTP speculative decoding delivers 100–130+ tok/s. Scores 77.2% MMLU Pro, 77.5% AIME 2026, 72.0% LiveCodeBench. Approaches the 26B MoE at less than half the memory.
26B MoE: workstation efficiency (15–18 GB VRAM quantized). 26 billion parameters total, but only 3.8 billion activate per token. A learned router picks 2 of 128 experts for each token, giving you near-31B quality at a fraction of the compute. Supports text, images, and video. 256K context. Ranks #6 among open models. Ideal for developers with an RTX 4090 or Mac with 32 GB who want the best quality-to-speed ratio.
31B Dense: the quality ceiling (16–20 GB VRAM quantized). Every one of 30.7B parameters fires on every token. No routing, no shortcuts, just maximum reasoning depth. Ranks #3 among open models. 89.2% on AIME 2026. Text, images, video. 256K context. If you have the VRAM (RTX 4090 or 64 GB Mac), this is the open model that gets closest to frontier cloud models.

Benchmark Snapshot

AIME 2026 (31B: 89.2%, 12B: 77.5%, E4B: 42.5%, E2B: 37.5%): Competitive math. Shows the clear quality ladder across the family, from phone-sized to server-class. The 12B hits serious math territory from a laptop.
MMLU Pro (12B: 77.2%): Professional knowledge reasoning. The 12B approaches the 26B MoE (~97% of its score) while using less than half the memory. Exceptional intelligence-per-parameter.
LiveCodeBench v6 (12B: 72.0%): Real-world coding evaluation. The 12B is a legitimately capable local coding assistant, strong enough for daily development work without cloud dependency.
GPQA Diamond (12B: 78.8%): Graduate-level science Q&A. Scores that would have been frontier-tier a year ago, running on consumer hardware with QAT quantization.
Arena AI (31B: #3, 26B MoE: #6 among open models): Crowd-sourced head-to-head comparison. The 31B is top-tier among open models; the 26B MoE comes within 1–2% of it at a fraction of the compute.
Codeforces ELO (12B: 1659): Competitive programming. Strong enough to solve non-trivial algorithmic problems locally. The 26B/31B score even higher.

Honest Limitations

Edge models trade depth for portability: E2B and E4B won’t match the 12B on complex reasoning, multi-step coding, or deep analysis. They’re optimized for quality-per-byte, not absolute quality. Great for quick tasks, not research.
12B needs a real GPU (or beefy laptop): Even with QAT, you need ~7 GB free VRAM for inference. That means a dedicated GPU (GTX 1080+ class) or a laptop with 16 GB+ unified memory. Integrated graphics won’t cut it for usable speeds.
26B/31B need serious hardware: Quantized, you’re looking at 15–20 GB VRAM. Unquantized (BF16), the 31B needs ~71 GB. These are workstation or high-end laptop models, not something for a budget setup.
No video on edge or 12B: Video understanding is only available on 26B and 31B. The smaller models handle text, images, and audio only.
Google tooling preferred: Best supported through MediaPipe, LiteRT, Google AI Edge SDK, and AI Studio. Ollama, llama.cpp, and LM Studio work fine, but expect the occasional rough edge versus the Google-optimized path.
Not designed for marathon sessions: Unlike frontier cloud models that run multi-day autonomous coding sprints, Gemma 4 is built for single and multi-turn inference — not sustained agentic marathons.

The Verdict: Gemma 4 is the most practical open model family released this year — not because any single model is the best at everything, but because there’s a Gemma for every situation. Building an offline phone assistant? E4B. Need a private coding companion on your laptop with a 12 GB GPU? The 12B Unified with QAT. Running a workstation with an RTX 4090 and want maximum quality per watt? The 26B MoE. Need absolute frontier-class open reasoning? The 31B Dense. The architectures are different (PLE, encoder-free, MoE, dense), the hardware requirements are different (1 GB to 71 GB), but they share the same license (Apache 2.0), the same multilingual support (140+), and the same philosophy: serious AI that runs on your hardware, not someone else’s cloud.