Gemma 4 31B is now generally available

Developer Relations

Gemma 4 31B is the latest open-weights model from Google DeepMind, and a meaningful step up from Gemma 3. It's worth considering if you need a capable model for general reasoning, Nordic-language applications, or vision tasks — and want to keep inference costs predictable.
We made it available in early access for all users the morning after Google's announcement, and today we're making it generally available at €0.25/Mtok input and €0.50/Mtok output.
Try it now:
curl https://api.berget.ai/v1/chat/completions \
-H "Authorization: Bearer $BERGET_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-31B-it",
"messages": [{"role": "user", "content": "Your prompt here"}]
}'
What you need to know
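The same call from Python, using only the standard library. This is a minimal sketch: it assumes the endpoint is OpenAI-compatible (as the /v1/chat/completions path suggests) and that BERGET_API_KEY is set in your environment.

```python
import json
import os
import urllib.request

API_URL = "https://api.berget.ai/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "google/gemma-4-31B-it") -> dict:
    """Build the same JSON payload as the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['BERGET_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Call it with `chat("Your prompt here")`; because the API follows the OpenAI response shape, the reply text lives at `choices[0].message.content`.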
Benchmark performance that holds up to independent testing
- Google's headline numbers for Gemma 4 31B are large: 89.2% on AIME 2026, 80% on LiveCodeBench v6, 84.3% on GPQA Diamond — all in reasoning/thinking mode.
- Artificial Analysis later ran their own independent Intelligence Index evaluation and landed at 86% on GPQA Diamond, 76% on IFBench, and an overall index score of 39 — a 29-point jump over Gemma 3 27B.
- Arena.ai placed it at #3 among all open models on text tasks. The gap to the previous generation is real.
For coding specifically, the model's Codeforces Elo sits around 2,150; for comparison, Gemma 3 27B was at ~110.
The strongest open model for Nordic languages
EuroEval has evaluated Gemma 4 31B across Swedish, Norwegian, Danish, and Finnish. It ranks #1 among open-weights models in all four languages, with #1 overall in Finnish, ahead of all closed API models including Gemini Pro Preview.
| Language | Overall rank | Score | Open-weights rank |
|---|---|---|---|
| Finnish | #1 | 1.33 | #1 |
| Danish | #2 | 1.16 | #1 |
| Norwegian | #3 | 1.33 | #1 |
| Swedish | #3 | 1.22 | #1 |
The only models ranked above it in any language are closed API models (Gemini Pro Preview and Gemini Flash variants). If you're building for a Nordic-language audience, we highly recommend that you check this one out.
Efficient token usage for reasoning tasks
Artificial Analysis measured how many output tokens each model generated to complete their full Intelligence Index. Fewer output tokens for the same result means lower cost on reasoning-heavy workloads. The four models closest to Gemma 4 31B in overall intelligence all score 3 points higher (42 vs 39), but use significantly more tokens to get there:
| Model | Intelligence Index | Output tokens used |
|---|---|---|
| Gemma 4 31B | 39 | 39M |
| MiniMax-M2.5 | 42 | 56M |
| DeepSeek V3.2 (Reasoning) | 42 | 61M |
| Qwen3.5 27B (Reasoning) | 42 | 98M |
| GLM-4.7 | 42 | 167M |
What Gemma 4 31B offers is a substantially lower token spend for general reasoning workloads, such as document analysis, Q&A, summarisation, and code explanation. If your workload is primarily agentic (multi-step tool use, complex coding pipelines), GLM-4.7 is the stronger choice. But if you're optimising for cost on general reasoning tasks, Gemma 4 31B is hard to beat at this intelligence level.
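To make the token numbers concrete, here is the arithmetic as a short sketch. It applies the €0.50/Mtok output price quoted above to the token counts in the table; this is illustrative only, since the other models are not served at Berget's Gemma pricing, and input tokens are ignored for simplicity.

```python
OUTPUT_PRICE_EUR_PER_MTOK = 0.50  # Berget's Gemma 4 31B output price, from this post

def output_cost_eur(output_tokens_millions: float) -> float:
    """Output-token cost in EUR at the illustrative flat rate above."""
    return output_tokens_millions * OUTPUT_PRICE_EUR_PER_MTOK

# Output spend to run the full Intelligence Index, per the table above:
gemma_cost = output_cost_eur(39)   # Gemma 4 31B: 39M tokens -> EUR 19.50
glm_cost = output_cost_eur(167)    # GLM-4.7: 167M tokens -> EUR 83.50
```

At the same per-token rate, GLM-4.7's 167M tokens cost over four times as much as Gemma 4 31B's 39M for a 3-point index gain.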
What to look out for
During our initial testing, we found that Gemma 4 31B handles function calling well overall but struggles with longer chains of tool calls. We've seen substantial gains from recent upstream improvements in vLLM, but we still recommend testing your own pipeline if you anticipate heavy tool use.
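One way to exercise chained tool calls before committing to the model is a small local harness. The sketch below assumes the OpenAI-style tool-call response shape (`tool_calls[i].function.name` / `.arguments`, results fed back as `role: "tool"` messages); the tools themselves are hypothetical stand-ins.

```python
import json

# Hypothetical local tools for exercising chained tool calls.
TOOLS = {
    "get_weather": lambda args: {"temp_c": 7.0, "city": args["city"]},
    "convert_temp": lambda args: {"temp_f": args["temp_c"] * 9 / 5 + 32},
}

def run_tool_call(call: dict) -> dict:
    """Execute one tool call as it appears in an OpenAI-style response."""
    fn = call["function"]
    args = json.loads(fn["arguments"])
    return TOOLS[fn["name"]](args)

def run_chain(calls: list[dict]) -> list[dict]:
    """Run a chain of tool calls, producing the tool-role messages you
    would append to the conversation before the next model turn."""
    messages = []
    for call in calls:
        result = run_tool_call(call)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages
```

Drive this with the `tool_calls` the model actually emits over several turns; if the chain stalls or the model stops requesting the next tool, you've found the limitation we describe above.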
Learn more
For more details, see the following resources:
- Google model card — official benchmark table and architecture details
- Hugging Face model page — weights and community discussion
- Artificial Analysis evaluation — independent benchmark breakdown and token efficiency analysis
- EuroEval leaderboards — Nordic and European language results
Join us on Matrix to discuss Gemma 4 31B and other models with the team and other community members.