Gemma 4 31B is now generally available

Developer Relations

Gemma 4 31B is the latest open-weights model from Google DeepMind, and a meaningful step up from Gemma 3. It's worth considering if you need a capable model for general reasoning, Nordic-language applications, or vision tasks — and want to keep inference costs predictable.
We made it available in early access for all users the morning after Google's announcement, and today we're making it generally available at €0.25/Mtok input and €0.50/Mtok output.
Try it now:
curl https://api.berget.ai/v1/chat/completions \
-H "Authorization: Bearer $BERGET_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-4-31B-it",
"messages": [{"role": "user", "content": "Your prompt here"}]
}'
What you need to know
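The same call from Python, using only the standard library. This is a minimal sketch: it assumes the endpoint is OpenAI-compatible (as the /v1/chat/completions path suggests) and that BERGET_API_KEY is set in your environment.

```python
import json
import os
import urllib.request

API_URL = "https://api.berget.ai/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "google/gemma-4-31B-it") -> dict:
    """Build the same JSON payload as the curl example above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['BERGET_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Call it with `chat("Your prompt here")`; because the API follows the OpenAI response shape, the reply text lives at `choices[0].message.content`.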
Benchmark performance that holds up to independent testing
- Google's headline numbers for Gemma 4 31B are large: 89.2% on AIME 2026, 80% on LiveCodeBench v6, 84.3% on GPQA Diamond — all in reasoning/thinking mode.
- Artificial Analysis later ran their own independent Intelligence Index evaluation and landed at 86% on GPQA Diamond, 76% on IFBench, and an overall index score of 39 — a 29-point jump over Gemma 3 27B.
- Arena.ai placed it at #3 among all open models on text tasks. The gap to the previous generation is real.
For coding specifically, the model's Codeforces Elo sits around 2,150; for comparison, Gemma 3 27B was at ~110.
The strongest open model for Nordic languages
EuroEval has evaluated Gemma 4 31B across Swedish, Norwegian, Danish, and Finnish. It ranks #1 among open-weights models in all four languages, with #1 overall in Finnish, ahead of all closed API models including Gemini Pro Preview.
| Language | Overall rank | Score | Open-weights rank |
|---|---|---|---|
| Finnish | #1 | 1.33 | #1 |
| Danish | #2 | 1.16 | #1 |
| Norwegian | #3 | 1.33 | #1 |
| Swedish | #3 | 1.22 | #1 |
The only models ranked above it in any language are closed API models (Gemini Pro Preview and Gemini Flash variants). If you're building for a Nordic-language audience, we highly recommend that you check this one out.
Efficient token usage for reasoning tasks
Artificial Analysis measured how many output tokens each model generated to complete their full Intelligence Index. Fewer output tokens for the same result means lower cost on reasoning-heavy workloads. The four models closest to Gemma 4 31B in overall intelligence all score 3 points higher (42 vs 39), but use significantly more tokens to get there:
| Model | Intelligence Index | Output tokens used |
|---|---|---|
| Gemma 4 31B | 39 | 39M |
| MiniMax-M2.5 | 42 | 56M |
| DeepSeek V3.2 (Reasoning) | 42 | 61M |
| Qwen3.5 27B (Reasoning) | 42 | 98M |
| GLM-4.7 | 42 | 167M |
What Gemma 4 31B offers is a substantially lower token spend for general reasoning workloads, such as document analysis, Q&A, summarisation, and code explanation. If your workload is primarily agentic (multi-step tool use, complex coding pipelines), GLM-4.7 is the stronger choice. But if you're optimising for cost on general reasoning tasks, Gemma 4 31B is hard to beat at this intelligence level.
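To make the token numbers concrete, here is the arithmetic as a short sketch. It applies the €0.50/Mtok output price quoted above to the token counts in the table; this is illustrative only, since the other models are not served at Berget's Gemma pricing, and input tokens are ignored for simplicity.

```python
OUTPUT_PRICE_EUR_PER_MTOK = 0.50  # Berget's Gemma 4 31B output price, from this post

def output_cost_eur(output_tokens_millions: float) -> float:
    """Output-token cost in EUR at the illustrative flat rate above."""
    return output_tokens_millions * OUTPUT_PRICE_EUR_PER_MTOK

# Output spend to run the full Intelligence Index, per the table above:
gemma_cost = output_cost_eur(39)   # Gemma 4 31B: 39M tokens -> EUR 19.50
glm_cost = output_cost_eur(167)    # GLM-4.7: 167M tokens -> EUR 83.50
```

At the same per-token rate, GLM-4.7's 167M tokens cost over four times as much as Gemma 4 31B's 39M for a 3-point index gain.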
What to look out for
During our initial testing, we found that Gemma 4 31B handles function calling well overall but struggles with longer chains of tool calls. We've seen substantial gains from recent upstream improvements in vLLM, but we still recommend testing your own pipeline if you anticipate heavy tool use.
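One way to exercise chained tool calls before committing to the model is a small local harness. The sketch below assumes the OpenAI-style tool-call response shape (`tool_calls[i].function.name` / `.arguments`, results fed back as `role: "tool"` messages); the tools themselves are hypothetical stand-ins.

```python
import json

# Hypothetical local tools for exercising chained tool calls.
TOOLS = {
    "get_weather": lambda args: {"temp_c": 7.0, "city": args["city"]},
    "convert_temp": lambda args: {"temp_f": args["temp_c"] * 9 / 5 + 32},
}

def run_tool_call(call: dict) -> dict:
    """Execute one tool call as it appears in an OpenAI-style response."""
    fn = call["function"]
    args = json.loads(fn["arguments"])
    return TOOLS[fn["name"]](args)

def run_chain(calls: list[dict]) -> list[dict]:
    """Run a chain of tool calls, producing the tool-role messages you
    would append to the conversation before the next model turn."""
    messages = []
    for call in calls:
        result = run_tool_call(call)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages
```

Drive this with the `tool_calls` the model actually emits over several turns; if the chain stalls or the model stops requesting the next tool, you've found the limitation we describe above.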
Learn more
For more details, see the following resources:
- Google model card — official benchmark table and architecture details
- Hugging Face model page — weights and community discussion
- Artificial Analysis evaluation — independent benchmark breakdown and token efficiency analysis
- EuroEval leaderboards — Nordic and European language results
Join us on Matrix to discuss Gemma 4 31B and other models with the team and other community members.