Gemma 4 31B Fits on a Single H100. Here's the Self-Host vs API Routing Math.
Google's Gemma 4 31B Dense lands at 89.2% on AIME and 85.2% on MMLU Pro, ships under Apache 2.0, runs in BF16 on a single H100, and serves 855 tokens per second. The break-even point against API providers just dropped. Here is the routing math.
Gemma 4 31B Dense scores 89.2% on AIME 2026 math, 85.2% on MMLU Pro, 80.0% on LiveCodeBench v6, and 84.3% on GPQA Diamond. It runs in BF16 on a single 80GB H100 with 8K of context headroom, or unquantized with extended context on a 192GB B200 (the model supports up to 256K tokens). Peak throughput on a single H100 is 855 tokens per second, enough to serve roughly 28 concurrent users each streaming at 30 tok/s. The license is Apache 2.0. The release date was April 2, 2026.
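The memory and concurrency claims reduce to back-of-the-envelope arithmetic. A minimal sketch; the 2-bytes-per-parameter figure for BF16 and the 30 tok/s per-user streaming rate are assumptions, not vendor numbers:

```python
# Rough check on the single-H100 fit and the concurrency figure.
PARAMS_B = 31            # Gemma 4 31B Dense, billions of parameters
BYTES_PER_PARAM = 2      # BF16
H100_MEM_GB = 80

weights_gb = PARAMS_B * BYTES_PER_PARAM       # ~62 GB of weights
kv_headroom_gb = H100_MEM_GB - weights_gb     # ~18 GB left for KV cache + activations

AGGREGATE_TPS = 855      # peak decode throughput on one H100
PER_USER_TPS = 30        # comfortable streaming rate per user
concurrent_users = AGGREGATE_TPS // PER_USER_TPS  # ~28 users

print(f"weights: {weights_gb} GB, headroom: {kv_headroom_gb} GB, streams: {concurrent_users}")
```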
The numbers matter because they reset a question most engineering teams stopped asking sometime in 2024: when does self-hosting an open model actually beat paying an API provider?
For most of 2024 and 2025, the answer was "almost never, unless you have a regulatory or data-residency reason." The open models that fit on consumer-tier hardware were too weak. The open models that competed with frontier APIs needed multi-GPU inference rigs that cost more to run than the API call. Self-hosting was a compliance story, not a cost story.
Gemma 4 changes the math. Here is what shifted, what did not, and which workloads should actually move.
The break-even floor
Take a single H100 SXM5 80GB on a major cloud at $3.50/hr reserved or $4.90/hr on-demand. Call it roughly $2,500/month on reserved capacity, a bill that is fixed whether the card sits idle or runs flat out.
At 855 tokens per second peak, that single H100 can theoretically push out 2.2 billion tokens per month. Real utilization for a typical SaaS workload sits closer to 25-40% (peak hours dominate, traffic is spiky). Call it 30%, which is 660M tokens/month per H100.
That works out to a marginal cost of roughly $3.80 per million output tokens at 30% utilization, falling to $1.15 per million at sustained 100% load.
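If you want to rerun that with your own GPU price or utilization estimate, the calculation is a few lines. A minimal sketch using the working numbers above (the 30-day month is an assumption):

```python
# Marginal cost per million output tokens at a given average utilization.
SECONDS_PER_MONTH = 30 * 24 * 3600
PEAK_TPS = 855                 # tokens/second at full load
MONTHLY_COST_USD = 2500        # reserved H100, as above

def cost_per_mtok(utilization: float) -> float:
    """Dollars per million output tokens at the given average utilization."""
    tokens_per_month = PEAK_TPS * SECONDS_PER_MONTH * utilization
    return MONTHLY_COST_USD / (tokens_per_month / 1e6)

for u in (0.30, 0.60, 1.00):
    print(f"{u:.0%} utilization -> ${cost_per_mtok(u):.2f}/1M output tokens")
# ~$3.76 at 30%, ~$1.88 at 60%, ~$1.13 at 100%; close to the ~$3.80 and ~$1.15
# quoted above, which round the monthly token count to 660M and 2.2B.
```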
Compare that to what the API providers charge for comparable-quality models, quoted as input / output per million tokens:
- Gemini 2.5 Flash: $0.30 / $2.50
- Groq Llama 3.3 70B: $0.59 / $0.79
- GPT-4o-mini: $0.15 / $0.60
- DeepSeek V4 via API: $1.74 / $3.48
The honest read: at 30% utilization, self-hosted Gemma 4 31B is roughly at parity with DeepSeek V4's output price, more expensive than Gemini Flash, and several times the per-token cost of Groq's Llama tier or GPT-4o-mini. Against the cheapest tiers the pure per-token math never closes, even at 100% load; against Gemini Flash it only closes above roughly 45% sustained utilization, which most teams cannot hold without aggressive batching.
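The same arithmetic, inverted, gives the utilization you would need to sustain before self-hosting undercuts each API tier. A minimal sketch reusing the working numbers; it compares only against output prices, which already flatters self-hosting since input tokens are ignored:

```python
# Utilization at which the self-hosted per-token cost matches an API output price.
PEAK_TPS = 855
MONTHLY_COST_USD = 2500
MAX_MTOK_PER_MONTH = PEAK_TPS * 30 * 24 * 3600 / 1e6   # ~2,216 MTok on one H100

def breakeven_utilization(api_output_price: float) -> float | None:
    """Return the break-even utilization, or None if it would exceed 100%."""
    u = MONTHLY_COST_USD / (MAX_MTOK_PER_MONTH * api_output_price)
    return u if u <= 1.0 else None

for name, price in {
    "Gemini 2.5 Flash": 2.50,
    "Groq Llama 3.3 70B": 0.79,
    "GPT-4o-mini": 0.60,
    "DeepSeek V4": 3.48,
}.items():
    u = breakeven_utilization(price)
    print(f"{name}: {'never' if u is None else f'{u:.0%}'}")
# DeepSeek V4: ~32%, Gemini Flash: ~45%, Groq and GPT-4o-mini: never
```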
So why bother?
Where self-hosting wins, despite the math
Three workload types flip the calculation:
1. Steady-state inference where utilization stays high. Background jobs, batch summarization pipelines, embedding generation, scheduled report rendering. If you can fill an H100 to 70%+ load consistently, the per-token cost drops below most API tiers and you keep the cost predictable. The unpredictability of API token pricing (Anthropic Opus 4.6 just dropped 67%, OpenAI doubled the frontier price last quarter) becomes a planning problem the self-hosted route avoids.
2. Workloads with strict data-residency or latency requirements. If your users are in a region where the nearest API endpoint is 200ms away, an in-VPC Gemma 4 deployment in your own region is faster than any API. With a well-tuned serving stack keeping time-to-first-token low, the latency story makes self-hosting genuinely competitive on user-facing chat.
3. Workloads where you fine-tune. Apache 2.0 lets you LoRA-tune Gemma 4 on your own data, deploy the adapted weights, and never share that data with a provider. For domain-specific tasks (legal-clause classification, medical-coding extraction, customer-support intent detection), a fine-tuned 31B can outperform a generic frontier API on the specific task while costing a fraction of the inference price.
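The adapter setup for that third path is itself small; the operational work is in the training data and evals, not the config. A minimal sketch with Hugging Face peft, assuming a hypothetical Hub id and the usual attention-projection target modules (confirm both against the actual release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_ID = "google/gemma-4-31b"   # hypothetical Hub id, check the real release

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

lora = LoraConfig(
    r=16,                          # adapter rank: quality vs adapter-size trade-off
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the 31B weights
```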
What does not win: anything where the workload is bursty, where peak throughput exceeds what one or two H100s can serve, or where the team does not have an MLOps function that can keep an inference cluster healthy. We covered the broader version of this argument in our analysis of why open-weight routing is now a real cost lever. DeepSeek V4 Pro made the API-side open-weight argument. Gemma 4 makes the in-VPC version.
What the benchmark numbers actually mean for routing
Gemma 4 31B is not strictly better than every comparable model. The composite benchmark average puts it slightly below GLM-5.1 and DeepSeek V3.2, both of which are 20-25x larger in total parameters and 2-3x larger in active parameters. Where Gemma 4 wins is the parameter-efficient slice: it beats every other open model that fits on a single 80GB GPU.
That parameter-efficiency point is the routing argument. If your routing layer scores requests by complexity and routes the easy 70% to a small model, Gemma 4 31B is the strongest open option in that tier. The right pattern, sketched in code after the list, looks like:
- Tier 1 (50-60% of traffic): Self-hosted Gemma 4 31B for classification, extraction, summarization, intent detection
- Tier 2 (25-35% of traffic): API provider (Groq Llama, Gemini Flash, GPT-4o-mini) for moderately complex tasks where API speed-of-iteration matters
- Tier 3 (5-15% of traffic): Frontier API (GPT-5, Claude Opus, DeepSeek R2 reasoning) for tasks that genuinely need the capability
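In code, the pattern is mostly a dispatch function sitting in front of your model clients. A minimal sketch with placeholder endpoints and a deliberately crude complexity heuristic; production routers score requests with a small classifier or a cheap model call rather than keyword matching:

```python
from dataclasses import dataclass

@dataclass
class Route:
    tier: int
    endpoint: str

TIER_1 = Route(1, "http://gemma4.internal:8000/v1")  # self-hosted, in-VPC
TIER_2 = Route(2, "api:gemini-2.5-flash")            # mid-tier API
TIER_3 = Route(3, "api:frontier-reasoning")          # frontier API

EASY_TASKS = {"classify", "extract", "summarize", "detect_intent"}
HARD_HINTS = ("prove", "multi-step plan", "refactor this module", "legal opinion")

def route(task: str, prompt: str) -> Route:
    if task in EASY_TASKS and len(prompt) < 8_000:
        return TIER_1            # bulk of traffic: cheap, fast, local
    if any(hint in prompt.lower() for hint in HARD_HINTS):
        return TIER_3            # genuinely hard: pay for the frontier
    return TIER_2                # everything in between

print(route("classify", "Is this ticket a billing issue or a technical issue?"))
```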
Most teams collapse this into "everything goes to GPT-4o" or "everything goes to a smart router." Either extreme leaves money on the table. The three-tier pattern is what we walked through in our complete LLM model routing guide; Gemma 4 is the first open model that makes Tier 1 viable on a single GPU at production quality.
The operational tax nobody talks about
The hidden cost of self-hosting is not the GPU bill. It is everything around the GPU bill:
- vLLM or TensorRT-LLM serving stack to maintain (a minimal config sketch follows this list)
- Auto-scaling logic for spiky traffic
- Quantization decisions (BF16 vs FP8 vs INT8) and the eval pass for each
- Monitoring for drift, throughput regressions, OOMs at peak
- On-call rotation for a new piece of infrastructure
- Eval pipelines to compare your fine-tuned variant against API alternatives over time
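Even the first bullet carries real configuration surface. A minimal vLLM sketch, with a hypothetical model id and untuned defaults, just to show the knobs you now own:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-31b",    # hypothetical Hub id, check the real release
    dtype="bfloat16",              # the single-H100 fit discussed above
    max_model_len=8192,            # matches the 8K context headroom on 80 GB
    gpu_memory_utilization=0.92,   # starting point, not a tuned value
)

params = SamplingParams(max_tokens=256, temperature=0.2)
outputs = llm.generate(["Classify this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```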
For a team with a dedicated ML platform group, this is a half-FTE of work. For a team without one, the overhead can erase the per-token savings entirely. The honest framing is that self-hosting Gemma 4 makes sense when you already run inference infrastructure for embeddings, vector search, or other ML workloads, and adding LLM inference is a marginal expansion of an existing competence. If you would be standing up the inference stack from scratch just for this, the API route is almost always cheaper once you account for engineering time.
What to do this quarter
If you are running a $50K+/month LLM bill, the practical sequence is:
- Identify your top three workloads by token volume.
- For each, ask: is this Tier 1 (small model territory) or Tier 2/3 (mid or frontier)?
- For Tier 1 workloads above 100M monthly tokens, run a Gemma 4 31B eval against your current API on a representative sample. If quality holds, calculate the break-even at your real utilization, not the theoretical max; a sketch of that calculation follows this list.
- If the break-even works, self-host that one workload first. Keep Tier 2 and Tier 3 on APIs for now.
- Reassess every quarter. The price-quality frontier shifts every six weeks; locking yourself into one route is the mistake.
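A few lines make that break-even step concrete. A sketch with invented workload names and volumes; the point is to compare the implied per-token cost against your current API bill before moving anything:

```python
PEAK_TPS = 855
MAX_MTOK_PER_MONTH = PEAK_TPS * 30 * 24 * 3600 / 1e6   # ~2,216 MTok per H100
MONTHLY_COST_USD = 2500

# workload -> (Tier 1 candidate?, millions of tokens per month), illustrative only
workloads = {
    "ticket_classification": (True, 220),
    "doc_summarization": (True, 140),
    "agentic_code_review": (False, 90),
}

tier1_mtok = sum(mtok for is_t1, mtok in workloads.values() if is_t1 and mtok >= 100)
utilization = tier1_mtok / MAX_MTOK_PER_MONTH
cost_per_mtok = MONTHLY_COST_USD / tier1_mtok if tier1_mtok else float("inf")

print(f"Tier 1 volume: {tier1_mtok} MTok/month")
print(f"Implied H100 utilization: {utilization:.0%}")
print(f"Self-hosted cost: ${cost_per_mtok:.2f}/1M tokens vs. your current API rate")
```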
The teams that win on cost over the next two years are not the ones that pick the cheapest route today. They are the ones that build a routing layer flexible enough to move workloads between self-hosted and API tiers as the price-quality frontier shifts.
How PromptUnit fits into this
PromptUnit currently routes across OpenAI, Anthropic, Google, and Groq, which covers the API side of the three-tier model above. The 14-day observation period identifies which of your workloads are Tier 1 candidates (cheap, fast, small-model territory) versus Tier 2 and Tier 3, before any traffic shifts.
The roadmap direction is to treat self-hosted endpoints as first-class routing targets alongside the API providers, so teams can move Tier 1 workloads to an in-VPC Gemma 4 deployment without changing their application code. That work is in progress.
For now, if you want to know which of your workloads belong on a smaller model tier — whether API or self-hosted — the observation period gives you the answer. Start at promptunit.ai.