
DeepSeek V4 Pro Landed at $1.74/$3.48. Open-Weight Routing Is Now a Real Cost Lever.

DeepSeek V4 Pro launched April 24, 2026 with frontier-grade coding benchmarks at roughly one-third the input price of Claude Opus 4.7 and one-eighth the output price of GPT-5.5. Here is where it fits in production routing.


DeepSeek dropped V4 Pro and V4 Flash on Hugging Face on April 24, 2026, after three delays spanning four months. The Pro variant is a 1.6 trillion-parameter MoE model with 49 billion active parameters, a 1 million-token context window, MIT-licensed open weights, and pricing on the hosted API at $1.74 per million input tokens and $3.48 per million output tokens.

That last number is the part to read twice. Claude Opus 4.7 costs $5 input and $25 output. GPT-5.5 costs $5 input and $30 output. DeepSeek V4 Pro is roughly one-third the input cost of either flagship, one-seventh the output cost of Opus 4.7, and one-eighth the output cost of GPT-5.5, and on three of the four major coding benchmarks it either matches or beats them.

This is the first time in 18 months that open-weight routing is a serious cost lever for production engineering teams, not a science project. The math, and the gotchas, are below.

The benchmarks, by the numbers

On SWE-Bench Verified, the standard for autonomous software engineering, DeepSeek V4 Pro scored 80.6%. Claude Opus 4.7 scored 87.6%. That is a 7-point Anthropic lead, and it is the one place where Opus still wins decisively.

On Terminal-Bench 2.0, V4 Pro scored 67.9%. Claude Opus 4.7 scored 65.4%. DeepSeek wins this one by 2.5 points.

On LiveCodeBench, V4 Pro hit 93.5% against Claude's 88.8%. A 4.7-point DeepSeek lead.

On Codeforces, the competitive programming benchmark, V4 Pro reached an Elo of 3,206. GPT-5.4 sat at 3,168. That is the highest competitive-programming score any model has posted publicly.

Add in the architectural work: V4's hybrid attention uses 27% of the FLOPs and 10% of the KV cache of V3.2 at 1M context, which means latency at long context is materially better than the prior generation and competitive with Anthropic and OpenAI on inference economics.

The summary: DeepSeek V4 Pro is a frontier-tier coding model. It loses on SWE-Bench Verified, the headline benchmark Anthropic optimizes for. It wins on terminal-style tasks, competitive programming, and long-form code generation. For most production coding workloads, that profile is more useful than the SWE-Bench Verified delta suggests.

The pricing math

Take a coding-heavy workload running 500 million tokens per month, split 50/50 input to output. Plug in current public pricing.

On Claude Opus 4.7 at $5/$25, the bill is $7,500 per month. On DeepSeek V4 Pro at $1.74/$3.48, the same workload costs $1,305 per month. The delta is $6,195 per month, or $74,340 per year, on a single workload.
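For teams that want to sanity-check the arithmetic against their own traffic shape, here it is as a short script. The rates are the public per-million-token prices quoted above; the workload shape is the 500M-token, 50/50 example.

```python
# Monthly bill for a workload, given per-million-token rates.
def monthly_cost(input_m, output_m, in_rate, out_rate):
    return input_m * in_rate + output_m * out_rate

# 500M tokens/month, split 50/50 input to output.
opus = monthly_cost(250, 250, in_rate=5.00, out_rate=25.00)   # $7,500
v4_pro = monthly_cost(250, 250, in_rate=1.74, out_rate=3.48)  # $1,305

delta = opus - v4_pro
print(f"Opus 4.7: ${opus:,.0f}/mo  |  V4 Pro: ${v4_pro:,.0f}/mo")
print(f"Delta: ${delta:,.0f}/mo (${delta * 12:,.0f}/yr)")
```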

You will not move 100% of traffic to DeepSeek. You should not. SWE-Bench-style hard coding tasks should still route to Opus 4.7 where the 7-point gap matters. But if 60% of your coding traffic is in the LiveCodeBench or Terminal-Bench category (and in our routing data, it usually is), routing that 60% to V4 Pro captures 60% of the $6,195 monthly delta: about $3,700 per month, or roughly $45,000 a year, on a single workload, by routing per task.

Bigger workloads scale linearly. At 5 billion tokens per month, the same 60% split captures closer to $450,000 a year, provided the routing classifier sends the right traffic to V4 Pro. We covered the underlying analysis pattern in our breakdown of 10,000 GPT-4o calls, and the conclusion holds for V4 Pro: most coding traffic does not need the most expensive model.
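The partial-routing version of the math is just the delta scaled by the routed fraction. A quick check of both numbers above:

```python
# Annual savings from routing a fraction of a 50/50 workload from
# Opus 4.7 ($5/$25) to V4 Pro ($1.74/$3.48).
def annual_savings(tokens_m_per_month, routed_fraction):
    half = tokens_m_per_month / 2
    monthly_delta = half * (5.00 - 1.74) + half * (25.00 - 3.48)
    return monthly_delta * routed_fraction * 12

print(f"${annual_savings(500, 0.60):,.0f}")    # ~$44,600/yr at 500M tokens/mo
print(f"${annual_savings(5_000, 0.60):,.0f}")  # ~$446,000/yr at 5B tokens/mo
```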

The self-hosting option

The other lever V4 Pro opens is self-hosting. The model is MIT-licensed and the weights are public on Hugging Face. For teams running 10 billion or more tokens per month, the math on self-hosting on H200s, MI355X, or rented Huawei Ascend 950 nodes starts to compete with the hosted API rate at scale.
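A rough way to frame that math is break-even utilization against the hosted rate. The node cost and throughput figures below are illustrative placeholders, not vendor numbers; substitute your own rental quotes and measured throughput before trusting the output.

```python
# Illustrative self-host break-even vs. the hosted $1.74/$3.48 rate.
# NODE_COST_PER_HOUR and NODE_TOKENS_PER_SEC are hypothetical
# placeholders -- plug in your own quotes and benchmarks.
HOSTED_BLENDED = (1.74 + 3.48) / 2   # $/M tokens at a 50/50 split
NODE_COST_PER_HOUR = 40.0            # hypothetical 8-GPU node rental
NODE_TOKENS_PER_SEC = 8000.0         # hypothetical sustained throughput

monthly_node_cost = NODE_COST_PER_HOUR * 24 * 30
monthly_capacity_m = NODE_TOKENS_PER_SEC * 86_400 * 30 / 1e6
self_host_rate = monthly_node_cost / monthly_capacity_m  # $/M at 100% busy

# Self-hosting wins once you keep the node busier than this fraction.
print(f"Self-host rate at full utilization: ${self_host_rate:.2f}/M tokens")
print(f"Break-even utilization: {self_host_rate / HOSTED_BLENDED:.0%}")
```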

Self-hosting is not free. You need GPU capacity, an inference server (vLLM, SGLang, or TensorRT-LLM), a routing layer that can talk to your endpoint in the same format as the hosted providers, and an on-call rotation that can debug an MoE model when expert routing goes sideways. For most teams below the 5-billion-tokens-per-month threshold, the hosted API at $1.74/$3.48 is still cheaper than running the model yourself.
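Mechanically, "the same format as the hosted providers" usually means an OpenAI-compatible server, which vLLM and SGLang both expose. A minimal sketch with vLLM, assuming the weights ship under a Hugging Face ID like deepseek-ai/DeepSeek-V4-Pro (a hypothetical ID) and that vLLM supports the architecture:

```python
# Server side (shell). vLLM serves an OpenAI-compatible API on :8000:
#   vllm serve deepseek-ai/DeepSeek-V4-Pro --tensor-parallel-size 8
# The model ID above is a hypothetical placeholder.

# Client side: point the standard OpenAI SDK at your own endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # your self-hosted endpoint
    api_key="unused",  # vLLM does not check the key unless configured to
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",  # must match the served model ID
    messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
)
print(resp.choices[0].message.content)
```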

But for teams above that threshold, the calculus is different than it was 30 days ago. V4 Pro is the first open-weight model where self-hosting at production scale produces a real bill cut against the hosted-frontier alternative. Llama 4 Maverick and Gemma 4 are good models; neither hits frontier coding benchmarks. V4 Pro does, and it is the inflection point for the open-weight self-host conversation.

Where V4 Pro routes

The routing rules look like this in our customer data so far (a minimal sketch of the mapping follows the list):

Route to V4 Pro: long-form code generation, terminal and shell agent tasks, competitive-programming-style algorithmic problems, code review on diffs, syntax-heavy refactors, structured-output coding (JSON, SQL, regex generation), and any task where LiveCodeBench-style performance is the primary quality signal. Long-context analysis tasks where the 1M window matters and latency is sensitive also fit here, given V4's hybrid-attention efficiency.

Keep on Opus 4.7: hard SWE-Bench-style autonomous coding agents, multi-file refactors with deep reasoning, agentic loops with five or more tool calls per turn, and anything where the 7-point SWE-Bench Verified gap turns into a meaningful quality gap on real customer outputs.

Keep on GPT-5.4 or 5.5: knowledge work, agentic browsing, OSWorld-style computer use, multimodal generation, and tool-use tasks where OpenAI's tooling stack still leads on agentic benchmarks.

Route to V4 Flash or smaller: classification, intent detection, retrieval reranking, structured extraction, and the rest of the categories we covered in our earlier comparison of when the cheaper model wins.
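As promised above, here is the mapping as a minimal sketch. The category labels and model IDs are illustrative placeholders; in production, a classifier sits in front of this table and assigns each request a category first.

```python
# Illustrative task-category -> model routing table. Labels and model
# IDs are placeholders; a real router classifies each request first.
ROUTES = {
    "codegen_long":       "deepseek-v4-pro",
    "terminal_agent":     "deepseek-v4-pro",
    "algorithmic":        "deepseek-v4-pro",
    "diff_review":        "deepseek-v4-pro",
    "structured_output":  "deepseek-v4-pro",
    "swe_agent_hard":     "claude-opus-4.7",
    "multifile_refactor": "claude-opus-4.7",
    "agentic_browsing":   "gpt-5.5",
    "computer_use":       "gpt-5.5",
    "classification":     "deepseek-v4-flash",
    "extraction":         "deepseek-v4-flash",
}

def pick_model(task_category: str) -> str:
    # Default up-tier when the classifier is unsure: misrouting a hard
    # task to a cheaper model costs more than the token savings.
    return ROUTES.get(task_category, "claude-opus-4.7")
```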

The pattern: the model lineup in April 2026 has more useful price-stratified options than at any point since 2023. Static "everything goes to GPT" defaults are now visibly leaving money on the table.

The objections, addressed

Three objections come up in every open-weight routing discussion, and they are worth answering directly.

First: data residency and compliance. DeepSeek V4 Pro on the hosted API runs on infrastructure operated by DeepSeek. For teams with EU data-residency requirements, FedRAMP requirements, or contractual obligations to keep customer data inside specific jurisdictions, the hosted API may not be usable. The MIT license and open weights make self-hosting on infrastructure you control a real option; unlike proprietary frontier models, V4 Pro can run entirely inside your own compliance boundary, and that is the path that makes it usable for regulated workloads. Most teams that hit this objection end up self-hosting V4 Pro and routing through it the same way they would route through OpenAI, which we covered conceptually in cross-provider LLM routing.

Second: rate limits and reliability. Hosted DeepSeek inference is newer than OpenAI or Anthropic, and rate limits at production scale will be a real constraint for the next several months. The mitigation is the same as for any provider: a routing layer that fails over automatically to a backup model when rate limits or errors spike. We wrote about why this matters earlier this week in our piece on the April 20 OpenAI outage and single-provider risk.
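A minimal version of that failover logic, assuming OpenAI-compatible endpoints on both sides; the URLs, keys, and model IDs are placeholders:

```python
# Try the primary provider; fall back on rate limits, connection
# failures, and 5xx errors. Endpoints and model IDs are placeholders.
from openai import OpenAI, APIConnectionError, APIStatusError, RateLimitError

PROVIDERS = [
    ("deepseek-v4-pro", OpenAI(base_url="https://api.deepseek.com/v1", api_key="...")),
    ("claude-opus-4.7", OpenAI(base_url="https://your-gateway.example/v1", api_key="...")),
]

def complete_with_failover(messages):
    last_err = None
    for model, client in PROVIDERS:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except (RateLimitError, APIConnectionError) as err:
            last_err = err       # 429 or network failure: try the next provider
        except APIStatusError as err:
            if err.status_code >= 500:
                last_err = err   # provider-side 5xx: try the next provider
            else:
                raise            # other 4xx is a bug in our request; surface it
    raise last_err
```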

Third: the geopolitical objection. DeepSeek is a Chinese company, and some teams have policies against routing customer traffic through Chinese-operated infrastructure. This is a legitimate constraint and the answer is the same as the data-residency answer: self-host the open weights, run inference on infrastructure you control, and capture the cost benefit without the sovereignty concern. The MIT license makes this clean.

What to do with this

If your team is running coding-heavy LLM workloads and has not benchmarked V4 Pro against your current routing setup, this is the highest-leverage thing you can do this week. The bake-off is straightforward: pull a representative sample of your coding traffic, run it through V4 Pro on a hosted endpoint or a self-hosted instance, and compare outputs against your current quality bar. The cost delta will surprise most teams, and the quality delta on most workloads will be smaller than the price tag implies.
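The replay loop itself is small. A sketch, assuming a JSONL sample of real requests and OpenAI-compatible endpoints for both models; the model IDs are placeholders, and scoring against your quality bar is the part you supply:

```python
# Replay a traffic sample through the incumbent and the candidate,
# storing outputs side by side for review. Model IDs are placeholders.
import json
from openai import OpenAI

incumbent = OpenAI()  # current default provider
candidate = OpenAI(base_url="https://api.deepseek.com/v1", api_key="...")

with open("traffic_sample.jsonl") as f, open("bakeoff.jsonl", "w") as out:
    for line in f:
        messages = json.loads(line)["messages"]
        a = incumbent.chat.completions.create(model="gpt-5.4", messages=messages)
        b = candidate.chat.completions.create(model="deepseek-v4-pro", messages=messages)
        out.write(json.dumps({
            "messages": messages,
            "incumbent": a.choices[0].message.content,
            "candidate": b.choices[0].message.content,
        }) + "\n")
```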

PromptUnit's Inferio routing layer is adding native DeepSeek V4 Pro and V4 Flash support to its provider lineup. Once live, coding traffic that currently routes to Claude Opus 4.7 or GPT-5.4 by default will be eligible to land on V4 Pro whenever the routing classifier flags a request as a LiveCodeBench-style or Terminal-Bench-style fit, automatically and per request, with no customer code changes. Teams swap a base URL, run for 14 days in observation mode to see projected savings, then flip the switch. Pricing is 20% of verified savings, with no flat fee.

If your team has been deferring open-weight routing because it never quite hit frontier quality, V4 Pro is the model that closes that gap on coding. Start at promptunit.ai.

Start your 14-day observation period

See exactly how much you'd save before paying anything. Zero risk: if we save you $0, you pay $0.

Get started free →