DeepSeek R2 vs o3: A 30x Cost Drop on Reasoning, With Better Benchmarks
DeepSeek R2 ships as a 32B open-weight reasoning model at $0.07/$0.27 per million tokens, beating o3 on MATH, GPQA, AIME and Codeforces. If you default to o3 for reasoning, you are now paying a 30x premium. Here is how to route around it.
DeepSeek R2 launched this month at $0.07 per million input tokens and $0.27 per million output tokens. OpenAI's o3 sits at $2.00 / $8.00. That is a 28x gap on input and a 30x gap on output. The expected tradeoff would be a quality drop. R2 does the opposite: it scores 98.1% on MATH vs o3 at 97.0%, 78.4% on GPQA vs 75.1%, 92.7% on AIME 2025 vs 80.8%, and 2,318 on Codeforces vs 2,104.
If your production traffic routes "reasoning-heavy" calls to o3 or o3-pro by default, the math no longer works. You are paying a 30x premium for slightly worse benchmark scores on the categories that supposedly justify the spend.
This post is about what changed, what to do about it, and the two routing mistakes most teams will make in the next 30 days.
The reasoning-model premium that just collapsed
For most of the last 18 months, teams treated "reasoning" as a justification for using the most expensive model in the lineup. The argument went something like: this is a math problem, or a coding problem, or a multi-step planning problem, so we need o3 / o4-mini / Claude Opus reasoning mode, and the cost is the cost.
That argument was already shaky. We covered the broader version in our analysis of 10,000 GPT-4o calls, where 60% didn't need GPT-4o. The reasoning-model premium was a smaller, more defensible version of the same overpayment. Reasoning models really do beat non-reasoning models on hard math, complex coding, and multi-step logic. So if your task is genuinely reasoning-heavy, the premium had a reason.
DeepSeek R2 is a 32B dense open-weight model released under an MIT license, built from the DeepSeek V4 base (800B total parameters, 45B active in MoE form) and trained with GRPO v2, with explicit structured thinking and code execution inside the reasoning trace. It uses roughly 20% fewer tokens per problem than its predecessor, which compounds the cost advantage because reasoning models bill for the chain-of-thought tokens, not just the final answer.
Combine the per-token price with the lower token count and the effective cost gap on real reasoning workloads is closer to 35-40x, not 30x.
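To see where the 35-40x figure comes from, here is a quick back-of-the-envelope check using the list prices above and the claimed ~20% token reduction. The constant names are just for illustration:

    # Back-of-the-envelope: effective cost gap once R2's shorter traces are included.
    O3_OUTPUT_PRICE = 8.00    # $ per 1M output tokens (o3 list price)
    R2_OUTPUT_PRICE = 0.27    # $ per 1M output tokens (R2 list price)
    R2_TOKEN_RATIO = 0.80     # R2 reportedly uses ~20% fewer tokens per problem

    price_only_gap = O3_OUTPUT_PRICE / R2_OUTPUT_PRICE                    # ~29.6x
    effective_gap = O3_OUTPUT_PRICE / (R2_OUTPUT_PRICE * R2_TOKEN_RATIO)  # ~37x

    print(f"price-only gap: {price_only_gap:.1f}x, effective gap: {effective_gap:.1f}x")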
What the benchmark data actually says
R2 is not strictly better than o3 on every axis. It is competitive or better on the four benchmarks where reasoning models earn their premium:
- MATH-500: R2 98.1%, o3 97.0%
- GPQA Diamond: R2 78.4%, o3 75.1%
- AIME 2025: R2 92.7%, o3 80.8%
- Codeforces rating: R2 2,318, o3 2,104
These are the benchmarks people cite when they argue for o3-class spend. They are also the benchmarks where a 32B open-weight model now leads.
What R2 does not do: tool use at o3's level, multimodal reasoning, or the long-context retrieval reasoning that o3-pro is tuned for. If your workload is "give me a 200-page contract and find every clause that conflicts with this regulation," o3-pro is still the right call. If your workload is "solve this competition math problem" or "fix this hard algorithmic bug," R2 is now both cheaper and better.
The two mistakes teams will make in the next 30 days
Mistake one: route everything labeled "reasoning" to R2.
This is the open-weight enthusiast's version of "default to o3." It collapses a routing decision into a model-class assumption. Most production calls labeled "needs reasoning" by an internal classifier or by a prompt-engineering convention do not need a reasoning model at all. They need a structured prompt, a few-shot example, and a non-reasoning model like GPT-4o-mini or Gemini 2.5 Flash at $0.10-$0.30 per million input tokens. Routing those calls to R2 is cheaper than routing them to o3, but a reasoning model still burns several times more output tokens per request on its chain of thought, so the per-request cost lands roughly 3-5x above a non-reasoning model that handles them just as well.
The right routing decision is two-stage: first decide whether the call needs reasoning at all, then decide which reasoning model. Skipping the first stage means R2 becomes your new expensive default.
Mistake two: assume R2 is a drop-in for o3 on every reasoning call.
R2 wins on math, science, and competitive coding. It loses on tool-heavy agentic loops, vision-grounded reasoning, and some long-context tasks. A naive "swap o3 for R2" rollout will silently degrade the workloads that depended on o3's tool-use reliability. If you have a function-calling-heavy agent, run the eval on R2 before you flip the routing rule.
The pattern that works: define your reasoning workloads by task type, route the math/science/code subset to R2, keep the tool-heavy agentic subset on o3 for now, and test R2 on tool use again in 60 days when the open-source community has had time to fine-tune it.
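As a starting point, that split can be a static routing table. A minimal sketch follows; the task labels and model identifiers are placeholders for whatever your classifier emits and whichever endpoints you actually call:

    # Hypothetical task-type -> model table; the labels come from your own classifier.
    ROUTING_TABLE = {
        "math": "deepseek-r2",             # MATH/AIME-style problems
        "science_qa": "deepseek-r2",       # GPQA-style questions
        "algorithmic_code": "deepseek-r2", # hard algorithmic bugs, competitive coding
        "agentic_tool_use": "o3",          # keep tool-heavy loops on o3 for now
        "vision_reasoning": "o3",          # R2 does not do multimodal reasoning
        "default": "gpt-4o-mini",          # calls that never needed a reasoning model
    }

    def pick_model(task_type: str) -> str:
        return ROUTING_TABLE.get(task_type, ROUTING_TABLE["default"])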
What this means for your bill
Take a typical mid-sized team running 50B reasoning-class tokens per month, split 30B input / 20B output.
On o3, that is $60K + $160K = $220K per month.
On R2, that is $2.1K + $5.4K = $7.5K per month (priced through DeepSeek's API; self-hosting is its own math).
Even if only 60% of your reasoning workload is the kind R2 handles well, you save roughly $127K per month by routing the right 60% to R2 and keeping the rest on o3. That assumes zero quality regression, which the benchmarks support for the right task subset.
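For anyone who wants to reproduce those figures, here is the arithmetic spelled out, assuming the token volumes and list prices above:

    # Monthly cost model for the example above (token volumes in millions of tokens).
    INPUT_M, OUTPUT_M = 30_000, 20_000          # 30B input, 20B output

    o3_cost = INPUT_M * 2.00 + OUTPUT_M * 8.00  # $220,000
    r2_cost = INPUT_M * 0.07 + OUTPUT_M * 0.27  # $7,500

    moved = 0.60                                # share of reasoning traffic R2 handles well
    blended = (1 - moved) * o3_cost + moved * r2_cost
    savings = o3_cost - blended                 # ~$127,500

    print(f"o3 only: ${o3_cost:,.0f}  blended: ${blended:,.0f}  savings: ${savings:,.0f}")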
For teams running closer to $500K-$1M monthly in reasoning spend, the savings are large enough that the routing logic pays for itself in days.
We laid out the underlying math in our OpenAI API cost calculator and pricing guide. Plug R2's numbers in against your current reasoning spend; the gap is hard to ignore.
Why this is a routing problem, not a vendor-switch problem
The temptation when a 30x cheaper model lands is to migrate. Rip out the OpenAI integration, point the SDK at DeepSeek, and call it a quarter.
That is the wrong shape of the answer. Production LLM workloads are not homogeneous. A real engineering team has:
- Reasoning calls that genuinely need o3-class capability (small share, high cost)
- Reasoning calls that R2 handles as well or better (medium share, was high cost)
- "Reasoning" calls that any non-reasoning model could handle (largest share, was high cost by accident)
- Non-reasoning calls already on cheaper models (working fine)
A vendor swap optimizes one of those four. A routing layer optimizes all four, and adapts when the next R3, GPT-6 reasoning mode, or Claude reasoning tier shifts the price-quality frontier again.
This is the argument we made in our cross-provider LLM routing post: paying less should not mean getting less. The DeepSeek R2 release is the strongest single data point in favor of that argument since the original DeepSeek V3 launch in late 2024.
How to implement two-stage reasoning routing
If you are doing this yourself, the logic is straightforward. Classify the request before you decide which model to use:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.promptunit.ai/v1",
    api_key="your-promptunit-key",
)

prompt = "Prove that the sum of two odd integers is even."

response = client.chat.completions.create(
    model="auto",  # PromptUnit routes based on task type
    messages=[{"role": "user", "content": prompt}],
    extra_headers={
        "x-promptunit-feature": "math-solver",  # tag by task type
    },
)

print(response.choices[0].message.content)
With PromptUnit, the task classification and model selection happen inside the proxy. Math and science calls route to R2. Tool-heavy agentic calls stay on o3. You do not maintain a routing table; the router learns from quality signals across calls.
Without a proxy, the minimum viable version is a local classifier that checks for keywords or embeds the request and measures similarity to a "reasoning required" centroid, then branches to the appropriate client. The operational cost of maintaining that classifier — keeping the routing rules current as models evolve — is the real argument for a routing layer.
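A minimal sketch of that no-proxy version, using a keyword check for stage one (does this need reasoning at all?) and a static choice for stage two (which reasoning model); the keyword lists and model names below are illustrative, not a tuned classifier:

    import re

    # Stage one: crude signal that the request needs a reasoning model at all.
    REASONING_HINTS = re.compile(
        r"\b(prove|derive|theorem|integral|optimi[sz]e|complexity|algorithm|debug)\b",
        re.IGNORECASE,
    )
    # Stage two: crude signal that the request leans on tools rather than pure reasoning.
    TOOL_HINTS = re.compile(r"\b(browse|search the web|call the api|run the tool)\b", re.IGNORECASE)

    def route(prompt: str) -> str:
        if not REASONING_HINTS.search(prompt):
            return "gpt-4o-mini"   # most "reasoning-labeled" calls land here
        if TOOL_HINTS.search(prompt):
            return "o3"            # tool-heavy agentic work stays on o3 for now
        return "deepseek-r2"       # math, science, and algorithmic-code reasoning

    print(route("Prove that the sum of two odd integers is even."))  # deepseek-r2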
How PromptUnit handles this
PromptUnit's routing layer treats R2 as another node in the cross-provider graph. When a request comes in, the router scores it on task type and complexity, checks the quality-fingerprint signal across millions of similar calls, and routes to the cheapest model that meets the quality bar. For reasoning workloads, that often means R2 today where it would have meant o3 last month, but only on the task subset where R2 actually leads. Customers do not change their code; the OpenAI-format request gets dialect-translated to DeepSeek's API, the response comes back in OpenAI format, and the savings show up on the next bill. The 14-day observation period catches quality regressions before any traffic shifts, and the circuit breaker keeps o3 as a hot fallback when R2's API hiccups.
If you are spending more than $5K/month on o3, o3-pro, or Claude Opus reasoning calls, the observation period takes 5 minutes to set up and the data will tell you exactly what fraction of that spend can move to R2 without touching quality. Start free at promptunit.ai.