Batch API Pricing Is 50% Off Across Every Major Provider. Here's How to Decide What Belongs There.
OpenAI, Anthropic, Google, and Groq all offer 50% off batch processing in exchange for a 24-hour completion window. Stacked with prompt caching, the effective discount reaches 75-95%. The hard part is figuring out which workloads can actually wait.
OpenAI gives you 50% off every model when you send the request through the Batch API and accept a completion window of up to 24 hours. Anthropic does the same. Google does the same on Gemini. Groq does the same on its open-weight tier. The discount is flat, model-agnostic, and stacks with prompt caching for an effective 75-95% reduction on the right workloads.
The math is simple. The execution is not. Most engineering teams know batch tiers exist. Most have one or two workloads running on batch. Most have three to five more workloads that could be running on batch and are not, because the routing decision was made before batch was a serious lever and nobody has gone back to revisit it.
This post is about the workloads that should already be on batch, the ones that should not, and the surprisingly small set of architectural changes that move the needle.
What "batch" actually means in 2026
With batch, you submit a file of prompts to the provider; the requests are queued behind real-time traffic and the results come back within a stated window. Across providers:
- OpenAI: flat 50% off every model. 24-hour completion window. Most batches actually finish in 1-6 hours depending on system load. No priority tier; you accept the queue or you do not use batch.
- Anthropic: 50% off across all Claude models. 24-hour window. Batch results stream back as they complete; you do not have to wait for the whole batch to finish before reading partial results.
- Google Gemini: flat 50% discount on every request. 24-hour window. Particularly useful for long-context calls where the per-call cost is significant.
- Groq: 50% off batch processing. The window is generally shorter in practice (Groq's underlying speed advantage extends to its batch tier), often completing within 1-2 hours.
The discount is not a "negotiated rate." It is a published tier, available to any team with an API key, with no volume commitment.
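To make the mechanics concrete, here is a minimal submission sketch against OpenAI's Batch API using the Python SDK. The JSONL request format, `purpose="batch"`, and `completion_window="24h"` follow OpenAI's published Batch API; the model name, prompts, and file path are placeholders to swap for your own.

```python
# Minimal OpenAI Batch API submission sketch. Assumes OPENAI_API_KEY is set;
# the model, prompts, and file names are placeholders.
import json
from openai import OpenAI

client = OpenAI()

# 1. One JSONL line per request, each with a unique custom_id for reconciliation later.
requests = [
    {
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # placeholder; substitute the model you actually batch
            "messages": [{"role": "user", "content": f"Summarize document {i}"}],
        },
    }
    for i in range(1000)
]
with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")

# 2. Upload the file and create the batch with the 24-hour completion window.
input_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # "validating" -> "in_progress" -> "completed"
```

The other providers' batch endpoints follow the same general shape: build a file or list of requests, submit it, poll for completion, download the results.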
What batch is for
Three workload patterns capture roughly 80% of production batch wins:
1. Overnight processing pipelines. ETL jobs that take a day's worth of customer data, classify it, summarize it, extract structured fields, and write the results to a database before the next business day. The latency budget is "by 8 AM tomorrow," not "in 200 milliseconds." These workloads have always belonged on batch; the only question is whether they currently are.
2. Large-scale evaluation runs. Quality evaluation on a routing change, a prompt tweak, or a model swap. Eval runs typically process 10K-100K test cases against multiple model variants. At standard pricing, a comprehensive eval can cost $500-$5,000 per run. At batch pricing, $250-$2,500. Teams that run evals weekly save thousands per quarter without changing anything else.
3. Document processing backlogs. Onboarding flows where new customers upload years of historical documents to be summarized, indexed, or analyzed. The user does not need the results immediately; a "your documents will be ready within 24 hours" message is acceptable. The cost gap between batch and real-time on a 50K-document onboarding workload is the difference between profitable and unprofitable for the customer-onboarding line item.
What ties these together: the user's mental model already includes a wait. Batch wins when the latency expectation has been set elsewhere in the product.
What batch is not for
Three patterns where the math goes the other way:
Anything user-facing. Chat, search, autocomplete, real-time recommendations. The 24-hour window kills the user experience even if the average completion is 1-6 hours. Tail risk is the killer here: one batch in ten that takes the full 24 hours is enough to break the product.
Multi-step agent loops. Each step of an agent loop depends on the previous step's output. Even if the latency budget is generous, a batch round-trip per step makes the agent unusable. This is true even for "research agent" workloads where the user expects to wait minutes; 24 hours is on a different scale entirely.
Workloads with strict completion timing. End-of-month reports, intraday compliance checks, time-zone-sensitive notifications. The 24-hour window is "up to," and a batch that finishes 23 hours after submission misses a 10 PM cutoff. Real-time tier with retries is more predictable for these.
The rule that captures the boundary: if the workload can tolerate the worst case of 24 hours, batch. If it cannot, do not.
The stacking math nobody walks through
Batch is a multiplicative discount, not an additive one. It stacks cleanly with prompt caching, with reserved capacity, and with model-tier routing.
Take a typical evaluation pipeline running 1M monthly tokens against GPT-5.4 (assume base $2.50 input / $10 output):
- Standard real-time: 500K input + 500K output = $1.25 + $5.00 = $6.25/month per million tokens
- With Batch API (50% off): $3.125/month
- With Batch API + 80% prompt cache hit on input: 100K full-price input + 400K cached input (billed here at 10% of the base input rate) + 500K output, all batched: $0.125 + $0.05 + $2.50 = $2.675/month
- Same pipeline routed to GPT-5.4-Mini on batch: roughly $0.40/month
That is a 94% reduction from the starting point, achieved by combining three independent pricing tiers (model selection, batch discount, cache discount) without any new infrastructure. We covered the caching half of this in our guide to reducing OpenAI API costs; batch is the second half of the same compounding story.
Most teams capture one of these levers, sometimes two. Capturing all three is the difference between a $50K monthly bill and a $5K monthly bill on the same workload, with no quality change.
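For anyone who wants to re-run the arithmetic, here is the same stack as a small script. The rates are the assumed base prices from the example, a flat 50% batch multiplier, and the cached-input rate implied by the $0.05 line (10% of the base input price).

```python
# Worked version of the stacking math above. All rates are assumptions taken
# from the example: $2.50/M input, $10/M output, 50% batch discount, and
# cached input billed at 10% of the base input rate.
INPUT_RATE, OUTPUT_RATE = 2.50, 10.00  # $ per million tokens
BATCH_MULT = 0.5                       # batch tier multiplier
CACHED_MULT = 0.10                     # cached-input multiplier (assumed)

def monthly_cost(input_m, output_m, cache_hit=0.0, batched=False):
    """Dollar cost for input_m / output_m million tokens per month."""
    fresh = input_m * (1 - cache_hit) * INPUT_RATE
    cached = input_m * cache_hit * INPUT_RATE * CACHED_MULT
    total = fresh + cached + output_m * OUTPUT_RATE
    return total * (BATCH_MULT if batched else 1.0)

print(round(monthly_cost(0.5, 0.5), 3))                              # 6.25  real-time
print(round(monthly_cost(0.5, 0.5, batched=True), 3))                # 3.125 batch only
print(round(monthly_cost(0.5, 0.5, cache_hit=0.8, batched=True), 3)) # 2.675 batch + cache
```

Swap in a smaller model's rates and the same function estimates the Mini line the same way.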
The architectural cost of batching that nobody mentions
The 50% discount is real, but the operational cost is not zero.
Job orchestration. Real-time API calls are stateless. Batch jobs require submission, polling, retry logic, and result reconciliation. If your codebase does not already have a queue-and-result pattern (Celery, Sidekiq, SQS-backed worker pool), adding one for batch is a week of work.
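If none of that exists yet, the minimum viable version is a polling loop plus a result-reconciliation step. Here is a rough sketch against OpenAI's Batch API; the status values and file fields follow the published API, while `store_result` and `requeue_for_realtime` are hypothetical stand-ins for whatever your pipeline does with good and bad rows.

```python
# Rough polling-and-reconciliation sketch for a submitted OpenAI batch.
# store_result and requeue_for_realtime are hypothetical placeholders.
import json
import time
from openai import OpenAI

client = OpenAI()
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(batch_id, poll_seconds=300):
    """Poll until the batch reaches a terminal status (use a worker, not a blocking loop, in production)."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        time.sleep(poll_seconds)

def reconcile(batch):
    """Store successes, requeue failures; they come back as separate JSONL files."""
    if batch.output_file_id:
        for line in client.files.content(batch.output_file_id).text.splitlines():
            result = json.loads(line)
            store_result(result["custom_id"], result["response"])  # placeholder
    if batch.error_file_id:
        for line in client.files.content(batch.error_file_id).text.splitlines():
            err = json.loads(line)
            requeue_for_realtime(err["custom_id"])  # placeholder real-time fallback
```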
Observability gap. Real-time calls are easy to monitor: latency, error rate, p99. Batch jobs introduce a new failure mode (batch returned partially, batch timed out, batch result file truncated) that needs separate monitoring infrastructure. Teams that bolt batch onto a real-time codebase often have batch failures sitting silent for days.
Schema drift. A real-time API call validates schema on the way out and on the way back. A batch submission validates only at the end. If you ship a prompt template that produces invalid JSON in 1% of cases, real-time catches it the first day; batch catches it 24 hours later, after 50,000 results have come back malformed.
These are not reasons to avoid batch. They are reasons to invest the infra week before moving meaningful workloads onto it.
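A large part of that infra week is simply validating results as they are reconciled, so that schema drift and partial returns surface as an alert instead of a 50,000-row surprise. A sketch, assuming the prompts are supposed to return JSON and using a made-up set of required fields:

```python
# Sketch: validate batch results at reconciliation time so schema drift becomes
# a metric, not a silent failure. REQUIRED_FIELDS is a made-up example schema;
# the content path follows OpenAI's batch output format for chat completions.
import json

REQUIRED_FIELDS = {"customer_id", "category", "summary"}  # hypothetical schema

def is_valid(result: dict) -> bool:
    try:
        content = result["response"]["body"]["choices"][0]["message"]["content"]
        parsed = json.loads(content)
    except (KeyError, TypeError, IndexError, json.JSONDecodeError):
        return False
    return isinstance(parsed, dict) and REQUIRED_FIELDS.issubset(parsed)

def check_batch(results: list[dict], tolerated_rate: float = 0.01) -> list[str]:
    """Return custom_ids that failed validation; raise if the failure rate is too high."""
    malformed = [r["custom_id"] for r in results if not is_valid(r)]
    if results and len(malformed) / len(results) > tolerated_rate:
        raise RuntimeError(
            f"{len(malformed)}/{len(results)} batch results failed schema validation"
        )
    return malformed
```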
A staged rollout that actually works
For a team running a $50K/month LLM bill, here is the sequence that captures most of the available batch savings without breaking anything:
- Week one: identify candidates. List every workload by token volume and ask, "what is the latency requirement?" Sort the list by tokens-per-month for workloads where the answer is "more than an hour." (A rough sketch of this triage follows the list.)
- Week two: pick one. The largest non-user-facing workload. Build the batch submission and result-reconciliation logic for that one workload only.
- Week three: instrument. Add monitoring for batch failures, partial returns, and schema drift specific to that workload.
- Week four: cut over. Move the workload to batch. Watch the bill drop. Keep a small percentage on real-time for a week as a quality control.
- Month two onwards: repeat the pattern for the next-largest qualifying workload. Each cycle gets faster as the orchestration code matures.
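A minimal version of the week-one triage, with made-up workload names, volumes, and a made-up blended rate standing in for your actual bill:

```python
# Week-one triage sketch: keep the workloads that can tolerate a 24-hour worst
# case, rank them by monthly token volume, and estimate the batch saving.
# Workload names, volumes, and the blended rate are illustrative assumptions.
workloads = [
    {"name": "chat-frontend",        "tokens_m_per_month": 40, "can_wait_24h": False},
    {"name": "nightly-etl-classify", "tokens_m_per_month": 25, "can_wait_24h": True},
    {"name": "weekly-eval-suite",    "tokens_m_per_month": 12, "can_wait_24h": True},
    {"name": "doc-backlog-indexing", "tokens_m_per_month": 8,  "can_wait_24h": True},
]

BLENDED_RATE = 5.00   # assumed blended $ per million tokens
BATCH_DISCOUNT = 0.5

candidates = sorted((w for w in workloads if w["can_wait_24h"]),
                    key=lambda w: w["tokens_m_per_month"], reverse=True)
for w in candidates:
    saving = w["tokens_m_per_month"] * BLENDED_RATE * BATCH_DISCOUNT
    print(f'{w["name"]}: ~${saving:,.0f}/month if moved to the batch tier')
```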
Teams that try to move five workloads to batch in the same sprint typically end up with three half-working integrations and a real-time fallback for the other two. The slow rollout captures more savings.
This is the same staged-rollout philosophy we walked through in our broader guide to reducing OpenAI API costs by 50-70% without changing your code. Batch is one lever among several. Stacked with caching, model routing, and prompt compression, the cumulative discount lands well above any single lever's ceiling.
Where PromptUnit fits
PromptUnit's observation mode logs every API call and classifies it by task type and latency pattern. After 14 days, you can see which of your workloads are candidates for batch by token volume and request frequency, before you move a single workload or write any orchestration code. The dashboard shows what your bill would look like if the batch-eligible slice were running on the batch tier, so the decision is data-driven rather than a guess.
The real-time proxy layer (model routing, prompt compression, caching) handles the non-batch workloads in the meantime. Both levers are visible in the same cost breakdown.
If your monthly LLM bill includes workloads that could plausibly run overnight and you have not moved them to batch, the savings are typically in the 30-45% range on those endpoints alone. Start the free observation period at promptunit.ai and see what fraction of your traffic is sitting on the wrong tier.