Your OpenAI API Bill Is Too High: Six Options, Ordered…

A team running GPT-5.4 for all their API calls at $2.50 per million input tokens and $15.00 per million output tokens can often cut their bill to under 40% of its current level without their users noticing a difference. The savings come from six approaches, none of which require rebuilding the product. They stack on top of each other, and the first two require essentially no engineering work. Start there.

Option 1: Model Downgrade (Zero Engineering Effort)

The biggest single lever is also the easiest to pull. If your application is using GPT-5.4 ($2.50/$15.00) for tasks that do not require frontier reasoning, GPT-5.4 mini ($0.75/$4.50) handles them at 3.3 times lower cost on both input and output. That is not a rounding error. A workload costing $3,000 per month on GPT-5.4 costs roughly $900 per month on GPT-5.4 mini for the same call volume.

The practical test is simple: run 200 representative samples from your production traffic through both models. Score the outputs using whatever quality metric matters for your use case: accuracy on a labeled test set, human review ratings, downstream task performance. For classification, extraction, summarization, and Q&A with well-defined source context, GPT-5.4 mini typically performs within a few percentage points of GPT-5.4. For tasks requiring complex multi-step reasoning, extended chains of logic, or synthesis across long documents, the gap is larger.

If you are running GPT-5.4 mini today and looking to go further, GPT-4o-mini at $0.15 per million input tokens and $0.60 per million output tokens is 5 times cheaper on input and 7.5 times cheaper on output. The quality floor is lower, but for simple, high-volume tasks, it is remarkably capable. A workload costing $900 per month on GPT-5.4 mini might cost $180 per month on GPT-4o-mini if the task complexity allows it. Run the eval and find out. The downside of a failed experiment is a few hours of engineering time. See the GPT-4o vs GPT-4o-mini comparison guide for a framework on making this decision.

Option 2: Output Length Control (Low Engineering Effort)

Output tokens cost significantly more than input tokens on all OpenAI models. On GPT-5.4, output costs $15.00 per million versus $2.50 per million for input, a 6:1 ratio. On GPT-4o-mini, it is $0.60 versus $0.15, a 4:1 ratio. Reducing output token count has an outsized impact on total cost compared to reducing input by the same amount.

Adding output constraints to your existing prompts takes 10 minutes and costs nothing. "Respond in under 100 words" cuts verbose responses. "Return only the JSON object, no explanation or preamble" eliminates all the conversational wrapper around structured outputs. "Use at most three sentences" bounds conversational replies. If your application currently generates 300 tokens of output on average and these constraints bring it to 180 tokens, you have cut output costs by 40%. On a $2,000 monthly OpenAI bill where 70% is output token cost, that is $560 per month saved from prompt edits.

Test the constrained prompts against your quality metric before deploying. For most structured tasks, shorter outputs are not worse outputs, they are more direct outputs. The model can usually answer the question in 100 words. The extra 200 tokens are hedging, context-setting, and preamble that your application discards anyway.

Option 3: Prompt Caching (Medium Engineering Effort)

OpenAI's prompt caching is automatic for prompts over 1,024 tokens. Cached tokens cost 50% less than uncached tokens. The requirement is structural: your static content, specifically the system prompt, must appear at the beginning of every request. This is the default structure for most applications, so for many teams, caching is already working without any changes.

The savings are meaningful for high-volume workloads with consistent system prompts. Consider a workload using GPT-4o-mini with a 2,000-token system prompt and 50,000 calls per day. Uncached input cost for the system prompt alone: 2,000 times 50,000 divided by 1,000,000 times $0.15 equals $15 per day. With caching, the portion of calls that hit the cache pays half that rate. Assuming a 90% cache hit rate (realistic for stateless API calls with the same system prompt), effective cost drops to roughly $8.25 per day. That is $6.75 per day in savings, or roughly $200 per month, from zero code changes if your prompts are already structured correctly.

For Anthropic models, the caching lever is even larger. Prompt caching on Claude models cuts cache read costs to 0.10x the base rate, a 90% discount rather than 50%. If you are evaluating whether to stay on OpenAI or move some workloads to Anthropic, the caching discount difference is a significant factor for workloads with large, stable system prompts. The prompt caching cost savings guide covers both providers in detail.

Option 4: Prompt Compression (Medium Engineering Effort)

After caching, the next target is the non-cached portions of your prompt: the user-specific context, the conversation history, and the query itself. Prompt compression applies systematic token reduction to this content before it reaches the API.

The three highest-impact compression techniques are system prompt consolidation (removing redundant instructions), context trimming (summarizing or truncating old conversation turns), and few-shot reduction (testing whether fewer examples maintain quality). Together, these typically reduce variable prompt content by 30 to 50%.

A concrete example: a summarization endpoint with a 1,200-token system prompt, 2,800-token document context, and 500-token user query totals 4,500 tokens per call. After compression: 400-token system prompt, 1,400-token trimmed context, 300-token query totals 2,100 tokens. A 53% reduction. At $2.50 per million input tokens on GPT-5.4 and 100,000 calls per month, that is $610 per month in input cost savings from prompt optimization. For the detailed technique breakdown, see LLM token reduction techniques.

Option 5: Cross-Provider Routing (Higher Engineering Effort)

OpenAI is not always the right provider for every call. For tasks where an alternative model performs adequately, routing those calls to a cheaper provider can produce substantial savings. The question is which tasks qualify.

Gemini 3 Flash at $0.25 per million input and $1.50 per million output handles classification, extraction, and simple Q&A at competitive quality levels for a fraction of OpenAI's cost. A call that costs $0.40 on GPT-5.4 might cost $0.05 to $0.10 on Gemini 3 Flash for straightforward tasks. Claude Haiku 4.5 at $1.00/$5.00 per million is more expensive than GPT-4o-mini but competitive for tasks where it outperforms, particularly multi-turn conversation and instruction-following scenarios.

Cross-provider routing requires maintaining API integrations with multiple providers and building quality validation for the routed calls. It is more engineering work than the earlier options, but for high-volume workloads, the payoff is proportionally larger. See the cross-provider LLM routing guide for implementation patterns that handle fallback, quality scoring, and provider selection logic.

Option 6: Batch API (Higher Effort for Some Teams, Easy for Others)

OpenAI's Batch API applies a 50% discount to all models for asynchronous workloads. Requests are submitted in bulk, and results are returned within 24 hours. For workloads that do not require real-time responses, this is one of the simplest cost reductions available.

The use cases that qualify for batching are broader than most teams assume. Nightly data enrichment pipelines, content moderation queues, document classification for uploaded files, embedding generation, and analytics processing are all candidates. If a workload runs on a schedule or processes a queue rather than responding to live user requests, it likely qualifies.

The code change is minimal: use the batch API endpoint instead of the synchronous endpoint, and handle the asynchronous result retrieval. For a workload currently costing $2,000 per month that can be moved to batch, the savings are $1,000 per month. The implementation takes a day. For more on the routing decision between batch and real-time, see the batch API routing decision guide.

What Combined Savings Look Like

Applying all six options in sequence to a representative high-volume workload:

Starting point: $5,000 per month on GPT-5.4 for mixed classification, extraction, and generation workloads.

Model downgrade (classification and extraction to GPT-5.4 mini or GPT-4o-mini where quality holds): saves 50 to 60% of those workloads' costs. New baseline: roughly $3,200 per month.

Output length control: reduces output costs by 35% across all remaining workloads. New baseline: roughly $2,500 per month.

Prompt caching: reduces system prompt costs by 50% for all calls. New baseline: roughly $2,100 per month.

Prompt compression: reduces variable token count by 40%. New baseline: roughly $1,500 per month.

Batching applicable async workloads: cuts those workloads' costs by 50%. New baseline: roughly $1,100 per month.

Cross-provider routing for tasks that pass quality checks on cheaper models: reduces remaining workload costs by another 20 to 30%. Final baseline: roughly $800 per month.

The path from $5,000 to $800 per month is not a single dramatic change. It is six incremental optimizations applied in order of effort. Each one compounds on the others. The first two take a day. The last two take a week or more. But the direction is clear, and the math holds.

PromptUnit handles options 4 and 5 automatically, compressing prompts before sending and routing calls to cheaper providers when quality thresholds are met, without requiring separate integrations for each approach.

If your OpenAI bill has grown faster than your usage or revenue, PromptUnit gives you the per-call cost data to identify where to apply these optimizations first.

Your OpenAI API Bill Is Too High: Six Options, Ordered by Effort