Prompt Caching on OpenAI and Anthropic: Real Savings Numbers
Prompt caching can cut 60% of your LLM input costs if your prompts repeat. Here is exactly how it works on both providers, when it helps, and when it does nothing.
A 2,000-token system prompt sent 100 times a day costs you the same as sending it 100 times without caching, unless you explicitly set up caching and your provider supports it. Most teams do not realize how much of their daily token spend is redundant context: the same system prompt, the same RAG documents, the same few-shot examples, repeated on every call. Prompt caching exists specifically to eliminate that waste, and on Anthropic's API the discount on a cache hit is 90%. The math changes significantly once you understand the mechanics of each provider's implementation.
How Anthropic Prompt Caching Works
Anthropic's prompt caching requires explicit opt-in. You mark sections of your prompt with a cache breakpoint, and Anthropic stores the KV cache for that prefix on their servers. Cache writes cost 1.25x the normal input token price. Cache reads cost 0.10x the normal input price, which is a 90% discount.
Working through the math for Claude Sonnet 4.6, which costs $3.00 per million input tokens and $15.00 per million output tokens: a cache write costs $3.75 per million tokens. A cache read costs $0.30 per million tokens. The break-even point is straightforward. You pay 1.25x on the first write and 0.10x on every subsequent read. If you write once and read once, your average cost across both calls is (1.25 + 0.10) / 2 = 0.675x the normal price. You saved 32.5% over two calls. By the third call, your average is (1.25 + 0.10 + 0.10) / 3 = 0.483x. By ten calls it drops to (1.25 + 9 * 0.10) / 10 = 0.215x, meaning you are paying about 21 cents per dollar of uncached cost.
Cache lifetime on Anthropic is 5 minutes by default, with the option to extend to 1 hour if you explicitly request it. This matters for the write cost. A 5-minute cache write costs 1.25x base input. If you need a 1-hour cache, the write cost is 2x base input, meaning you need at least 2 cache reads within the hour to break even.
A concrete worked example: you have a 2,000-token system prompt and receive queries with a 1,000-token user message that generate a 500-token output. You send 100 of these per day using Claude Sonnet 4.6. Without caching, your daily cost is: (3,000 tokens * 100 calls / 1,000,000) * $3.00 input + (500 tokens * 100 calls / 1,000,000) * $15.00 output = $0.90 + $0.075 = $0.975 per day.
With caching enabled on the system prompt, the first call is a cache write, and the remaining 99 calls are cache reads on the 2,000-token system prompt, with the 1,000-token user query billed at full input rate each time. The daily cost becomes: one full-price call at $0.009, plus 99 calls where only the 1,000-token query is billed at full rate ($0.297) and the 2,000-token system prompt is billed at cache read rate ($0.0059), plus all 100 output calls ($0.075). Total: approximately $0.387 per day. That is a 60% reduction in cost for a simple change to how you structure your API calls.
How OpenAI Prompt Caching Works
OpenAI's implementation is meaningfully different. It is automatic for prompts above 1,024 tokens. You do not need to set breakpoints or explicitly opt in. OpenAI's servers detect repeated prompt prefixes and cache them server-side, then automatically apply the discount to matching tokens. The discount is 50% off input token prices for cached tokens.
For GPT-4o-mini: normal input is $0.15 per million tokens, cached input is $0.075 per million tokens. There is no write surcharge. If your prompt qualifies for caching, you simply pay less on the next matching call. The simplicity is appealing, but the 50% discount is much less aggressive than Anthropic's 90% discount on reads.
OpenAI's cache also requires that the matching portion starts from the beginning of the prompt. You cannot cache an arbitrary middle section. The prefix must match exactly, including tokenization. This means even small changes to the beginning of your prompt will invalidate the cache. It also means that if you prepend per-user data to a shared system prompt, the shared portion will not cache because it is not at the prefix position.
The practical implication is that OpenAI's caching rewards a specific prompt structure: stable, long system prompt at the top, followed by variable user content. If your prompts are structured this way and exceed 1,024 tokens, you likely get some caching benefit automatically without any code changes. If your prompt structure puts variable content first, you get nothing.
When Caching Helps and When It Does Not
Caching delivers meaningful savings under a specific set of conditions. The prompt must have a stable, repeated prefix. That prefix must be long enough to make the savings meaningful, at least several hundred tokens for OpenAI (to exceed the 1,024-token threshold) and worth the write surcharge for Anthropic. And the same cached content must be sent many times, because the benefit compounds with repetition.
The use cases where caching consistently delivers are: applications with long, static system prompts that describe persona, format rules, or domain context; RAG pipelines that retrieve a fixed set of reference documents and include them in every call; few-shot examples embedded in the system prompt; and chat applications that send the full conversation history with each turn.
Caching does not help when every prompt is unique, when your prompts are short (below provider thresholds), when variable content appears before the stable content, or when call volume is low enough that the write cost is not amortized across many reads.
One case that is easy to overlook: even if your system prompt is static, if your application prepends per-session or per-user data before it, the cache never hits. The fix is to restructure your prompt so the stable content leads and the variable content follows. This is a purely structural change, but it can dramatically affect whether you capture caching benefits.
Interaction With Batch APIs
Both OpenAI and Anthropic offer Batch APIs that give 50% off all models. Prompt caching and the Batch API are not mutually exclusive. If you are running batch workloads where the same prompt prefix appears across many requests, you can benefit from both discounts simultaneously. A cached read on Claude Sonnet 4.6 via the Anthropic Batch API would cost 0.10x base input * 50% batch discount, effectively 0.05x base input price per cached token. At that rate, a 2,000-token system prompt costs $0.00030 per call versus $0.006 uncached and non-batched. The Batch API routing decision covers when batching alone is worth the latency tradeoff.
PromptUnit's Caching Layer
PromptUnit operates a semantic cache layer that works alongside provider-side prompt caching, not instead of it. Provider-side caching reduces the input token cost for repeated prefixes. PromptUnit's semantic cache goes further by storing complete responses for queries that are semantically equivalent, eliminating the token cost entirely for near-duplicate requests. The two layers are complementary: provider caching reduces cost when the prefix matches but the full query varies, while PromptUnit's response cache handles the case where two different phrasings ask the same question and the answer is identical.
For teams that want to track exactly how much each caching layer is saving, PromptUnit logs both provider-side cache hits (visible in API response metadata) and application-level cache hits separately, so you can see the contribution of each.
The numbers on prompt caching are large enough to warrant a deliberate audit of your current prompt structure. If you are running more than 10,000 LLM calls per day with any stable system prompt content, you are almost certainly leaving money on the table. The full guide to reducing OpenAI API costs covers caching alongside routing and compression as the three levers with the highest leverage. Start by looking at your top 5 most-called endpoints and checking what percentage of the tokens in each call are static. If that number is above 30%, caching should be your first optimization.
Start tracking your prompt cache hit rates in production at PromptUnit.