The Hidden Costs in Your RAG Pipeline (And How to Cut Them)

At 100,000 LLM calls per day with Claude Sonnet 4.6, a naive RAG implementation that retrieves 10 chunks of 200 tokens each adds 200 million input tokens to your daily spend. At $3.00 per million tokens, that is $600 per day in retrieved context alone, or roughly $18,000 per month. That cost has nothing to do with the quality of your LLM responses. It is pure overhead from a retrieval strategy that pulls too much context and does not compress it before sending. Most teams building RAG systems do not see this because they account for LLM costs as a single line item, not broken down by where the tokens are coming from.
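The arithmetic behind that figure is worth keeping as a small function so you can plug in your own volumes. This is a minimal sketch; the function name and its parameters are illustrative, and the 30-day month is an assumption.

```python
def retrieved_context_cost(calls_per_day: int, chunks_per_call: int,
                           tokens_per_chunk: int,
                           usd_per_million_input: float) -> dict:
    """Estimate spend attributable to retrieved context only."""
    tokens_per_day = calls_per_day * chunks_per_call * tokens_per_chunk
    usd_per_day = tokens_per_day / 1_000_000 * usd_per_million_input
    return {
        "tokens_per_day": tokens_per_day,
        "usd_per_day": usd_per_day,
        "usd_per_month": usd_per_day * 30,  # assumes a 30-day month
    }


# The scenario from the paragraph above: 100,000 calls, 10 chunks of 200 tokens, $3.00/M.
cost = retrieved_context_cost(100_000, 10, 200, 3.00)
print(cost)
```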

RAG pipelines have four distinct cost centers. Treating them as one makes it nearly impossible to optimize any of them.

Cost Center 1: Embedding Calls

Embedding costs are often dismissed as negligible, and at small scale they are. At text-embedding-3-small pricing ($0.02 per million tokens), a one-time indexing job on 1 million chunks of 200 tokens each costs $4.00. That is genuinely cheap. The ongoing query cost at 10,000 users per day, each sending a 50-token query for embedding, is 500,000 tokens per day, or about $0.01 per day. Not worth optimizing.

The embedding cost problem surfaces in two scenarios. First, if you re-embed your document corpus frequently, because documents are updated often or because you are experimenting with different chunking strategies, the indexing cost can grow significantly. If you are re-embedding a 50 million token corpus weekly, that is $1.00 per run with a cheap embedding model, but it can be $100 per run or more if you are using a more capable (and more expensive) embedding model. Second, if you are using embeddings as part of a multi-stage retrieval pipeline that embeds intermediate results or reformulated queries, the per-query embedding cost multiplies by the number of stages.

The fix is simple: cache embeddings. For a fixed embedding model, document chunk embeddings are effectively deterministic. If the chunk content has not changed, the embedding has not changed. Storing embeddings in a vector database with a hash of the chunk content (plus the embedding model name) as the key means you only compute new embeddings when documents actually change. This is so obvious it feels redundant to say, but a surprising number of pipelines re-embed on every ingest without checking for existing embeddings.
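A content-hash cache can be sketched in a few lines. This is an illustration under assumptions, not a production design: EmbeddingCache, embed_fn, and fake_embed are hypothetical names, the in-memory dict stands in for your vector database, and embed_fn is whatever function calls your embedding provider.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class EmbeddingCache:
    """Only calls embed_fn for chunks whose content has not been seen before."""

    def __init__(self, embed_fn: Callable[[Sequence[str]], List[Vector]],
                 model_name: str):
        self._embed_fn = embed_fn      # hypothetical wrapper around your provider
        self._model_name = model_name  # part of the key: a new model means new vectors
        self._store: Dict[str, Vector] = {}
        self.misses = 0                # number of chunks actually sent for embedding

    def _key(self, text: str) -> str:
        payload = f"{self._model_name}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def embed(self, chunks: Sequence[str]) -> List[Vector]:
        keys = [self._key(chunk) for chunk in chunks]
        # Collect each unseen chunk once, even if it repeats within this batch.
        missing: Dict[str, str] = {}
        for key, chunk in zip(keys, chunks):
            if key not in self._store and key not in missing:
                missing[key] = chunk
        if missing:
            vectors = self._embed_fn(list(missing.values()))
            for key, vector in zip(missing.keys(), vectors):
                self._store[key] = vector
            self.misses += len(missing)
        return [self._store[key] for key in keys]


def fake_embed(texts: Sequence[str]) -> List[Vector]:
    """Stand-in for a real embedding call, used only to make the sketch runnable."""
    return [[float(len(text))] for text in texts]


cache = EmbeddingCache(fake_embed, model_name="demo-model")
cache.embed(["alpha", "beta"])
cache.embed(["alpha", "beta", "gamma"])  # only "gamma" is embedded this time
print(cache.misses)
```

Including the model name in the key is a deliberate choice: switching embedding models should invalidate the cache rather than silently mix vectors from two models.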

Cost Center 2: Context Window Bloat

This is the dominant cost driver in most production RAG systems, and it is the one most directly under your control.

The standard retrieval pattern fetches the top-k most semantically similar chunks to the user query. A common default is k=10, with chunks of 200-300 tokens each. That adds 2,000 to 3,000 tokens of retrieved content to every LLM call, on top of your system prompt, the user query, and any conversation history. At high call volume, this compounds fast.

Consider the math more precisely. If your system prompt is 500 tokens, your user query averages 100 tokens, and you retrieve 10 chunks at 250 tokens each (2,500 tokens), your total input per call is 3,100 tokens. If you cut retrieval from 10 chunks to 3 chunks (750 tokens), your total input drops to 1,350 tokens. That is a 56% reduction in input tokens per call, which directly translates to a 56% reduction in input token cost. On a 100,000 call per day workload with Claude Sonnet 4.6 at $3.00 per million input tokens, the 1,750 tokens saved per call add up to 175 million tokens per day, which is roughly $525 per day, or $15,750 per month, from changing a single integer in your retrieval configuration.
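The same comparison can be reproduced for your own prompt sizes with a short calculation. This is a sketch; the helper name input_tokens_per_call is illustrative, and the numbers are the ones from the paragraph above.

```python
def input_tokens_per_call(system_prompt: int, query: int,
                          k: int, tokens_per_chunk: int) -> int:
    """Total input tokens for one call: prompt + query + k retrieved chunks."""
    return system_prompt + query + k * tokens_per_chunk


CALLS_PER_DAY = 100_000
USD_PER_MILLION_INPUT = 3.00

before = input_tokens_per_call(500, 100, k=10, tokens_per_chunk=250)
after = input_tokens_per_call(500, 100, k=3, tokens_per_chunk=250)

reduction = (before - after) / before
daily_savings = (before - after) * CALLS_PER_DAY / 1_000_000 * USD_PER_MILLION_INPUT

print(f"tokens per call: {before} -> {after}")
print(f"reduction: {reduction:.0%}")
print(f"daily savings: ${daily_savings:,.2f}")
```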

The risk of reducing k is that you retrieve fewer relevant chunks and degrade answer quality. This is where reranking becomes relevant.

Cutting k is not the only lever, though. For some workloads the better fix is skipping retrieval altogether in favor of a long-context model with prompt caching. Gemini 2.5 Flash's 1M-token context window at $0.30 per million input tokens changes the RAG-versus-long-context tradeoff for document-scoped chat and corpus-wide synthesis queries, though it is not a universal replacement for retrieval on high-volume, many-query workloads.

Cost Center 3: Reranking

A reranker is a cross-encoder model that takes your retrieved candidates and scores them for relevance to the specific query, more accurately than embedding similarity alone can. Services like Cohere Rerank are purpose-built for this. Reranking 10 candidates to identify the top 3 costs a small amount per query, but it allows you to retrieve a larger initial candidate set with cheap vector similarity search (which is nearly free) and then aggressively filter to a small, high-relevance context window.
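The retrieve-wide-then-filter pattern looks roughly like the sketch below. The names rerank_top_n and overlap_score are illustrative, not part of any library, and overlap_score is a toy word-overlap scorer standing in for a real cross-encoder or hosted rerank service; in practice you would pass your reranker's scoring call as score_fn.

```python
from typing import Callable, List, Sequence


def rerank_top_n(query: str, candidates: Sequence[str],
                 score_fn: Callable[[str, str], float],
                 top_n: int = 3) -> List[str]:
    """Score every candidate against the query and keep the top_n best."""
    scored = [(score_fn(query, text), index, text)
              for index, text in enumerate(candidates)]
    # Highest score first; ties keep the original retrieval order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:top_n]]


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


candidates = [
    "billing invoices are monthly",
    "to reset your api key open settings",
    "api rate limits apply per key",
    "our office is in berlin",
]
context = rerank_top_n("reset api key", candidates, overlap_score, top_n=2)
print(context)
```

Only the chunks in context would be sent to the LLM; the other candidates cost you a vector search and a rerank score, not input tokens.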

The economics: reranking costs money, but sending 3 high-quality chunks instead of 10 medium-quality chunks to your LLM saves substantially more in input token costs. At scale with expensive models, the reranking cost is typically 10-20% of the token savings it enables.

There is a secondary quality benefit: smaller, more relevant context windows often produce better outputs. Models can focus on fewer, more relevant passages rather than synthesizing across 10 chunks of varying relevance. This means lower hallucination rates and better grounding, which reduces downstream costs from bad outputs requiring regeneration.

Cost Center 4: Multi-Round Retrieval

Some RAG architectures perform multiple retrieval and generation cycles per user query. A query decomposer breaks a complex question into sub-questions, each sub-question triggers a retrieval round, intermediate answers are synthesized, and a final generation pass produces the response. Each round is a full LLM call with a full context window of retrieved content. A 3-round pipeline costs roughly 3x what a 1-round pipeline costs at the same retrieval settings.

Multi-round RAG can significantly improve answer quality on complex, multi-faceted questions. The question is whether the quality improvement justifies the cost multiplier for your specific use case. A good benchmark is to run both pipeline configurations on a representative sample of real queries from your application, score the outputs on your quality criteria, and then calculate the cost-per-quality-point tradeoff. For many SaaS applications, the majority of queries are simple enough that single-round retrieval with a good reranker matches or exceeds multi-round quality at a fraction of the cost.

If you need multi-round retrieval for complex queries, route only those queries to the multi-round pipeline. Use a cheap classifier (GPT-4o-mini at $0.15 per million input tokens works well for query complexity scoring) to determine which pipeline to invoke. Simple queries go to the single-round path, complex queries go to multi-round. This is the same model routing logic described in the LLM model routing guide, applied to pipeline architecture rather than just model selection.
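The routing step itself is small. In this sketch, route_query, heuristic_complexity, and the two stub pipelines are all hypothetical names; heuristic_complexity is a keyword heuristic standing in for the cheap LLM classifier, and the 0.5 threshold is an arbitrary assumption you would tune on real queries.

```python
from typing import Callable


def route_query(query: str,
                complexity_fn: Callable[[str], float],
                single_round: Callable[[str], str],
                multi_round: Callable[[str], str],
                threshold: float = 0.5) -> str:
    """Send the query to the multi-round pipeline only if it scores as complex."""
    if complexity_fn(query) >= threshold:
        return multi_round(query)
    return single_round(query)


def heuristic_complexity(query: str) -> float:
    """Placeholder for a cheap classifier call; returns a score in [0, 1]."""
    q = f" {query.lower()} "
    signals = [
        " and " in q,            # multiple clauses
        " compare " in q,        # explicit comparison
        q.count("?") > 1,        # several questions in one message
        len(q.split()) > 25,     # long queries
    ]
    return sum(signals) / len(signals)


# Stub pipelines so the sketch runs; replace with your real RAG paths.
def single_round_pipeline(query: str) -> str:
    return f"single:{query}"


def multi_round_pipeline(query: str) -> str:
    return f"multi:{query}"


simple = route_query("What is the refund policy?", heuristic_complexity,
                     single_round_pipeline, multi_round_pipeline)
complex_ = route_query("Compare plan A and plan B pricing for a 50-seat team",
                       heuristic_complexity,
                       single_round_pipeline, multi_round_pipeline)
print(simple)
print(complex_)
```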

Cutting All Four Simultaneously

The compounding opportunity is real. A team that implements all four optimizations (embedding caching, a smaller k, reranking, and routing simple queries to a single-round path) can cut RAG infrastructure costs by 60 to 80% without meaningful quality degradation. The exact savings depend on your current configuration, model choices, and query distribution.

Context compression is another lever worth implementing once you have the other optimizations in place. Before inserting retrieved chunks into the LLM context, run them through a compression step that removes redundant sentences, extracts only the most query-relevant spans, and truncates to a maximum token budget. This is not summarization, which would lose information, but structural compression that removes tokens that are not load-bearing for the query at hand. PromptUnit applies context compression as part of its proxy layer, reducing the tokens sent to the LLM for retrieved context without requiring changes to your retrieval or application code.
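A rough approximation of such a compression step is sketched below. It is not PromptUnit's implementation: compress_context and approx_tokens are illustrative names, token counts are approximated by whitespace-separated words, and query relevance is approximated by word overlap rather than a learned scorer.

```python
import re
from typing import List, Sequence, Tuple


def approx_tokens(text: str) -> int:
    """Crude token estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def compress_context(query: str, chunks: Sequence[str], max_tokens: int) -> str:
    """Drop duplicate and query-irrelevant sentences, then enforce a token budget."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    seen = set()
    sentences: List[Tuple[int, str, int]] = []  # (position, sentence, overlap score)
    for chunk in chunks:
        for sentence in re.split(r"(?<=[.!?])\s+", chunk.strip()):
            normalized = " ".join(sentence.lower().split())
            if not normalized or normalized in seen:
                continue  # skip empty strings and verbatim repeats across chunks
            seen.add(normalized)
            terms = set(re.findall(r"\w+", normalized))
            sentences.append((len(sentences), sentence, len(terms & query_terms)))

    kept: List[Tuple[int, str]] = []
    used = 0
    # Consider the most query-relevant sentences first.
    for position, sentence, score in sorted(sentences, key=lambda s: (-s[2], s[0])):
        cost = approx_tokens(sentence)
        if score == 0 or used + cost > max_tokens:
            continue
        kept.append((position, sentence))
        used += cost
    # Restore original order so the context still reads naturally.
    return " ".join(sentence for _, sentence in sorted(kept))


chunks = [
    "Refunds are issued within 14 days. Our office is in Berlin.",
    "Refunds are issued within 14 days. Refund requests need an order number.",
]
print(compress_context("refund days order", chunks, max_tokens=20))
```

Measure answer quality before and after adding a step like this; an overlap heuristic can discard sentences that matter, which is why a held-out test set is worth having first.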

The interaction between RAG cost optimization and model choice is also worth considering carefully. If you are running RAG on Claude Sonnet 4.6 ($3/$15 per MTok) for tasks that do not require its full capability, routing to a cheaper model for simple retrievals can compound the savings from context compression. For a complete view of model selection tradeoffs, see the Anthropic Claude API pricing guide alongside the general LLM model routing guide.

Start with cost center 2. Measure your average retrieved token count per call today. Cut k from whatever it is to 3-5, add a reranker if you do not have one, and measure quality on a held-out test set. The context window reduction is almost certainly your fastest path to meaningful cost savings.

If you want to track per-call context token counts and see the distribution of your retrieved context sizes in production, PromptUnit gives you that visibility without changing your application code.
