How to Reduce OpenAI Embedding Costs at Scale

Tags: embeddings, cost-optimization, openai, vector-search

Embedding costs are the most overlooked line item in AI infrastructure. Teams scrutinize every completion model choice while their embedding pipeline quietly compounds into one of their largest API costs at scale. At 10 million embedding calls per day, even the cheapest available model adds up fast, and most teams are paying significantly more than they need to.

OpenAI offers two primary embedding models. The text-embedding-3-small model costs $0.02 per million tokens. The text-embedding-3-large model costs $0.13 per million tokens. That is a 6.5x price difference. The difference in retrieval quality is real, but for most applications it is much narrower than the price gap suggests, and very few teams have run the evaluation to know which side of that tradeoff they are on.

To make the scale concrete: a search product generating one million embedding calls per day at 100 tokens per document produces 100 million tokens per day. Using text-embedding-3-small, that is $2 per day, or $60 per month. Using text-embedding-3-large, that is $13 per day, or $390 per month. The difference is $330 per month and grows linearly with usage. At ten times the scale, the difference is $3,300 per month. At a hundred times, it is $33,000. None of this reflects any difference in your core product functionality. It is purely a choice about which embedding model to call, and most teams make that choice once at project start and never revisit it.
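The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the prices are the per-million-token figures quoted in this post, the 30-day month matches the numbers above, and the function name `monthly_cost` is just an illustrative choice.

```python
# Prices in USD per million tokens, as quoted in this post.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def monthly_cost(model: str, calls_per_day: int, tokens_per_call: int, days: int = 30) -> float:
    """Estimated monthly spend in USD for a steady embedding workload."""
    tokens_per_day = calls_per_day * tokens_per_call
    daily_cost = tokens_per_day / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]
    return round(daily_cost * days, 2)


# The example from the text: one million calls per day at 100 tokens each.
small = monthly_cost("text-embedding-3-small", 1_000_000, 100)
large = monthly_cost("text-embedding-3-large", 1_000_000, 100)
print(f"small: ${small}/month, large: ${large}/month, gap: ${round(large - small, 2)}/month")
```

Plugging in your own call volume and average token count gives a quick estimate of what the model choice is worth before you run any quality evaluation.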

Technique 1: Model Selection Based on Your Corpus

The first and highest-leverage decision is which model to use for your specific retrieval task. For general semantic search, text-embedding-3-small retrieves the correct document 85-95% of the time on standard benchmarks. The text-embedding-3-large model offers marginal improvement in most retrieval scenarios, with gains typically in the 1-5 point range on top-1 and top-3 retrieval accuracy. Whether that gap matters depends entirely on your application and user tolerance for misses.

The correct approach is to run a retrieval quality evaluation on your actual corpus with your actual queries before deciding. Build a test set of 200-500 query-document pairs where you know the correct match. Compute retrieval accuracy at k=1, k=3, and k=5 for both models. If the quality difference does not cross a threshold that matters for your use case, use the smaller model. Most teams that run this evaluation find that small is sufficient. The ones that find they genuinely need large tend to have specialized domain vocabulary, technical jargon, or multilingual content where the larger model's representation quality provides meaningful improvement. For a comparison of embedding options across providers, see embedding model routing and cost comparison.
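The evaluation described above can be sketched roughly as follows. The code assumes you have already embedded your queries and documents with the model under test and hold the vectors in plain dictionaries; `accuracy_at_k`, the dictionary layout, and the toy two-dimensional vectors are all illustrative assumptions, not part of any OpenAI API.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def accuracy_at_k(
    query_vecs: Dict[str, Sequence[float]],
    doc_vecs: Dict[str, Sequence[float]],
    expected: Dict[str, str],
    k: int,
) -> float:
    """Fraction of queries whose known-correct document ranks in the top k."""
    hits = 0
    for query_id, query_vec in query_vecs.items():
        # Rank every document by similarity to this query, best first.
        ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
        if expected[query_id] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy stand-in data; a real test set would hold 200-500 labeled pairs.
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
query_vecs = {"q1": [0.9, 0.1], "q2": [0.6, 0.8]}
expected = {"q1": "d1", "q2": "d2"}

for k in (1, 3):
    print(k, accuracy_at_k(query_vecs, doc_vecs, expected, k))
```

Running the same function over vectors from each candidate model gives directly comparable numbers at k=1, k=3, and k=5. The brute-force ranking is fine for a few hundred queries; it is not meant as a production search path.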

Technique 2: Caching

Embeddings are effectively deterministic. The same input text with the same model produces the same vector, up to negligible floating-point variation. This is a caching opportunity that is frequently ignored. If your product allows users to search a shared corpus of documents, many of those documents are being embedded repeatedly when they are ingested, re-indexed, or processed through pipelines that were not designed to check for existing embeddings first.

Application-level caching with a tool like Redis stores the embedding vector keyed by the hash of the input text and model name. On each embedding request, check the cache first. If the vector exists, return it. If not, call the API and store the result with an appropriate TTL. For products where users search common terms or documents are shared across users, cache hit rates of 30-50% are realistic. That translates to 30-50% of your embedding cost disappearing without any change to the quality of your search results.

User query caching is particularly effective because search queries follow a power law distribution. A small number of queries, usually the most common or obvious ones, account for a large fraction of total search volume. Caching the embeddings for these queries eliminates API calls for your most frequent traffic.
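A minimal sketch of the check-then-call pattern described above. To keep it runnable without a network or a Redis server, the cache is a plain dictionary and the API call is an injected function; `CachedEmbedder`, `cache_key`, and `embed_fn` are hypothetical names. In a real deployment the dictionary would be replaced by a Redis client and each write would carry a TTL.

```python
import hashlib
from typing import Callable, Dict, List, Optional


def cache_key(model: str, text: str) -> str:
    """Key on both model and text: vectors from different models are not interchangeable."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:{model}:{digest}"


class CachedEmbedder:
    """Wraps an embedding function with a look-aside cache."""

    def __init__(
        self,
        embed_fn: Callable[[str, str], List[float]],
        store: Optional[Dict[str, List[float]]] = None,
    ) -> None:
        self.embed_fn = embed_fn  # assumed signature: (model, text) -> vector
        self.store = {} if store is None else store  # stand-in for Redis
        self.hits = 0
        self.misses = 0

    def embed(self, model: str, text: str) -> List[float]:
        key = cache_key(model, text)
        cached = self.store.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        # Cache miss: pay for the API call once, then remember the result.
        self.misses += 1
        vector = self.embed_fn(model, text)
        self.store[key] = vector
        return vector
```

Tracking `hits` and `misses` makes it easy to measure your actual hit rate rather than assuming the 30-50% range applies to your traffic.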

Technique 3: Batching

OpenAI's embeddings endpoint accepts an array of input texts in a single request. Sending 100 texts in one batched call costs exactly the same in tokens as 100 individual calls, but it reduces network round trips and lowers request-rate limit pressure.

More importantly, batching creates further opportunities. Once you are processing inputs in batches, it becomes straightforward to route eligible batches to OpenAI's Batch API, a separate asynchronous interface that offers 50% off the standard per-token rate in exchange for delayed results. If your embedding pipeline is asynchronous, runs during off-peak hours, or generates embeddings for documents that do not need to be searchable immediately, the Batch API cuts embedding costs in half. The decision logic for choosing between the Batch API and synchronous calls applies equally to embedding workloads.
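As a rough illustration, the sketch below groups texts into fixed-size batches and serializes each batch as one request line for a Batch API input file. The line shape (`custom_id`, `method`, `url`, `body`) reflects my understanding of the Batch API's JSONL input format and should be checked against the current documentation; the helper names and the `embed-batch-N` ID scheme are assumptions made for the example.

```python
import json
from typing import Iterator, List


def chunked(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def batch_api_lines(texts: List[str], model: str, batch_size: int) -> List[str]:
    """Build one JSONL request line per batch of input texts."""
    lines = []
    for index, batch in enumerate(chunked(texts, batch_size)):
        request = {
            "custom_id": f"embed-batch-{index}",  # hypothetical ID scheme
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": batch},
        }
        lines.append(json.dumps(request))
    return lines


documents = [f"document {i}" for i in range(250)]
jsonl_lines = batch_api_lines(documents, "text-embedding-3-small", batch_size=100)
print(len(jsonl_lines), "request lines")
```

The same `chunked` helper works for synchronous calls: pass each batch as the `input` array of a single embeddings request instead of writing it to a file.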

Technique 4: Dimensionality Reduction

OpenAI's text-embedding-3 models support a "dimensions" parameter that lets you specify the output vector size. The text-embedding-3-small model defaults to 1,536 dimensions but can be reduced, for example to 512 or 256. The text-embedding-3-large model defaults to 3,072 dimensions and supports similar reduction. The reduced vectors are derived from the full embedding through a technique called Matryoshka representation learning, which trains the model such that the first N dimensions of the full vector are already a useful representation.
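Because the leading dimensions already form a usable representation, a stored full-size vector can also be shortened on the client side by keeping its first N values and re-normalizing to unit length. The sketch below shows that operation; `shorten` is a hypothetical helper, and passing the `dimensions` parameter on the request is the simpler route when you are generating new embeddings anyway.

```python
import math
from typing import List, Sequence


def shorten(vector: Sequence[float], dimensions: int) -> List[float]:
    """Keep the first `dimensions` values and rescale to unit L2 norm."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = list(vector[:dimensions])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]
```

Re-normalizing matters because cosine and inner-product search both assume comparable vector lengths, and a truncated vector no longer has unit norm.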

Smaller vectors reduce two categories of cost that are distinct from the embedding API call itself. Storage cost for vector databases scales with the size of each vector times the number of stored embeddings. At 10 million documents, going from 1,536 to 256 dimensions reduces vector storage by 83%. Similarity search latency also decreases, because the inner product computation across your entire indexed corpus is proportionally faster with smaller vectors.
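The storage figure can be checked with a short calculation. This sketch assumes vectors are stored as 32-bit floats (4 bytes per value) and ignores index and metadata overhead, which vary by vector database.

```python
def vector_storage_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw vector payload size, assuming float32 values and no index overhead."""
    return num_vectors * dimensions * bytes_per_value


full = vector_storage_bytes(10_000_000, 1536)
reduced = vector_storage_bytes(10_000_000, 256)
saving = 1 - reduced / full

print(f"full: {full / 1e9:.2f} GB, reduced: {reduced / 1e9:.2f} GB, saving: {saving:.0%}")
```

Under these assumptions, 10 million vectors drop from about 61 GB to about 10 GB of raw vector data.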

The quality tradeoff for dimensionality reduction is modest for most general retrieval tasks at moderate reduction levels. Reducing from 1,536 to 512 dimensions typically costs 1-3 points on retrieval accuracy. Reducing to 256 dimensions costs a bit more. Whether that is acceptable depends on your quality threshold, but for many applications it is.

Technique 5: Embed Once, Version, and Reuse

The most avoidable embedding cost is re-embedding documents you have already embedded. This happens more than it should, usually because embedding generation is part of a document processing pipeline that runs on a schedule or on every pipeline trigger, without checking whether the document has changed since it was last embedded.

Version your embeddings. Store the content hash of each document alongside its embedding vector. When your pipeline processes a document, compute its current content hash and compare it to the stored hash. If they match, skip the embedding call and use the cached vector. Only re-embed documents that have actually changed content. For most document corpora, the fraction of documents changing on any given pipeline run is small, which means most embedding calls in scheduled pipelines are redundant.

The distinction matters most when switching embedding models. If you upgrade from text-embedding-3-small to text-embedding-3-large, you need to re-embed your entire corpus, because vectors from different models are not comparable. Storing the model name next to the content hash makes that case explicit. But if you are simply re-running your ingestion pipeline on a corpus that has not changed, you should not be paying for any embedding at all.
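The hash-and-compare step can be sketched as follows. The index is a plain dictionary standing in for whatever store holds your vectors, and `StoredEmbedding`, `sync_embeddings`, and `embed_fn` are hypothetical names for illustration. The record keeps the model name next to the content hash, so a model switch forces a re-embed while an unchanged corpus costs nothing.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StoredEmbedding:
    content_hash: str
    model: str
    vector: List[float]


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_embeddings(
    documents: Dict[str, str],
    index: Dict[str, StoredEmbedding],
    model: str,
    embed_fn: Callable[[str, str], List[float]],
) -> List[str]:
    """Embed only new or changed documents; return the IDs that were embedded."""
    embedded = []
    for doc_id, text in documents.items():
        digest = content_hash(text)
        existing = index.get(doc_id)
        unchanged = (
            existing is not None
            and existing.content_hash == digest
            and existing.model == model
        )
        if unchanged:
            continue  # same content, same model: reuse the stored vector
        index[doc_id] = StoredEmbedding(digest, model, embed_fn(model, text))
        embedded.append(doc_id)
    return embedded
```

Logging the length of the returned list on each pipeline run shows how many embedding calls the check is saving.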

Combined Impact

These five techniques are not mutually exclusive. Applying model selection, caching, and the embed-once pattern together is the standard optimization path, and it commonly reduces embedding costs by 50-80% for a typical search product with little or no measurable change in retrieval quality, provided the evaluation from Technique 1 supports the smaller model. The total cost of implementing all five is measured in engineering days, not weeks, and the savings begin immediately on deployment.

The teams that treat embedding costs as set-it-and-forget-it tend to discover the problem when their infrastructure bills scale with user growth at a rate they did not model. The teams that run a quick evaluation, add caching, and batch their pipeline work tend to keep embedding costs flat as a fraction of revenue even as usage grows. For a full checklist of AI cost optimizations beyond embeddings, see our AI cost optimization checklist for 2026.

PromptUnit logs embedding API calls alongside completion calls, so teams can see total embedding spend per feature and apply caching and batching rules from the same routing configuration used for their LLM workloads.

Reduce your embedding infrastructure costs with automatic batching and caching support at www.promptunit.ai.
