Gemini 2.5 Flash Holds 1M Tokens at $0.30 Input. The Long-Context Routing Decision Most Teams Skip.
Gemini 2.5 Flash supports a 1M token context window at $0.30 input and $2.50 output per million tokens. The right routing decision is not 'always RAG' or 'always long-context.' It depends on three questions most teams never ask.
Gemini 2.5 Flash supports a 1M token context window at $0.30 per million input tokens and $2.50 per million output. That means you can stuff roughly 750,000 words of context into a single API call for $0.30. A whole codebase, a stack of legal contracts, a year of customer support transcripts, all in one prompt.
The temptation is obvious: skip the RAG pipeline entirely. No embedding model, no vector database, no chunk retrieval, no reranking. Just feed the whole corpus and let the model figure it out.
The temptation is also wrong, most of the time. Long-context routing is not a replacement for RAG. It is a different routing target with a different cost-quality-latency profile, and the decision of which workloads belong on which target is more nuanced than "Gemini is cheap, use it for everything."
This post is about the three questions that determine whether long-context Gemini, RAG-augmented mid-tier models, or some hybrid is the right route for a given workload.
Question one: how many distinct queries hit the same context?
This is the dominant cost driver. If you load 500K tokens of context into a Gemini 2.5 Flash call to answer one question, you pay $0.15 for the input plus a small amount for the output. If you load that same 500K tokens 100 times to answer 100 different questions about the same corpus, you pay $15.
A RAG pipeline that retrieves and feeds only the relevant 5K tokens per query, even with embedding generation costs, sits closer to $0.50 across all 100 queries. Once the fixed cost of running the retrieval pipeline is amortized, the break-even is sharp: if your corpus is queried more than roughly 10 times per "load," RAG wins on cost. If it is queried 1-3 times, long-context wins and the pipeline never pays for itself.
The implication for routing: workloads where one user uploads a document and asks a small number of questions about it (under 10) belong on long-context. Workloads where many users query a stable shared corpus repeatedly belong on RAG.
This is also where prompt caching changes the math. If you can cache the 500K-token context portion of the prompt, the input rate on subsequent calls drops to one-tenth of standard ($0.03 per million cached tokens, or about $0.015 per query for the 500K-token prefix) via Gemini's context caching API, which works both explicitly and via automatic implicit caching on 2.5 models. For a stable shared corpus, cached long-context can compete with RAG even at high query volumes; this is the workload where the caching tier earns its keep most dramatically.
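The break-even math above can be sketched as a back-of-envelope cost model. The prices and token counts are the illustrative figures used in this post (including the assumed one-tenth cached rate), not authoritative quotes:

```python
# Back-of-envelope input-cost model for the long-context vs RAG decision.
# All figures are the illustrative assumptions from this post.

INPUT_PRICE_PER_M = 0.30   # Gemini 2.5 Flash input, $ per 1M tokens
CACHED_PRICE_PER_M = 0.03  # assumed cached-input rate (one-tenth of standard)
CONTEXT_TOKENS = 500_000   # full corpus loaded per long-context call
RAG_TOKENS = 5_000         # retrieved chunk budget per RAG call
RAG_OVERHEAD = 0.0001      # assumed embedding/retrieval cost per query

def long_context_cost(queries: int, cached: bool = False) -> float:
    """Input cost of answering `queries` questions with the full corpus in context."""
    first = CONTEXT_TOKENS * INPUT_PRICE_PER_M / 1e6
    rate = CACHED_PRICE_PER_M if cached else INPUT_PRICE_PER_M
    rest = (queries - 1) * CONTEXT_TOKENS * rate / 1e6
    return first + rest

def rag_cost(queries: int) -> float:
    """Input cost of answering the same questions via retrieval."""
    return queries * (RAG_TOKENS * INPUT_PRICE_PER_M / 1e6 + RAG_OVERHEAD)

for q in (1, 3, 10, 100):
    print(q, round(long_context_cost(q), 4),
          round(long_context_cost(q, cached=True), 4),
          round(rag_cost(q), 4))
```

At 100 queries the uncached long-context column lands on the $15 from the example above, while RAG stays in the tens of cents; caching narrows the gap but does not close it for shared-corpus workloads.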
Question two: does the answer require synthesis across the whole corpus?
This is the quality dimension that RAG cannot match.
Consider three queries against a 500K-token corpus of customer support tickets:
- "What is the most common reason customers cancel?"
- "Give me a summary of ticket #4827."
- "How has the volume of refund requests changed over the last six months?"
Query 1 requires aggregating signal across the entire corpus. RAG retrieval will fetch a handful of tickets that mention cancellation and miss the implicit pattern across thousands of tickets that did not use the word "cancel." Long-context handles this; RAG mangles it.
Query 2 needs only the one specific ticket. RAG retrieves it for a fraction of a cent (about $0.0001 in embedding cost plus a small generation call); long-context loads the whole 500K-token corpus at $0.15, roughly 100x the cost, for the same answer.
Query 3 needs a temporal aggregation across many tickets, similar to query 1. Long-context wins again.
The routing pattern: if the query asks about a specific identified item, RAG. If the query asks about the corpus as a whole, long-context. Most production Q&A endpoints handle both kinds of queries; the routing layer should classify per-query and dispatch accordingly.
This is a harder routing decision than the cost-only version, because it requires a query-classification step. The classification can run on a small model (Llama 3.1 8B on Groq, GPT-4o-mini, Gemini Flash-Lite) at minimal cost. The classification accuracy needed is roughly 85-90%; misclassifications fall back to the safer (slightly more expensive) long-context route.
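A minimal sketch of that classification step, assuming a `call_small_model` function standing in for whatever small-model endpoint you use (the prompt wording and function name are illustrative):

```python
# Per-query classification for the lookup-vs-synthesis routing decision.
# `call_small_model` is a hypothetical stand-in for your small-model endpoint
# (Llama 3.1 8B on Groq, GPT-4o-mini, Gemini Flash-Lite, ...).

CLASSIFY_PROMPT = """Classify the user query into exactly one label:
- LOOKUP: asks about one specific, identified item (a ticket number, a named document)
- SYNTHESIS: asks about the corpus as a whole (trends, aggregates, comparisons)
Reply with the label only.

Query: {query}"""

def classify_query(query: str, call_small_model) -> str:
    label = call_small_model(CLASSIFY_PROMPT.format(query=query)).strip().upper()
    # Anything the classifier mangles falls back to the safer
    # (slightly more expensive) long-context route.
    return label if label in {"LOOKUP", "SYNTHESIS"} else "SYNTHESIS"
```

The fallback direction is the important design choice: an ambiguous or garbled label costs a little extra money on the long-context route rather than a wrong answer on the RAG route.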
Question three: what is the latency budget?
Long-context inference is slow. A 500K-token Gemini Flash call has a time-to-first-token in the multi-second range, even with Google's optimized infrastructure. A RAG-retrieved 5K-token call returns first tokens in well under a second.
For batch workloads (overnight summarization, scheduled report generation), the latency does not matter. Long-context wins on simplicity. For user-facing chat, the latency is a product killer. RAG wins almost regardless of the cost math.
There is a middle category: workloads where the user uploaded a document and is willing to wait 5-10 seconds for the first answer. Long-context is acceptable here because the user has explicitly opted into a "process this big thing" interaction. After the first answer, follow-up questions can be served from a cached prefix at much lower latency, which makes long-context with caching the right choice for "chat with this document" workloads.
The three patterns that actually win
After working through hundreds of routing decisions for long-context vs RAG, three patterns capture the majority of production wins:
Pattern 1: Document-scoped chat (long-context with caching).
User uploads a document. Asks the first question. The system loads the full document into Gemini Flash, caches the prefix, and answers. Subsequent questions on the same document hit the cache. Cost-per-conversation is dominated by the first call; subsequent calls are nearly free. Latency is acceptable because the user is already in "explore this document" mode.
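The flow reduces to a thin wrapper. `create_context_cache` and `generate_with_cache` below are hypothetical stand-ins for your provider's caching API (for Gemini, consult the context caching SDK docs for the real names and signatures):

```python
# Sketch of document-scoped chat with prefix caching. The two injected
# callables are hypothetical stand-ins for the provider's caching API.

class DocumentChat:
    """Pay full input price once per document, then serve follow-up
    questions from the cached prefix."""

    def __init__(self, document_text: str, create_context_cache):
        # First call pays the full input rate to build the cached prefix.
        self.cache_id = create_context_cache(document_text)

    def ask(self, question: str, generate_with_cache):
        # Follow-ups reference the cached prefix at the reduced rate.
        return generate_with_cache(self.cache_id, question)
```

The structure makes the cost shape explicit: everything expensive happens in `__init__`, which runs once per uploaded document, not once per question.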
Pattern 2: Corpus-wide analytics (long-context, batch).
Aggregations across a large corpus that need synthesis. Run as batch workloads when the corpus updates. Output gets written to a database or pre-computed cache. User-facing queries hit the pre-computed result, not the model. Long-context wins here because synthesis quality matters more than per-query cost.
Pattern 3: User-facing Q&A on a shared corpus (RAG).
Customer support, internal knowledge base, e-commerce search. Many users, many queries, stable corpus. RAG wins on cost, latency, and the "many small queries" pattern long-context cannot beat.
What does not work: the "load the whole world into context" pattern people are tempted into when they see the 1M-token number. Long-context is a routing target for specific workload shapes, not a replacement for retrieval.
What this looks like in a routing layer
The routing decision for a Q&A endpoint is two layers deep:
- Classify the query: specific-item lookup, corpus-wide synthesis, or user-uploaded-document scope?
- Route accordingly: RAG for lookups, long-context for synthesis, long-context-with-caching for documents.
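Wired together, the two layers reduce to a few lines. The classifier and handlers here are hypothetical stand-ins for your actual RAG pipeline and long-context calls:

```python
# Two-layer routing: classify the query, then dispatch to the matching route.
# `classify` and the entries of `handlers` are stand-ins for real pipelines.

def dispatch(query: str, classify, handlers: dict):
    """classify(query) returns 'lookup', 'synthesis', or 'document';
    handlers maps each label to a callable. Unknown labels fall back
    to the synthesis (long-context) route, the safer default."""
    label = classify(query)
    handler = handlers.get(label, handlers["synthesis"])
    return handler(query)
```

Usage is just injecting the three routes: a RAG callable for lookups, a full-context callable for synthesis, and a cached-context callable for user-uploaded documents.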
If your stack has only one route ("send everything to Gemini Flash with full corpus" or "send everything through RAG"), you are overpaying on cost or underdelivering on quality somewhere on every query. The two-layer routing pattern is what we walked through in our complete LLM model routing guide; long-context is the routing target the original guide undersold, because the per-token economics shifted only after Gemini 2.5 Flash dropped its pricing to $0.30 input.
The cost ceiling on getting this wrong
For a B2B SaaS team running a typical knowledge-base Q&A endpoint with 1M monthly queries against a 500K-token corpus:
- All-long-context (no routing classification): 1M queries × 500K tokens × $0.30/M = $150,000/month
- All-long-context with aggressive caching: roughly $20,000-$30,000/month, depending on hit rate
- All-RAG: roughly $500-$2,000/month plus embedding/vector-db infrastructure
- Two-layer routed (most queries to RAG, synthesis queries to long-context): $1,500-$4,000/month
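The first line of that table is straightforward to reproduce; the other lines depend on cache hit rates and retrieval costs that vary by stack:

```python
# Monthly input cost of routing every query through full-corpus long-context,
# using the workload figures from the table above.
queries_per_month = 1_000_000
corpus_tokens = 500_000
input_price_per_m = 0.30  # Gemini 2.5 Flash input, $ per 1M tokens

monthly = queries_per_month * corpus_tokens * input_price_per_m / 1e6
print(f"${monthly:,.0f}/month")
```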
The gap between "default long-context" and "routed properly" is two orders of magnitude. Few teams actually ship default long-context, because the bill alone catches their attention. The more common mistake is the inverse: routing everything through RAG and missing the synthesis-quality gains long-context provides on a small but high-value query subset.
How PromptUnit handles this
PromptUnit's routing layer treats long-context Gemini and RAG-augmented mid-tier models as different nodes in the routing graph. The query classifier runs on a small model in the inference path, classifies the query type, and dispatches. The quality fingerprint catches the case where the classifier picks RAG for a query that needed synthesis (the answer quality drops, the routing decision gets re-weighted on future similar queries). The dialect translation layer normalizes the request format so customer code stays the same regardless of which provider the routing decision lands on.
If you are running a Q&A workload and have not done the routing classification work, the savings from getting this right are typically 50-90% on that endpoint. Start the free observation period at promptunit.ai and see how your current Q&A traffic splits across the three patterns above.