Six Techniques for Sending Fewer LLM Tokens Without Losing Quality
Prompt compression, context trimming, few-shot reduction, output length control. These techniques reduced one production prompt from 4,500 to 2,100 tokens. Here is how to apply them systematically.
A typical unoptimized production prompt for a document summarization task runs to 4,500 tokens: roughly 1,200 in the system prompt, 2,800 of document context, and 500 tokens in the user query. After applying the techniques in this article, the same prompt compresses to 2,100 tokens while producing equivalent output quality. That is a 53% reduction in input tokens on every call. At Claude Sonnet 4.6 pricing of $3.00 per million input tokens, that saves $7.20 per 10,000 calls. At 100,000 calls per month, it saves $720 per month from one prompt type. Most production systems have dozens of prompt types.
The direct approach to cutting your LLM bill is to send fewer tokens. This is obvious in principle but almost no team does it systematically. The reason is that optimization requires measurement, and most teams ship prompts without ever auditing the token count or testing whether every section of the prompt is earning its cost.
System Prompt Compression
The system prompt is the highest-leverage target because it is sent on every call. A 1,000-token system prompt at Sonnet 4.6 pricing costs $3.00 per million calls, or $3.00 for every 1,000,000 calls. That seems small until you are at 10,000 calls per day: the system prompt alone costs $30 per day, or $900 per month.
Most system prompts have significant redundancy. Common patterns include multiple sentences that say the same thing in slightly different ways, structural preambles that do not affect model behavior ("You are a helpful assistant that..."), and instruction lists that could be combined. Run your system prompt through a deliberate compression pass: identify every instruction and ask whether it changes what the model does. If two instructions address the same behavior, merge them. Remove filler phrases that do not constrain model outputs.
A 1,200-token system prompt can often be compressed to 400 to 600 tokens with no measurable quality change. That is a 50 to 67% reduction on the most frequently sent component of your prompt. Test the compressed version on your evaluation set before deploying it to production, but in practice the main risk is removing an instruction that matters more than it appears to. Run 200 representative samples through both versions and compare outputs.
Context Trimming for Conversation History
Conversation history grows with every turn. After five turns, a chat application might be sending 4,000 tokens of conversation context on every request. After ten turns, that number doubles. The model rarely needs the full history to handle the current turn. It needs the recent context and the key facts established earlier in the conversation.
Three approaches work for different use cases. The simplest is a rolling window: keep only the last N turns in context. For most conversational applications, the last three to five turns contain everything the model needs to respond coherently. Turns older than that can be dropped. This is not always appropriate, specifically it fails when the user references something from early in the conversation, but for many applications it works well.
A more robust approach is turn summarization: rather than dropping old turns, compress them into a compact summary. "The user asked about pricing for the enterprise plan. We confirmed it is $200/month for up to 50 seats." That is 20 tokens representing what might have been 400 tokens of back-and-forth. Summarization adds a small cost for the compression step but saves more on every subsequent call in the session.
The third approach, retrieval over conversation history, is higher engineering effort but powerful for long sessions. Rather than maintaining a linear conversation context, store turns in a vector database and retrieve the most relevant prior exchanges based on the current query. This keeps input tokens bounded regardless of conversation length. For production systems with long-running sessions, this is often the right architecture. For more on the cost implications of retrieval-based approaches, see the RAG pipeline hidden costs post.
Few-Shot Example Reduction
Few-shot examples are expensive tokens. A single well-constructed example might be 200 to 400 tokens. Five examples cost 1,000 to 2,000 tokens, sent on every call. The question is whether five examples meaningfully outperforms two examples for your specific task.
Test this empirically. Run your evaluation set with zero-shot, one-shot, two-shot, and five-shot prompts. Plot the quality metric against token count. For most tasks, the quality curve flattens quickly. Two-shot often achieves 80 to 90% of the quality improvement that five-shot provides, at 40% of the token cost. One-shot frequently captures half the benefit at 20% of the cost.
If you are using five examples today because "more examples must be better," you are likely paying for three unnecessary examples on every call. The empirical test takes an afternoon. The savings persist indefinitely.
Output Length Control
Output tokens are more expensive than input tokens on most providers. Sonnet 4.6 charges $15.00 per million output tokens versus $3.00 for input, a 5:1 ratio. Reducing output length has a proportionally larger impact on total cost than reducing input length.
LLMs will produce longer outputs than necessary by default. They add hedging language, structural padding, and explanatory text that the calling application never uses. Explicit output constraints in the prompt address this directly. "Respond in two to three sentences" for short answers. "Return only the JSON object, no explanation" for data extraction. "Use at most 100 words" for summaries. These instructions directly reduce output token counts without affecting the utility of the response.
Test with and without output constraints on your evaluation set. For structured tasks like data extraction and classification, constrained outputs are often more accurate, not just shorter, because the model spends less time generating preamble and gets directly to the structured content.
Removing Unnecessary Formatting Instructions
Instructions like "use markdown headers, bullet points, and bold text for key terms" add formatting tokens to every model output. A response that would be 150 tokens as plain prose might be 220 tokens with markdown formatting applied. For applications that render plain text, or that strip HTML before displaying responses, those formatting tokens are pure waste.
Audit your prompts for formatting instructions. If your application does not render markdown, remove markdown instructions. If your application only uses the text content of the response and discards any structure, do not ask for structure. For API responses that feed into downstream processing pipelines, plain prose or JSON is almost always more compact than formatted prose.
This is also relevant for system prompts that ask models to structure their reasoning: "think step by step and show your work" style instructions produce more tokens than direct instruction prompts. Chain-of-thought prompting is valuable when you need the reasoning trace, but if you only care about the final answer, routing to a model that handles the reasoning internally and returns a direct answer is cheaper.
Structured Output Modes
OpenAI's JSON mode, Anthropic's tool-use structured outputs, and similar features from Google produce denser, shorter outputs than conversational responses for data extraction tasks. When you ask a model to extract five fields from a document and return them in a conversational format, it typically produces something like: "Based on the document, the contract date is January 15, 2026. The parties involved are..." That framing adds 30 to 50 tokens of overhead per field.
Structured output mode for the same task returns: {"contract_date": "2026-01-15", "parties": [...]}. The content is identical but the wrapper is minimal. For high-volume extraction pipelines, the difference compounds. Ten fields per extraction call with 40 tokens of conversational overhead per field means 400 tokens of waste per call. At Sonnet 4.6 output pricing, that is $6 per 1,000 calls in avoidable costs.
What Combined Optimization Looks Like
The summarization prompt example from the opening gives a sense of combined impact. Starting position: 1,200-token system prompt, 2,800-token document context, 500-token user query, totaling 4,500 input tokens. After applying these techniques: the system prompt compresses from 1,200 to 400 tokens. The document context is trimmed using retrieval to the most relevant sections, reducing it from 2,800 to 1,400 tokens. The user query is tightened from 500 to 300 tokens. Total: 2,100 tokens. A 53% reduction.
For the output side, adding "return a three-paragraph summary, no headers or formatting" reduces average output length from 350 tokens to 180 tokens. Combined input and output savings at Sonnet 4.6 pricing: (4,500 minus 2,100) times $3.00 per million plus (350 minus 180) times $15.00 per million, which equals $7.20 plus $2.55 equals $9.75 savings per 1,000 calls. At 100,000 calls per month, that is nearly $1,000 in monthly savings from one optimized prompt.
The reduce OpenAI API costs guide covers the broader cost-reduction toolkit including model selection and caching, which stack on top of these token reduction techniques.
PromptUnit's compression layer automates techniques one and two, applying prompt compression and context trimming before requests reach the provider, which gives teams the token savings without having to manually rework every prompt.
If your token counts have grown without a corresponding increase in output quality, PromptUnit can help you measure where the waste is and apply compression systematically.