LLM Cost Tracking: The Complete Guide
What metrics to track for LLM cost visibility, how to set up monitoring, how to catch cost spikes, and how to enforce budgets. Built for engineering teams in production.
LLM cost tracking is the practice of attributing AI inference spend to specific dimensions (features, users, models, and environments) so you can understand where money goes and act on it. Without tracking, cost optimization is guesswork.
Most teams start with a monthly invoice from OpenAI or Anthropic and no visibility beneath that number. The invoice tells you what you spent. It does not tell you which feature drove the spike last Tuesday, which model tier is responsible for 70% of your bill, or whether your costs are growing linearly with users or exponentially.
This guide covers what to track, how to set it up, and what to do when costs move unexpectedly.
The Core Metrics to Track
Cost per feature
The most actionable dimension for engineering teams. When you can see that Feature A costs $0.008 per call and Feature B costs $0.47 per call, you know where to focus optimization effort.
This requires tagging every LLM call with the feature that generated it. Implementation is straightforward with a logging middleware or a proxy layer.
Cost per user (or tenant)
For SaaS products, understanding per-user LLM cost is essential for unit economics. If your average customer generates $12/month in LLM costs but pays you $29/month, your AI margin is thin. If one customer is generating $200/month in costs, you have a problem the monthly invoice cannot surface.
User-level cost tracking also enables fair-use policies, per-tier rate limiting, and cost-based pricing tier decisions.
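A minimal sketch of the aggregation, assuming call-level events are already logged with user_id and cost_usd fields (as in the setup examples later in this guide):

from collections import defaultdict

def monthly_cost_per_user(events: list[dict]) -> dict[str, float]:
    # events: logged call records for the current billing month
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        totals[event["user_id"]] += event["cost_usd"]
    return dict(totals)

Flagging users whose cost exceeds some fraction of their plan price is then a one-line filter over the result.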
Model distribution
What percentage of your calls are going to each model tier? If 90% of calls hit GPT-4o, you have a routing opportunity. If you have recently activated routing, model distribution tells you whether it is working as intended.
Tracking model distribution over time also surfaces drift: as new features ship, they may default to an expensive model without going through a cost review.
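Over the same hypothetical event log, model distribution is a share-of-calls count per model:

from collections import Counter

def model_distribution(events: list[dict]) -> dict[str, float]:
    # Fraction of calls handled by each model over a window of events
    counts = Counter(event["model"] for event in events)
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()}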
Task type breakdown
Which task types are most common in your traffic? What is the average cost per task type? This is the input to routing decisions. Tasks that are both high-volume and low-complexity are the highest-value routing targets.
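To make "high-volume, low-complexity" concrete, here is an illustrative ranking that assumes each logged event also carries a task_type tag (a field you would add yourself; it is not in the logging examples below):

def routing_targets(events: list[dict]) -> list[tuple[str, float, int]]:
    # Rank task types by total spend (volume x average cost per call);
    # the top entries are the best candidates for cheaper routing
    spend: dict[str, float] = {}
    calls: dict[str, int] = {}
    for event in events:
        task = event["task_type"]
        spend[task] = spend.get(task, 0.0) + event["cost_usd"]
        calls[task] = calls.get(task, 0) + 1
    return sorted(((t, spend[t], calls[t]) for t in spend),
                  key=lambda row: row[1], reverse=True)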
Cost per environment
Separate tracking for production, staging, and development is basic hygiene. Unexpected cost spikes in development (a developer testing with a long context window, an automated test suite running more calls than expected) should not be invisible until the invoice arrives.
Latency by model
Cost and latency are related. Cheaper models are often faster. Tracking latency per model per task type gives you a complete picture for routing decisions: not just cost, but user-facing performance.
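Latency can be captured in the same events with a timer around the call; a sketch (log_event is the same logging hook used in the setup examples below):

import time
import openai

def call_llm_timed(messages: list, model: str = "gpt-4o"):
    start = time.monotonic()
    response = openai.chat.completions.create(model=model, messages=messages)
    latency_ms = (time.monotonic() - start) * 1000
    # Logged per model (and per task type, if tagged) alongside cost
    log_event({"model": model, "latency_ms": latency_ms})
    return response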
Setting Up Cost Monitoring
Option 1: Request-level logging with metadata tags
The fundamental approach: every LLM call is logged with metadata at the time it is made.
import openai
from datetime import datetime, timezone

# GPT-4o list pricing per 1M tokens; update when provider pricing changes
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00

def call_llm(feature: str, user_id: str, messages: list):
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    # Calculate cost from the usage block returned with every response
    input_tokens = response.usage.prompt_tokens
    output_tokens = response.usage.completion_tokens
    cost = (input_tokens / 1_000_000 * INPUT_PRICE_PER_M
            + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M)
    log_event({
        "feature": feature,
        "user_id": user_id,
        "model": response.model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": cost,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return response
This approach works, but requires discipline across the codebase. Every call site must pass the feature and user metadata. It is easy to miss.
Option 2: Middleware layer
A better approach is to enforce metadata in a shared layer. A middleware that wraps all LLM calls ensures no call goes out unattributed:
class LLMMiddleware:
    def __init__(self, feature: str, user_id: str | None = None):
        self.feature = feature
        self.user_id = user_id

    def chat(self, messages: list, **kwargs):
        response = openai.chat.completions.create(
            messages=messages, **kwargs
        )
        self._log_cost(response)
        return response

    def _log_cost(self, response):
        # Attribution cannot be forgotten: every call through this wrapper
        # carries the feature and user it was constructed with
        log_event({
            "feature": self.feature,
            "user_id": self.user_id,
            "model": response.model,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
        })
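Usage then looks like this (feature name and model are illustrative): the feature tag is set once when the wrapper is constructed, so individual call sites cannot forget it.

summarizer = LLMMiddleware(feature="document_summary", user_id="user_123")
response = summarizer.chat(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the attached report."}],
)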
Option 3: Proxy-based tracking (recommended for production)
A proxy sits in the request path and captures cost data automatically. No code changes are required at call sites. Every request is logged with the proxy's metadata, and you can add feature/user attribution via request headers.
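With the official OpenAI Python client, routing traffic through a proxy is a base URL change plus default headers. A minimal sketch; the proxy URL and attribution header names here are placeholders, since the headers a proxy accepts are proxy-specific:

from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.example.com/v1",  # hypothetical proxy endpoint
    default_headers={
        # Illustrative header names; check your proxy's documentation
        "X-Feature": "document_summary",
        "X-User-Id": "user_123",
    },
)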
PromptUnit implements this at the infrastructure level: every request through the proxy is logged with model, token counts, actual cost, and the savings from routing. The data is available in the dashboard with per-feature and per-model breakdowns.
Alerting Setup
Cost tracking is only useful if anomalies surface in time to act. Three alert types cover most scenarios:
Absolute cost threshold
Alert when spend in the current billing period exceeds a fixed dollar amount. Set it at 80% of your expected monthly budget.
Alert: Monthly LLM spend has exceeded $4,000 (80% of $5,000 budget)
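A minimal sketch of the check, assuming spend for the current billing period is summed from the event log and send_alert stands in for whatever notification hook you use:

MONTHLY_BUDGET_USD = 5_000.00
ALERT_FRACTION = 0.80

def check_budget(spend_this_period: float):
    threshold = MONTHLY_BUDGET_USD * ALERT_FRACTION
    if spend_this_period >= threshold:
        # Fires once spend crosses 80% of the monthly budget
        send_alert(f"Monthly LLM spend has exceeded ${threshold:,.0f} "
                   f"(80% of ${MONTHLY_BUDGET_USD:,.0f} budget)")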
Percentage increase vs rolling average
Alert when daily or hourly cost is significantly above the rolling 7-day or 30-day average. This catches spikes that might not hit absolute thresholds but indicate something changed.
Alert: LLM spend in the last 1 hour is 3.2x the 7-day hourly average
This type of alert is most effective for catching unexpected traffic increases, runaway loops, or new features calling expensive models at high volume.
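As a sketch, assuming hourly spend totals are queryable from the event log and reusing the hypothetical send_alert hook:

def check_hourly_spike(last_hour: float, past_week_hours: list[float],
                       multiplier: float = 3.0):
    # past_week_hours: one spend total per hour over the trailing 7 days
    baseline = sum(past_week_hours) / len(past_week_hours)
    if baseline > 0 and last_hour > baseline * multiplier:
        send_alert(f"LLM spend in the last 1 hour is "
                   f"{last_hour / baseline:.1f}x the 7-day hourly average")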
Per-feature cost anomaly
Alert when a specific feature's per-call cost increases significantly. A 5x increase in cost-per-call for a feature typically indicates a prompt change, context window change, or model upgrade that was not cost-reviewed.
Alert: Feature "document_analysis" average cost per call increased 4.7x
(from $0.023 to $0.108) in the last 24 hours
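The same pattern applies per feature, comparing today's average cost per call against a trailing baseline; a sketch:

def check_feature_anomaly(feature: str, avg_today: float,
                          avg_baseline: float, threshold: float = 3.0):
    # avg_baseline: average cost per call over the prior 7 or 30 days
    if avg_baseline > 0 and avg_today / avg_baseline >= threshold:
        send_alert(f'Feature "{feature}" average cost per call increased '
                   f"{avg_today / avg_baseline:.1f}x (from ${avg_baseline:.3f} "
                   f"to ${avg_today:.3f}) in the last 24 hours")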
Cost Anomaly Detection
Common causes of unexpected cost spikes:
New feature launched without cost estimate. A feature that generates long outputs, uses a long context window, or chains multiple model calls can have 10-20x the cost of a typical feature. If cost estimation is not part of the feature review process, the first signal is the invoice.
Prompt change increased output length. A prompt that previously generated 200-token responses now generates 800-token responses after an edit. Output tokens are typically priced 3-4x higher than input tokens, so output length has outsized cost impact.
Context window growth over time. For multi-turn conversation features, context grows with each turn. A conversation that started at 500 tokens per call may be at 8,000 tokens after 20 turns. If your application does not truncate or summarize conversation history, per-call costs grow unbounded over a session (see the truncation sketch after this list).
Model upgrade without routing review. Upgrading from GPT-4o-mini to GPT-4o for one feature, or switching from Claude Haiku to Sonnet, is roughly a 16x or 3-4x cost increase per call at list prices. If the change is not flagged in cost monitoring, it is invisible until the invoice.
Runaway agent loop. Agentic workflows that make tool calls and self-correct can generate many more model calls than expected. A bug that causes an agent to loop, or a task that is more complex than the retry budget assumed, can generate 50-100x the expected calls.
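For the context-growth cause above, the usual fix is to cap the history sent on each turn. A minimal sketch that keeps the system prompt plus the most recent turns under a token budget, using tiktoken for counting (a production version might summarize dropped turns instead of discarding them):

import tiktoken

def truncate_history(messages: list[dict], model: str = "gpt-4o",
                     max_tokens: int = 4_000) -> list[dict]:
    enc = tiktoken.encoding_for_model(model)
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(len(enc.encode(m["content"])) for m in system)
    kept = []
    # Walk backwards from the newest turn, keeping as many as fit
    for message in reversed(turns):
        tokens = len(enc.encode(message["content"]))
        if used + tokens > max_tokens:
            break
        kept.append(message)
        used += tokens
    return system + list(reversed(kept))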
For context on what a $5,000/month OpenAI bill is actually composed of and what routing does to it, see OpenAI API Cost Calculator and Pricing Guide. For the case study on a SaaS team that cut costs 42%, see How a SaaS Team Cut AI Costs 42%. For the hidden costs of defaulting to a single expensive model, see The Hidden Cost of Defaulting to GPT-4o.
Budget Enforcement
Tracking tells you what happened. Enforcement prevents it.
Per-API-key budget caps
Set spending limits per API key at the provider level. OpenAI, Anthropic, and Google all support this. The risk: provider-level caps do not distinguish between features or users, so a single feature hitting its cap blocks all traffic using that key.
Per-feature rate limiting
Implement rate limiting at the feature level in your application or proxy. A feature that is expected to generate 10,000 calls/day at $0.05/call should stop accepting new requests if it has generated 12,000 calls.
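A sketch of the daily cap, using an in-memory counter for illustration (production would back this with Redis or the database; the feature name and cap are placeholders):

from datetime import date

daily_calls: dict = {}  # (feature, date) -> call count
DAILY_CAP = {"document_analysis": 12_000}

def enforce_feature_cap(feature: str):
    key = (feature, date.today())
    daily_calls[key] = daily_calls.get(key, 0) + 1
    if daily_calls[key] > DAILY_CAP.get(feature, float("inf")):
        raise RuntimeError(f"Feature {feature!r} exceeded its daily call cap")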
Per-user spending limits
For multi-tenant or consumer-facing products, enforce a per-user LLM spend limit. A single power user who generates 100x average LLM traffic is an economics problem, not a good user story.
Pre-call cost estimation
Before making a model call, estimate the expected cost based on input token count and the model being used. If the estimated cost exceeds a threshold, either reject the request, route to a cheaper model, or require explicit confirmation.
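A sketch of the pre-call gate: count input tokens with tiktoken, assume a worst-case output budget, and route down if the estimate exceeds a ceiling. The pricing table and the cheap-model fallback are illustrative:

import tiktoken

PRICES = {  # USD per 1M tokens: (input, output), list prices
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def estimate_cost(messages: list[dict], model: str,
                  max_output_tokens: int = 1_000) -> float:
    enc = tiktoken.encoding_for_model(model)
    input_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    in_price, out_price = PRICES[model]
    # Worst case: the model uses its full output budget
    return (input_tokens * in_price
            + max_output_tokens * out_price) / 1_000_000

def choose_model(messages: list[dict], preferred: str = "gpt-4o",
                 ceiling_usd: float = 0.10) -> str:
    if estimate_cost(messages, preferred) > ceiling_usd:
        return "gpt-4o-mini"  # route down instead of rejecting outright
    return preferred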
PromptUnit's Automatic Cost Tracking
PromptUnit's proxy layer captures cost data for every request without any code changes. The dashboard shows:
- Cost per model per day/week/month
- Model distribution across your traffic
- Savings from routing vs what would have been spent without routing
- Per-request metadata including model, tokens, cost, and routing decision
For teams that tag requests with feature and user metadata via headers, the dashboard breaks down costs along those dimensions as well.
The cost tracking is a byproduct of the routing infrastructure, not an add-on. Every request that passes through the proxy is automatically logged and attributed.
Try It Free
See exactly where your AI budget is going. PromptUnit's 14-day observation period shows you the savings before you commit to anything.
Try the live demo — no API key needed. Or talk to us if you want a walkthrough.