LLM Observability: The Metrics That Actually Drive Decisions

Most teams track monthly LLM spend. That metric tells you almost nothing about what to do next. You know you spent $8,400 last month. You don't know whether a single poorly designed feature burned $3,000 of that, whether your cache hit rate dropped because someone changed a system prompt, or whether your retry rate has been quietly adding 15% to every invoice for two months. The monthly total is a budget metric. It is not an optimization metric.

Production LLM systems need a different kind of observability, one built around the decisions you actually need to make: which feature to optimize, which model to replace, where latency is hurting users, and which calls are being retried silently.

The Metrics That Lead to Action

Cost per call by feature is the most important metric for cost optimization. Not total spend, and not average cost per call across your entire system, but cost per invocation broken down by the product feature that triggered the call. In practice, one feature almost always dominates. It might be a poorly designed onboarding flow that runs a large model with a lengthy prompt, a report generation feature that passes full document text into context, or a search ranking function that calls GPT-5.4 ($2.50/1M input, $15/1M output) when it could use GPT-4o-mini ($0.15/1M input, $0.60/1M output). Any of these can account for 30-50% of total spend. Without per-feature attribution, you cannot find them.
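As a rough sketch of per-feature attribution, assuming each logged call already carries a feature name and token counts. The price table reuses the figures quoted above; the model keys, feature names, and record layout are hypothetical.

```python
from collections import defaultdict

# USD per 1M tokens (input, output), taken from the prices quoted above.
# The keys are illustrative labels, not guaranteed API model IDs.
PRICES = {
    "gpt-5.4": (2.50, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def call_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one call from its token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

def cost_by_feature(calls):
    """Aggregate total cost, call count, and cost per call for each feature."""
    totals = defaultdict(lambda: {"cost": 0.0, "calls": 0})
    for c in calls:
        bucket = totals[c["feature_name"]]
        bucket["cost"] += call_cost(c["model"], c["input_tokens"], c["output_tokens"])
        bucket["calls"] += 1
    return {
        name: {**b, "cost_per_call": b["cost"] / b["calls"]}
        for name, b in totals.items()
    }

# Hypothetical log records.
calls = [
    {"feature_name": "search_ranking", "model": "gpt-5.4", "input_tokens": 2000, "output_tokens": 100},
    {"feature_name": "search_ranking", "model": "gpt-5.4", "input_tokens": 2000, "output_tokens": 100},
    {"feature_name": "classification", "model": "gpt-4o-mini", "input_tokens": 200, "output_tokens": 10},
]
report = cost_by_feature(calls)
```

Sorting the report by total cost is usually enough to show which feature dominates.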

Cost per output unit converts your LLM spend into meaningful business terms. The right unit depends on your product: cost per 1,000 words generated, cost per document processed, cost per user query answered. This is your AI unit economics. If your cost per document processed is $0.35 and you charge $0.10 per document, you have a margin problem regardless of your monthly total. If it's $0.04, you have pricing power. Monthly spend tells you neither of those things.
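The margin arithmetic in that example can be written out as a small helper. This is a minimal sketch; the function name and return shape are hypothetical, and the numbers are the ones from the paragraph above.

```python
def unit_margin(cost_per_unit, price_per_unit):
    """Return (margin in USD, margin as a fraction of price) for one output unit."""
    margin = price_per_unit - cost_per_unit
    return margin, margin / price_per_unit

# $0.35 to process a document you sell for $0.10: negative margin.
loss_usd, loss_fraction = unit_margin(cost_per_unit=0.35, price_per_unit=0.10)

# $0.04 to process the same document: positive margin.
gain_usd, gain_fraction = unit_margin(cost_per_unit=0.04, price_per_unit=0.10)
```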

Token efficiency ratio measures how much useful work you're getting out of the tokens you send. A simple formulation: divide useful output tokens by total tokens in the call (input plus output). If you're sending a 4,000-token context window to extract a 30-token structured field, your ratio is under 1%, and the fix is probably prompt restructuring or retrieval redesign rather than model switching. Teams with good token efficiency ratios have typically done the work of trimming system prompts, removing redundant few-shot examples, and pruning context windows. Teams that haven't done that work often find that prompt optimization saves more money than any model change.
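A minimal sketch of that formulation, using the 4,000-token example. How you count "useful" output tokens is an assumption you have to define per feature; here all 30 output tokens are treated as useful.

```python
def token_efficiency(useful_output_tokens, input_tokens, output_tokens):
    """Useful output tokens divided by all tokens in the call (input plus output)."""
    total = input_tokens + output_tokens
    if total == 0:
        return 0.0
    return useful_output_tokens / total

# 4,000 tokens of context in, a 30-token structured field out.
ratio = token_efficiency(useful_output_tokens=30, input_tokens=4000, output_tokens=30)
```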

Quality score by model is what makes routing decisions defensible. If you have any quality signal at all, whether human labels, an LLM-as-judge evaluation, or a downstream proxy like conversion rate or user correction rate, track it alongside cost per call. The combination tells you whether a cheaper model is actually cheaper on a quality-adjusted basis. A model that costs 60% less but produces output that requires human correction 30% of the time may cost more when you factor in remediation. This metric is the foundation of the LLM model routing decisions that actually hold up in production.
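One way to sketch the quality-adjusted comparison, assuming every output that needs correction is fixed once at a fixed cost. The per-call costs, correction rates, and the $0.05 remediation cost are hypothetical numbers chosen to illustrate the point, not measurements.

```python
def cost_per_accepted_output(model_cost, correction_rate, remediation_cost):
    """Expected cost of one usable output, including remediation.

    Assumes each flagged output is corrected once at a fixed cost.
    """
    return model_cost + correction_rate * remediation_cost

# Hypothetical: the small model costs 60% less per call,
# but 30% of its outputs need a $0.05 human correction.
large = cost_per_accepted_output(model_cost=0.010, correction_rate=0.05, remediation_cost=0.05)
small = cost_per_accepted_output(model_cost=0.004, correction_rate=0.30, remediation_cost=0.05)
cheaper_model_wins = small < large
```

With these inputs the cheaper model ends up more expensive per usable output, which is the case the paragraph above warns about.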

Retry rate is one of the most undertracked metrics in production LLM systems. When a call fails due to a timeout, a 500 error, a malformed output that fails validation, or a rate limit hit, your system retries. Each retry is a cost you didn't budget. Measured as retries per original request, a 10% retry rate means you're paying for roughly 10% more calls than your request volume suggests, assuming a retry costs about as much as the first attempt. A 25% retry rate, which is not uncommon in systems with strict JSON output requirements and no structured output mode enabled, means you're paying for a quarter more calls than your usage suggests. Track this separately from raw error rate, because a call that succeeds on retry still appears as a success in your quality metrics.
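A sketch of how retry rate might be computed from attempt-level logs, assuming each attempt is logged with a request_id that is shared across retries of the same logical request. That field is an assumption for illustration, as is the simplification that a retry costs as much as a first attempt.

```python
from collections import Counter

def retry_rate(attempts):
    """Retries per logical request: (attempts - requests) / requests."""
    per_request = Counter(a["request_id"] for a in attempts)
    retries = sum(n - 1 for n in per_request.values())
    return retries / len(per_request)

def effective_cost_multiplier(rate):
    """Cost multiplier if every retry costs the same as a first attempt."""
    return 1.0 + rate

# Hypothetical log: four requests, one of which (r2) was retried once.
attempts = [
    {"request_id": "r1"},
    {"request_id": "r2"},
    {"request_id": "r2"},
    {"request_id": "r3"},
    {"request_id": "r4"},
]
rate = retry_rate(attempts)
multiplier = effective_cost_multiplier(rate)
```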

Cache hit rate tells you how much you're actually saving from caching, which is distinct from how much you could theoretically save. If you've implemented prompt caching and your hit rate is 40% when you expected 80%, something is invalidating your cache more often than expected, possibly a timestamp or user-specific variable embedded in what you thought was a stable system prompt. Anthropic's prompt caching costs $0.30/MTok for reads on Sonnet 4.6 versus $3.00/MTok uncached. If your cache isn't hitting, you're leaving those savings on the table. You can find the detailed math on cache economics in the prompt caching cost savings guide.
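To see what the gap between a 40% and an 80% hit rate is worth, here is a rough calculation using the two prices quoted above. It treats hit rate as the fraction of cacheable input tokens served from cache and deliberately ignores any separate cache write charge, so it is an approximation only.

```python
def blended_input_price(hit_rate, cached_price=0.30, uncached_price=3.00):
    """Average USD per 1M input tokens at a given cache hit rate (0.0 to 1.0)."""
    return hit_rate * cached_price + (1.0 - hit_rate) * uncached_price

expected = blended_input_price(0.80)  # the hit rate you planned for
actual = blended_input_price(0.40)    # the hit rate you measured
overpay_ratio = actual / expected
```

Under these assumptions, the 40% case pays more than twice as much per cacheable input token as the 80% case.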

Latency by percentile is a production health metric more than a cost metric, but it belongs in any complete observability stack. The p50 tells you the median experience. The p95 and p99 tell you what your power users or complex queries encounter. Average latency almost always looks acceptable because it smooths over the long tail. A p99 of 45 seconds on a synchronous user-facing call is a product problem. A p99 of 45 seconds on a batch pipeline is fine. Without percentile breakdowns, you cannot tell the difference.
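A small sketch of the percentile breakdown using Python's standard library. The sample is hypothetical: mostly fast calls with a handful of very slow ones.

```python
from statistics import mean, quantiles

def latency_percentiles(latencies_ms):
    """Return p50, p95, and p99 from a list of latencies (needs at least 2 samples)."""
    # n=100 yields 99 cut points: cuts[0] is p1 and cuts[98] is p99.
    cuts = quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical sample: 95 calls at 1 second and 5 calls at 45 seconds.
samples = [1000] * 95 + [45000] * 5
average = mean(samples)
pcts = latency_percentiles(samples)
```

On this sample the average sits far below the p99, which is the long-tail effect described above.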

Metrics That Feel Useful But Aren't

Total monthly spend is a budget metric. Use it to set limits, flag anomalies, and report to finance. Do not use it to decide what to optimize. By itself it tells you nothing about where to look.

Average token count per call looks like an efficiency metric but is nearly meaningless without segmentation. Your code assistant feature might average 8,000 tokens per call while your classification feature averages 200 tokens. An average of those two tells you nothing about either. The moment you segment by feature, average token count becomes useful. Without segmentation, it's noise.
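The effect of segmentation is easy to see in a few lines. The feature names and token counts below are hypothetical, chosen to match the averages in the example above.

```python
from collections import defaultdict
from statistics import mean

def mean_tokens_by_feature(calls):
    """Average tokens per call, computed separately for each feature."""
    grouped = defaultdict(list)
    for feature, tokens in calls:
        grouped[feature].append(tokens)
    return {feature: mean(values) for feature, values in grouped.items()}

# Hypothetical (feature, total tokens) pairs.
calls = [
    ("code_assistant", 7500),
    ("code_assistant", 8500),
    ("classification", 150),
    ("classification", 250),
    ("classification", 200),
]
blended = mean(tokens for _, tokens in calls)
segmented = mean_tokens_by_feature(calls)
```

The blended average describes neither feature, while the segmented values recover the 8,000 and 200 token profiles.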

Model version distribution is worth tracking if you're actively routing between models, but as a standalone metric it's descriptive rather than prescriptive. Knowing that 60% of your calls go to Sonnet 4.6 and 40% go to Haiku 4.5 only matters if you know the quality and cost profile of each bucket. Without cost and quality attached, it's an inventory metric, not an optimization input.

How to Instrument This

Tag every API call with a consistent set of fields: user_id, feature_name, model, input_tokens, output_tokens, latency_ms, a boolean for whether the response was served from cache, and quality_score where you have it. This schema works whether you're storing events in a time-series database, a data warehouse, or a simple Postgres table with a created_at timestamp index.
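One possible shape for that schema, sketched with a dataclass and an in-memory SQLite table standing in for Postgres. The table and index names are hypothetical. The request_id and attempt columns are additions that are not in the field list above; they are included here on the assumption that you want to compute retry rate from the same table.

```python
import sqlite3
from dataclasses import astuple, dataclass
from typing import Optional

@dataclass
class LLMCallEvent:
    created_at: str            # ISO 8601 timestamp
    request_id: str            # assumption: shared across retries of one request
    attempt: int               # assumption: 1 for the first try, 2+ for retries
    user_id: str
    feature_name: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    cache_hit: bool            # True if the response was served from cache
    quality_score: Optional[float] = None  # only where a quality signal exists

SCHEMA = """
CREATE TABLE llm_calls (
    created_at    TEXT    NOT NULL,
    request_id    TEXT    NOT NULL,
    attempt       INTEGER NOT NULL,
    user_id       TEXT    NOT NULL,
    feature_name  TEXT    NOT NULL,
    model         TEXT    NOT NULL,
    input_tokens  INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    latency_ms    INTEGER NOT NULL,
    cache_hit     INTEGER NOT NULL,
    quality_score REAL
);
CREATE INDEX idx_llm_calls_created_at ON llm_calls (created_at);
"""

def log_call(conn, event):
    """Insert one event; field order matches the column order in SCHEMA."""
    conn.execute(
        "INSERT INTO llm_calls VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        astuple(event),
    )

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical event.
log_call(conn, LLMCallEvent(
    "2025-01-01T00:00:00Z", "r1", 1, "u1", "search_ranking",
    "model-a", 2000, 100, 850, False,
))
row_count = conn.execute("SELECT COUNT(*) FROM llm_calls").fetchone()[0]
```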

Build two separate views from this data. An operational view updates hourly and shows latency percentiles, error rate, and retry rate per feature. You look at this when something feels wrong. A financial view aggregates daily and shows cost per feature, week-over-week trend, and cost per output unit. You look at this when planning optimization work.
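The financial view can be sketched as two small aggregations. This assumes each event already has a cost_usd value computed at log time, which is a hypothetical field name, and that created_at is an ISO 8601 string.

```python
from collections import defaultdict

def daily_cost_by_feature(events):
    """Sum cost_usd per (day, feature_name)."""
    totals = defaultdict(float)
    for e in events:
        day = e["created_at"][:10]  # the YYYY-MM-DD prefix of the timestamp
        totals[(day, e["feature_name"])] += e["cost_usd"]
    return dict(totals)

def week_over_week(this_week, last_week):
    """Fractional change versus last week; None when there is no baseline."""
    if last_week == 0:
        return None
    return (this_week - last_week) / last_week

# Hypothetical events.
events = [
    {"created_at": "2025-01-01T09:00:00Z", "feature_name": "reports", "cost_usd": 0.40},
    {"created_at": "2025-01-01T17:30:00Z", "feature_name": "reports", "cost_usd": 0.10},
    {"created_at": "2025-01-02T08:15:00Z", "feature_name": "reports", "cost_usd": 0.25},
]
daily = daily_cost_by_feature(events)
trend = week_over_week(this_week=120.0, last_week=100.0)
```

The operational view follows the same pattern with an hourly bucket and latency, error, and retry aggregates in place of cost.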

The engineering investment to build this from scratch is non-trivial, which is why many teams defer it until costs are already painful. PromptUnit logs all of these fields automatically per call and surfaces cost-per-feature breakdowns in a pre-built dashboard, so you can start with this visibility on day one rather than building toward it.

Connecting Metrics to Decisions

The value of this observability stack is that it creates a clear path from data to action. High cost per call on a specific feature points to model selection or prompt redesign for that feature. Low token efficiency ratio points to prompt restructuring. High retry rate points to output validation issues or provider reliability problems, which leads directly to the question of how to build failover between providers. Low cache hit rate points to prompt structure problems. Poor quality score on a cheap model points to either upgrading the model for that task or improving the prompt.
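The mapping above can be expressed as a simple rule table. This is only an illustration: the threshold values are hypothetical and should be tuned against your own baselines, and the quality rule applies to whichever model serves the feature.

```python
# Hypothetical thresholds; tune them against your own baselines.
THRESHOLDS = {
    "cost_per_call": 0.01,     # USD, flag when above
    "token_efficiency": 0.05,  # flag when below
    "retry_rate": 0.10,        # flag when above
    "cache_hit_rate": 0.60,    # flag when below
    "quality_score": 0.80,     # flag when below
}

def next_steps(metrics, thresholds=THRESHOLDS):
    """Map one feature's metrics to the follow-up actions described above."""
    steps = []
    if metrics["cost_per_call"] > thresholds["cost_per_call"]:
        steps.append("review model selection or redesign the prompt")
    if metrics["token_efficiency"] < thresholds["token_efficiency"]:
        steps.append("restructure the prompt")
    if metrics["retry_rate"] > thresholds["retry_rate"]:
        steps.append("check output validation and provider reliability")
    if metrics["cache_hit_rate"] < thresholds["cache_hit_rate"]:
        steps.append("fix prompt structure so the cached prefix stays stable")
    if metrics["quality_score"] < thresholds["quality_score"]:
        steps.append("upgrade the model for this task or improve the prompt")
    return steps

# Hypothetical metrics for one feature.
steps = next_steps({
    "cost_per_call": 0.02,
    "token_efficiency": 0.20,
    "retry_rate": 0.25,
    "cache_hit_rate": 0.90,
    "quality_score": 0.90,
})
```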

Each metric has a natural next step. Monthly spend does not.

Teams that build this instrumentation early find that the data actively shapes their engineering priorities. The feature you assumed was expensive turns out to be efficient. The feature you ignored turns out to be costing three times what you thought. That reordering of priorities, grounded in actual numbers rather than assumptions, is what good LLM observability produces.

Start measuring cost-per-feature today with PromptUnit at promptunit.ai.
