All posts
·7 min read

LLM Cost Attribution by Feature: Why One API Key Is Costing You More Than You Know

Running your entire product off one API key with no metadata tagging means you have a monthly AI bill with no idea where it comes from. Here's how to fix that.

cost-attributionobservabilityllm-cost-managementapi-tagging

If you are running your entire product off one OpenAI API key with no metadata tagging, you have a monthly bill and no idea where it comes from. This is the default state for most early-stage SaaS companies, and it becomes a serious problem the moment someone in finance asks why AI costs grew 40% last month.

The problem with unattributed AI spend is not the total number. It is that you cannot take action on a total. You cannot optimize a monthly spend figure. You can only optimize specific behaviors, and you can only identify those behaviors if you know which features, users, and workflows are driving costs. Without attribution, you are managing a budget line the same way you would manage a utility bill, except that unlike electricity, your AI costs vary by orders of magnitude based on decisions your engineering team is making continuously.

The typical discovery when teams first add cost attribution is that one or two features account for 60-70% of total AI spend. This is almost always surprising. The feature at the top is rarely the one that was designed with cost in mind, because it was not designed with cost in mind at all. It was a side feature, or an internal tool, or a feature that never got high adoption but runs a large language model call on every page load for users who do have it enabled. Without attribution, this kind of quiet cost driver runs undetected indefinitely.

Why Attribution Is the Foundation

Cost attribution is not a direct cost-saving measure. It is the prerequisite for every other optimization. You cannot answer whether to switch a feature from GPT-4o to GPT-4o-mini without knowing that feature's current token volume and spend. You cannot evaluate whether a caching strategy is worth implementing without knowing how often the same prompt is being sent. You cannot calculate the ROI of a prompt compression project without knowing the baseline cost it would reduce. Every optimization decision downstream depends on having clean per-feature data.

The questions attribution lets you answer are the basic ones that should be answerable from day one. Which feature uses the most tokens? Which user segment drives the most AI costs? Is cost per active user growing or shrinking month-over-month? Are free-tier users consuming a disproportionate share of AI spend? Is a specific A/B variant more expensive than the control? None of these questions can be answered from a single API key's billing dashboard.

Implementation Approaches

The simplest approach for OpenAI is to pass a "user" parameter in each API call. This field accepts a string, and OpenAI surfaces per-user spend in its dashboard. It is limited to one dimension, but for products where per-user cost tracking is the primary need, it solves the immediate problem. The limitation is that you cannot also tag by feature in the same field, so this approach forces a choice between user-level and feature-level attribution.

For richer attribution, the practical pattern is to log metadata alongside each API call in your own infrastructure. On every call, record the feature name, user ID, user tier, model, token counts, and any A/B variant identifiers. OpenAI does not expose a native multi-field tagging API in the same way Anthropic does, but this is straightforward to implement at the application layer. When you receive the API response, you have the token usage in the response body. Log it with your metadata and you have a complete record.

Anthropic's API supports a "metadata" field in the request body that accepts arbitrary key-value pairs, which are attached to the request record and can be used for attribution analysis. This makes multi-dimensional tagging cleaner at the API level rather than requiring a parallel logging system.

A separate API key per feature is the simplest architectural approach and provides clean billing separation in provider dashboards without any custom infrastructure. The tradeoff is operational overhead. You need to manage key rotation, access controls, and rate limit allocation across multiple keys. For products with more than five or six distinct AI-powered features, this approach becomes difficult to maintain. It is a reasonable starting point for small teams that need immediate visibility.

Proxy-level tagging is the most scalable approach for established products. Route all API calls through an internal proxy that adds metadata fields before forwarding requests to the provider. This lets you add attribution across all callsites without modifying application code at every location. It also creates a centralized point for logging, rate limiting, and cost controls. This is how PromptUnit handles attribution, logging all API calls with tagging support so teams can attribute costs per feature without changing their application code.

What to Tag

At minimum, every API call should carry two pieces of metadata: the feature name and the user ID. Feature name tells you where costs are concentrated in your product surface. User ID lets you calculate per-user economics, which is essential for understanding unit cost as a function of plan tier or usage pattern.

Beyond the minimum, the highest-value additional tags are user tier (free, pro, enterprise), the specific model used if you have routing logic, and any A/B variant identifier if you are running experiments. Session ID is useful for multi-turn applications where you want to attribute an entire conversation to a single feature invocation rather than counting each turn separately.

Avoid over-tagging at the start. Adding fifteen dimensions to every call creates a schema maintenance burden and makes your attribution queries more complex without proportionally more insight. Start with feature and user tier. Add dimensions when a specific question you cannot answer forces you to.

What to Do With the Data

Raw attribution logs become useful when they are aggregated into a weekly cost-per-feature report. The format matters less than the habit. Pull total spend, average tokens per call, and call volume by feature for the past seven days. Sort by total spend. The top three items deserve investigation every week.

For each high-spend feature, ask three questions in order. First, is this model the right one for this task? If the feature is doing classification or extraction, it may be a candidate for a cheaper model. The GPT-4o vs GPT-4o-mini analysis provides a framework for making that call. Second, is the average token count reasonable? High average token counts often indicate a system prompt that has grown without discipline, a context window that includes more history than necessary, or output length that is not being constrained. Third, is there a caching opportunity? If the same user is sending similar queries repeatedly, or if your system prompt is large and static, caching can reduce costs significantly with minimal engineering effort.

The goal of attribution is to make these questions answerable in minutes rather than hours. A team with clean attribution data can run a cost review in a weekly engineering meeting. A team without it needs a data engineering project before they can start asking the questions. For a complete framework on LLM observability metrics, attribution is step one of the stack.

Once attribution is in place, the optimization work compounds. Each improvement can be measured against the baseline. Cost per feature becomes a metric you can track and set targets for, the same way you track latency or error rate. Without it, AI cost optimization is guesswork applied to aggregate numbers. With it, it is engineering.

Build per-feature cost visibility into your AI infrastructure at www.promptunit.ai.

Start your 14-day observation period

See exactly how much you'd save before paying anything. Zero risk. if we save you $0, you pay $0.

Get started free →