
Function Calling Has a Hidden Multiplier. Why Your Agent Loop Bills 5x Your Estimate.

Tool definitions get re-sent on every turn. Tool results get re-included in context. Multi-step agent loops compound both. The result is a token bill 3-5x what a back-of-envelope calculation predicts. Here is the math, and how to route around it.

function calling · ai agents · llm cost optimization · tool use · model routing

Consider a team building a customer-support agent in production. The estimate from the planning doc is $4,000/month at 50K resolved tickets, with an average of 3 LLM calls per resolution. Three months in, the actual bill is $19,000.

The calls-per-ticket assumption was right. The token count per call was wrong. By a lot.

Function calling and tool use have a cost shape that is not obvious from reading the API docs. The published price is per input and output token; the docs do not explain how aggressively a typical agent loop accumulates context, re-sends tool definitions, and re-includes tool results on every turn. The result, for almost every team building agents in 2026, is a bill 3-5x what their back-of-envelope math predicted.

This post is about where the multiplier comes from, the four most common architectural mistakes, and the routing pattern that bounds the cost without breaking the agent.

The multiplier comes from three places

1. Tool definitions are sticky.

When you register tools with OpenAI, Anthropic, or any other provider, the tool definitions count as input tokens on every call. A typical production agent has 8-15 tools. Each tool definition includes a name, description, parameter schema, and often per-parameter descriptions. A well-specified tool runs 200-400 tokens. Eight tools at 300 tokens each is 2,400 tokens of tool definitions in every prompt, before the conversation has even started.

If your agent has 5 turns in a conversation, you pay for those 2,400 tokens five times. That is 12,000 input tokens of tool definitions per conversation, none of which the user typed and none of which the model is "thinking about" in any productive way.

2. Tool results get accumulated.

In an agent loop, when the model calls a tool, the result gets appended to the conversation history. Subsequent turns include the full conversation, including every prior tool call's input arguments and its output. A tool that returns a 500-line JSON document means every subsequent turn is 500 lines longer.

For agents that fetch data (search results, database queries, API responses), the conversation context can grow by tens of thousands of tokens across a 5-turn loop. By turn 5, the prompt being sent to the model might be 30K tokens, of which 28K is tool-result history.
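Here is a minimal sketch of that naive loop in the OpenAI tool-calling style (the same shape applies to Anthropic and other providers). `TOOLS` and `run_tool()` are hypothetical stand-ins for your own tool schemas and dispatcher, and the model name is illustrative; the point is where the two multipliers live in the structure.

```python
# A naive agent loop: TOOLS is serialized into every request, and every
# tool result is appended to `messages` and re-sent verbatim on each
# later turn. TOOLS and run_tool() are hypothetical stand-ins.
import json
from openai import OpenAI

client = OpenAI()

def naive_agent_loop(user_message: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        # Multiplier 1: ~2,400 tokens of tool definitions ride on every call.
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final user-facing answer
        messages.append(msg)  # keep the assistant's tool-call turn
        for call in msg.tool_calls:
            result = run_tool(call.function.name, json.loads(call.function.arguments))
            # Multiplier 2: the full tool result (possibly a 500-line JSON
            # document) is appended and re-sent on every subsequent turn.
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return "escalated to human"
```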

3. Reasoning models bill the chain-of-thought tokens.

If you are using o3 or DeepSeek R1 for the agent's reasoning step, the model's internal chain-of-thought tokens count as output. A "simple" agent decision that resolves to "call tool X with these arguments" might involve 2,000-4,000 tokens of internal reasoning the user never sees but pays for in full.

These three multipliers stack. A 5-turn agent loop with 8 tools, two tool calls per turn, and a reasoning model can bill 10x the input tokens and 5x the output tokens of the user's actual conversation length.

Where the bill actually lands

Take that customer-support agent. Plan-doc math:

  • Average ticket: 3 LLM calls
  • Average call: 1,500 input + 800 output
  • 50,000 tickets: 225M input + 120M output tokens
  • On GPT-5.4 ($2.50 / $15.00 per million tokens): $562 + $1,800 = $2,362/month

The estimate was $4K because the planning doc added a 70% buffer for variance. Reality:

  • 8 tools × 300 tokens = 2,400 tokens of tool definitions per call (always present)

  • Tool results averaging 1,200 tokens each, 1.5 tool calls per turn = 1,800 tokens accumulating per turn

  • By turn 3, prompt has accumulated 5,400 tokens of tool-result history

  • Real average call: 5,400 input + 800 output

  • 50,000 tickets × 3 calls × (5,400 + 800) tokens = ~930M tokens

  • Split roughly 810M input + 120M output

  • On GPT-5.4: $2,025 + $1,800 = $3,825/month at the low end

  • The actual bill, with longer-than-average tool results and some reasoning-model usage: $19K

The plan-doc math underestimated by 4-5x because it counted the user-visible portion of the conversation and missed the structural overhead.
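The same arithmetic as a small, runnable cost model, using only the illustrative figures from this post (prices in dollars per million tokens):

```python
# Plan-doc estimate vs. structural floor, using the figures above.
TICKETS, CALLS_PER_TICKET = 50_000, 3
PRICE_IN, PRICE_OUT = 2.50 / 1e6, 15.00 / 1e6   # $/token

def monthly_cost(input_per_call: int, output_per_call: int) -> float:
    calls = TICKETS * CALLS_PER_TICKET
    return calls * (input_per_call * PRICE_IN + output_per_call * PRICE_OUT)

plan = monthly_cost(1_500, 800)   # ~$2,362: user-visible conversation only
real = monthly_cost(5_400, 800)   # ~$3,825: tool definitions + accumulated results
print(f"plan ${plan:,.0f}  structural floor ${real:,.0f}")
# Longer-than-average tool results and reasoning-model turns push the
# actual bill well past this floor.
```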

The four mistakes that produce the multiplier

Mistake one: not caching tool definitions.

Tool definitions are perfect cache material. They are static, identical across calls, and consume 1,500-4,000 tokens per request. Anthropic's prompt caching at 90% off (or OpenAI's at 50% off) cuts the per-call tool-definition tax to a fraction of its previous size. Most teams know caching exists; most do not structure their tool definitions to maximize cache hits because the cache key depends on prompt prefix ordering. For tool-heavy workloads, getting that ordering right is the single highest-leverage change available before touching any routing logic.
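A minimal sketch of the cache-friendly request shape on Anthropic's Messages API: static content first, `cache_control` markers on the last static block, variable conversation last. `TOOLS`, `SYSTEM_PROMPT`, and `conversation` stand in for your own (byte-identical across calls) tool list, instructions, and per-call history, and the model ID is illustrative.

```python
# Cache-friendly ordering: tools and system prompt form the cached
# prefix; only `conversation` varies per call.
from anthropic import Anthropic

client = Anthropic()

tools = [dict(t) for t in TOOLS]
tools[-1]["cache_control"] = {"type": "ephemeral"}  # caches all tool definitions

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,                   # static instructions
        "cache_control": {"type": "ephemeral"},  # extends the cached prefix
    }],
    messages=conversation,                        # the only part that changes
)
```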

Mistake two: passing full tool results into subsequent turns.

If a tool returns a 500-line JSON document, the agent rarely needs all 500 lines for the next turn's decision. Most teams pass the full result back because it is the easy default. The fix is a tool-result summarization step: after each tool call, run a small, cheap, fast model to extract the relevant fields and pass only those into subsequent turns. The summarization step costs a fraction of what long-tail context accumulation would. We covered the broader cost pattern in our LLM model routing guide; agent loops are the most extreme version of why per-call routing matters.
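One way to wire that step, sketched with a hypothetical `small_model_complete()` helper standing in for whatever cheap model you route to; the key property is that only the distilled fields enter the conversation history.

```python
# Compress a raw tool result before it enters the conversation history.
# small_model_complete() is a hypothetical wrapper around a cheap, fast
# model; MAX_RAW_CHARS is a threshold below which a summarization call
# isn't worth making.
import json

MAX_RAW_CHARS = 2_000

def compact_tool_result(tool_name: str, raw_result: dict, turn_goal: str) -> str:
    raw = json.dumps(raw_result)
    if len(raw) <= MAX_RAW_CHARS:
        return raw  # small results pass through untouched
    prompt = (
        f"Tool `{tool_name}` returned the JSON below. Extract only the fields "
        f"relevant to: {turn_goal}. Reply with compact JSON, no commentary.\n\n{raw}"
    )
    return small_model_complete(prompt)  # this, not `raw`, goes into history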

Mistake three: using one big model for all turns.

A typical agent loop has heterogeneous turns. Turn one might decide "this is a refund request, call the refund-policy lookup tool." That decision does not need GPT-5.4 or Claude Opus. Turn three might be "synthesize all the gathered context into a final response to the user." That does need a strong model.

Routing every turn to the same model means paying frontier prices for the easy turns. The fix is per-turn routing: cheap, fast models on the routing and dispatch turns; stronger models on the synthesis turns. This is the same pattern as fan-out routing, which we walked through in our Groq latency-tier analysis.

Mistake four: accumulating reasoning traces in conversation history.

When using a reasoning model for an agent step, some teams include the model's chain-of-thought tokens in the conversation history that gets sent to subsequent turns. This is rarely intentional; it usually happens because the agent framework does not distinguish between visible-to-user output and internal reasoning, and just concatenates everything.

The fix is to drop the reasoning trace from the conversation context after the tool call resolves. Keep the tool call itself and its result; drop the chain of thought. This often cuts input token count on subsequent turns by 30-60% with zero quality impact.
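A sketch of the pruning step, assuming a dict-based history where reasoning appears as `thinking` content blocks (the block types to drop differ by provider and framework, and some providers require the trace to stay attached until that turn's tool result has been returned, so prune only resolved turns):

```python
# Strip reasoning traces from already-resolved turns before the next
# model call. Block type names here are assumptions; adapt them to
# whatever your framework emits.
DROP_BLOCK_TYPES = {"thinking", "redacted_thinking"}

def prune_reasoning(messages: list[dict]) -> list[dict]:
    pruned = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            content = [b for b in content if b.get("type") not in DROP_BLOCK_TYPES]
            if not content:
                continue  # message was pure reasoning; drop it entirely
            msg = {**msg, "content": content}
        pruned.append(msg)
    return pruned
```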

The routing pattern that works

For an agent with 8 tools and 3-5 turn loops, the pattern that produces the lowest bill at production-grade quality:

  • Routing turn (decide which tool to call): Llama 3.1 8B on Groq at $0.05/MTok input. With 80% cache hit on the tool definitions, the per-turn input cost is roughly $0.0001.
  • Tool-result summarization turn: Same small model. Cheap, fast, and extracts only what subsequent turns need.
  • Synthesis turn (generate user-facing response): GPT-4o-mini ($0.15/MTok) or GPT-5.4-mini ($0.75/MTok). The synthesis turn needs quality but not necessarily frontier capability.
  • Reasoning escalation (only when synthesis fails or the task is genuinely complex): o3 or DeepSeek R1, called on fewer than 5% of turns.

For the customer-support agent above, this pattern collapses the bill from $19K to roughly $2,500/month, with no measurable quality drop on resolution rate. The savings come from three places: cheap routing turns, summarized context, and avoiding reasoning models on turns that do not need them.
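The tier map above, as a configuration sketch. The model IDs and tiers are the illustrative ones from this post; `classify_turn()` and `call_provider()` are hypothetical helpers standing in for your own turn classifier and provider client.

```python
# Per-turn routing: each turn type gets the cheapest tier that holds
# quality. classify_turn() and call_provider() are hypothetical.
TIER_FOR_TURN = {
    "route":      {"provider": "groq",   "model": "llama-3.1-8b-instant"},
    "summarize":  {"provider": "groq",   "model": "llama-3.1-8b-instant"},
    "synthesize": {"provider": "openai", "model": "gpt-4o-mini"},
    "reason":     {"provider": "openai", "model": "o3"},   # <5% of turns
}

def dispatch(messages: list[dict], tools: list[dict]):
    turn_type = classify_turn(messages)          # route / summarize / synthesize / reason
    tier = TIER_FOR_TURN[turn_type]
    return call_provider(tier["provider"], tier["model"],
                         messages=messages, tools=tools)
```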

The instrumentation that catches the multiplier early

Most teams discover their agent-loop multiplier in the third month, when the bill arrives. The instrumentation that catches it in week one:

  • Per-turn token count, not just per-conversation totals. A 5-turn loop that bills 30K tokens in total looks normal; a loop whose individual turns bill 30K, 32K, 35K, 38K, 40K of input (compounding accumulation) is a red flag.
  • Tool-definition share of total input. If tool definitions account for more than 15% of total input tokens, caching is mis-configured.
  • Tool-result share of total input. If tool results account for more than 30% of total input tokens by turn 3, summarization is missing.
  • Per-tool call cost. Not all tools are equally expensive. A tool that returns 5,000-token JSON blobs on every call deserves a separate line item in your cost dashboard.

Without these metrics, the agent-loop bill is a black box. Adding them takes a day; the savings from acting on them are usually in the 50-70% range on agent workloads, based on observed routing patterns.
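The four metrics reduce to a handful of counters per turn. A minimal sketch, assuming the per-component token counts come from your provider's usage object or your own tokenizer, and with `emit()` as a hypothetical hook into your metrics pipeline:

```python
# Per-turn cost telemetry: enough to catch the multiplier in week one.
from dataclasses import dataclass

@dataclass
class TurnUsage:
    turn_index: int
    input_tokens: int            # full prompt for this turn
    tool_def_tokens: int         # portion attributable to tool definitions
    tool_result_tokens: int      # portion attributable to accumulated tool results
    output_tokens: int
    tool_name: str | None = None # tool called on this turn, if any
    new_result_tokens: int = 0   # size of this turn's tool result

def record_turn(u: TurnUsage, emit) -> None:
    # Red flag 1: per-turn input climbing across the loop (accumulation).
    emit("agent.turn.input_tokens", u.input_tokens, tags={"turn": u.turn_index})
    # Red flag 2: tool definitions >15% of input -> caching mis-configured.
    emit("agent.turn.tool_def_share", u.tool_def_tokens / max(u.input_tokens, 1))
    # Red flag 3: tool results >30% of input by turn 3 -> summarization missing.
    emit("agent.turn.tool_result_share", u.tool_result_tokens / max(u.input_tokens, 1))
    # Red flag 4: per-tool line item for the cost dashboard.
    if u.tool_name:
        emit("agent.tool.result_tokens", u.new_result_tokens, tags={"tool": u.tool_name})
```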

The one architectural choice that bounds the worst case

If you read nothing else from this post: separate the agent's "what should I do next" decision (the routing turn) from the agent's "do the actual work" decision (the synthesis turn), and route them to different models.

The routing turn is small, frequent, and well-suited to a cheap fast model. The synthesis turn is rarer, longer, and earns a stronger model. Most agent frameworks default to "use one model for everything" because it is the simplest configuration. That default is also the one that produces the 5x multiplier on the bill.

Once you have two routing tiers wired up, the per-turn cost dynamics become visible and tunable. Every subsequent optimization (caching, summarization, reasoning-trace pruning) plugs into that same two-tier scaffolding without breaking the agent.

How PromptUnit handles this

PromptUnit's routing layer treats agent turns as first-class routing decisions. The router classifies each turn by type (routing, tool-result summarization, synthesis, reasoning escalation), and dispatches to the appropriate provider tier automatically. The dialect translation layer normalizes tool definitions across OpenAI, Anthropic, and Groq, so changing the synthesis-turn provider does not require rewriting the tool schemas. The token-inflation defense layer catches the case where a tool starts returning 50% larger payloads than baseline (often a bug in the tool implementation or a model gaming the output length), so the multiplier does not silently compound. The 14-day observation period catches the per-turn cost dynamics before any traffic shifts.

If your agent-loop bill is the line item your CFO is asking about, the per-turn routing pattern is the cleanest single intervention available. Start the free observation period at promptunit.ai and see which of your agent turns are paying for the wrong tier.

Start your 14-day observation period

See exactly how much you'd save before paying anything. Zero risk: if we save you $0, you pay $0.

Get started free →