The Multi-Agent Cost Spiral: Why Agentic AI Systems Cost 10-100x More Than You Expect
Every step in an agent loop compounds token count. A 20-step agent can easily cost 50x more than a single-call equivalent. Here is how the math works and how to design agent systems that don't blow the budget.
Most teams that build their first multi-agent AI system are shocked by the first billing cycle. A workload that costs one to two cents per run as a single-call prompt can easily reach $0.45 per run once it becomes a 20-step agent loop. That is a 20x to 45x cost increase for the same underlying task. The technical explanation is straightforward once you see it, but it is not obvious until you have built a few agents and watched the token counts.
Why Token Count Compounds Across Agent Steps
The core issue is that every step in an agent loop does not start from scratch. It starts with everything that came before it: the original task description, the outputs of every prior step, every tool call result, and every intermediate reasoning trace. The conversation history accumulates, and you send it in full with every new request.
Walk through a simple five-step agent. The initial task is 1,000 tokens. Step 1 receives 1,000 tokens of input, generates 200 tokens of output, and calls a tool that returns 500 tokens of results. Step 2 now starts with 1,000 tokens of original context plus 200 tokens from step 1's output plus 500 tokens of tool results, totaling 1,700 tokens of input. Step 2 generates 300 tokens and calls a tool that returns 400 tokens, so step 3 starts with 2,400 tokens of input. If steps 3 and 4 each add another 800 tokens of output and tool results, step 4 starts with 3,200 tokens and step 5 with 4,000. By step 5, you are sending four times the original task just to handle the next incremental step.
Add up the total input tokens across all five steps: 1,000 plus 1,700 plus 2,400 plus 3,200 plus 4,000 equals 12,300 tokens. A single-call equivalent that sends the full task context and receives the final answer might use 1,500 tokens of input. The five-step agent used 8.2 times more input tokens to accomplish the same task.
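The running totals above can be reproduced with a few lines of Python. This is a minimal sketch, not a billing tool: the per-step additions for steps 1 and 2 come from the walkthrough, while the 800 tokens added by steps 3 and 4 are an assumption chosen to match the 3,200 and 4,000 figures. The function and variable names are illustrative.

```python
# Trace of the five-step example: each step's input is the full history so far.
INITIAL_TASK_TOKENS = 1_000

# (model output tokens, tool result tokens) appended by steps 1 through 4.
# Steps 1 and 2 follow the walkthrough; the split for steps 3 and 4 is assumed.
ADDED_PER_STEP = [(200, 500), (300, 400), (400, 400), (400, 400)]


def input_tokens_per_step(initial, added):
    """Return the input size of every step, given what each prior step appended."""
    sizes = [initial]
    for output_tokens, tool_tokens in added:
        sizes.append(sizes[-1] + output_tokens + tool_tokens)
    return sizes


per_step = input_tokens_per_step(INITIAL_TASK_TOKENS, ADDED_PER_STEP)
total_input = sum(per_step)

print(per_step)                        # [1000, 1700, 2400, 3200, 4000]
print(total_input)                     # 12300
print(round(total_input / 1_500, 1))   # 8.2
```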
At 20 steps, the compounding is severe. If the context grows by an average of 600 tokens per step (a mix of model output and tool results), step 20 starts with 1,000 plus 19 times 600, which equals 12,400 tokens of input context. Under linear growth, the average context size across all steps is the midpoint of 1,000 and 12,400, or 6,700 tokens, so the cumulative input across all 20 steps is 20 times 6,700, which equals 134,000 tokens. The single-call equivalent is still around 1,500. That is roughly 90 times more input tokens. Because the per-step context grows linearly, the cumulative total grows with the square of the step count.
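The same estimate can be written as a closed-form arithmetic series, which makes it easy to try other step counts. A small sketch under the linear-growth assumption used above (the function name is illustrative):

```python
def cumulative_input_tokens(initial: int, growth_per_step: int, steps: int) -> int:
    """Total input tokens when step k (1-based) sends initial + (k - 1) * growth tokens."""
    # Sum of an arithmetic series: n * initial + growth * n * (n - 1) / 2.
    return steps * initial + growth_per_step * steps * (steps - 1) // 2


total_20 = cumulative_input_tokens(1_000, 600, 20)
total_40 = cumulative_input_tokens(1_000, 600, 40)

print(total_20)                   # 134000
print(round(total_20 / 1_500))    # 89
print(total_40)                   # 508000
```

Doubling the loop from 20 to 40 steps nearly quadruples the cumulative input, which is why long-running agents deserve the most scrutiny.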
The Dollar Impact at Production Scale
Using Claude Sonnet 4.6 at $3.00 per million input tokens and $15.00 per million output tokens, here is what a realistic 20-step agent costs per run. Each step sends an average of 5,000 tokens of context and generates 500 tokens of output. Input cost: 20 steps times 5,000 tokens divided by 1,000,000 times $3.00 equals $0.30 per run. Output cost: 20 steps times 500 tokens divided by 1,000,000 times $15.00 equals $0.15 per run. Total: $0.45 per agent run.
At 1,000 runs per day, that is $450 per day, or $13,500 over a 30-day month. The comparable single-call task, with 1,500 tokens of input and 500 tokens of output, costs $0.0045 for input plus $0.0075 for output, or about $0.012 per run and $360 per month. The agent system costs roughly 37 times more per month for the same task throughput.
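For readers who want to plug in their own numbers, here is a minimal cost calculator using the rates quoted above. It assumes flat per-step token counts and a 30-day month; the function name is illustrative. At these rates the single-call figure works out to about $0.012 per run.

```python
# Prices quoted in the article, in dollars per million tokens.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def run_cost(steps, input_tokens_per_step, output_tokens_per_step):
    """Dollar cost of one run, assuming every step has the same token counts."""
    input_cost = steps * input_tokens_per_step / 1_000_000 * INPUT_PRICE_PER_MTOK
    output_cost = steps * output_tokens_per_step / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


agent_per_run = run_cost(20, 5_000, 500)
single_per_run = run_cost(1, 1_500, 500)

RUNS_PER_MONTH = 1_000 * 30
print(f"agent:  ${agent_per_run:.3f} per run, ${agent_per_run * RUNS_PER_MONTH:,.0f} per month")
print(f"single: ${single_per_run:.3f} per run, ${single_per_run * RUNS_PER_MONTH:,.0f} per month")
print(f"ratio:  {agent_per_run / single_per_run:.1f}x")
```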
For teams running agents at scale, this is not a theoretical concern. It is why multi-agent architectures require fundamentally different cost discipline than single-call applications.
The Patterns That Make It Worse
Several common agent design patterns amplify the token multiplication problem beyond the baseline compounding effect.
Re-reading the full task description at every step is one of the most wasteful patterns. If your agent's system prompt or task description is 2,000 tokens and it is re-injected at every step, you are paying for those 2,000 tokens 20 times. This is a direct candidate for prompt caching, which on Anthropic models bills cache reads at about 10% of the base input rate, a 90% reduction on those repeated tokens. See the prompt caching cost savings guide for implementation details.
Keeping all tool outputs in context even after they are no longer needed is another common issue. If step 3 calls a database query and retrieves 1,000 tokens of results, and those results are only relevant for steps 3 and 4, there is no reason to carry them forward through steps 5 through 20. Most agents do this by default because it is the path of least resistance. Every unnecessary token in context from a prior tool call adds to every subsequent step's token count.
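One way to avoid carrying stale tool results is to tag each context item with the last step that needs it and filter before every call. The sketch below is an assumption about how such bookkeeping could look; ContextItem, needed_until_step, and prune_context are hypothetical names, not part of any agent framework.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContextItem:
    label: str
    tokens: int
    # Last step that still needs this item; None means keep it for the whole run.
    needed_until_step: Optional[int] = None


def prune_context(items: List[ContextItem], current_step: int) -> List[ContextItem]:
    """Drop items whose last useful step has already passed."""
    return [
        item for item in items
        if item.needed_until_step is None or item.needed_until_step >= current_step
    ]


context = [
    ContextItem("task description", 1_000),
    ContextItem("database query result from step 3", 1_000, needed_until_step=4),
]

tokens_at_step_4 = sum(item.tokens for item in prune_context(context, current_step=4))
tokens_at_step_5 = sum(item.tokens for item in prune_context(context, current_step=5))
print(tokens_at_step_4)  # 2000
print(tokens_at_step_5)  # 1000
```

In the example from the text, dropping the 1,000-token query result for steps 5 through 20 removes 16,000 input tokens from the run.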
Using frontier models for every step is perhaps the most avoidable cost amplifier. A 20-step agent typically has a mix of task types: some steps require genuine reasoning, while others are classification, routing, formatting, or simple extraction. Running all of them on Opus 4.8 or Claude Sonnet 4.6 when a Haiku 4.5 or GPT-4o-mini call would handle the simple steps adequately means paying premium rates for work that does not require it.
Not having early-exit conditions is the fourth common pattern. Agents that run a fixed number of steps regardless of whether the task is complete burn tokens on unnecessary cycles. If the task is solved at step 7, steps 8 through 20 are waste. Building confidence thresholds or completion-check logic that terminates the agent loop early is a straightforward optimization that many teams skip in the interest of shipping faster.
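A completion check can be as simple as a predicate evaluated after every step. The following is a sketch with toy stand-ins: step_fn and is_complete are hypothetical callables that a real system would replace with a model call and its own completion or confidence logic.

```python
from typing import Callable, List


def run_agent(
    step_fn: Callable[[int, List[str]], str],
    is_complete: Callable[[List[str]], bool],
    max_steps: int = 20,
) -> List[str]:
    """Run steps until the completion check passes or max_steps is reached."""
    history: List[str] = []
    for step in range(1, max_steps + 1):
        history.append(step_fn(step, history))
        if is_complete(history):
            break  # stop paying for steps the task no longer needs
    return history


# Toy stand-ins: the task is "solved" at step 7.
def fake_step(step: int, history: List[str]) -> str:
    return "FINAL ANSWER" if step == 7 else f"intermediate result {step}"


def fake_check(history: List[str]) -> bool:
    return history[-1] == "FINAL ANSWER"


steps_used = len(run_agent(fake_step, fake_check))
print(steps_used)  # 7
```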
Design Patterns That Control Costs
The most impactful change is mixing model tiers within a single agent pipeline. Map out every step in your agent and classify it: does this step require reasoning, synthesis, or judgment? Or is it routing, classification, formatting, or simple lookup? The latter category can run on Haiku 4.5 at $1.00/$5.00 per MTok instead of Sonnet 4.6 at $3.00/$15.00. On a 20-step agent where 12 steps are simple and 8 require reasoning, and where each step carries a similar token load, moving the simple steps to Haiku cuts model costs by about 40% without changing the quality of the final output. The saving is larger when the simple steps are the ones carrying the most context.
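The effect of tiering is easy to estimate before changing any code. This sketch uses the prices quoted above and assumes every step carries the same 5,000 input and 500 output tokens, which is a simplification; under that assumption the 12/8 split saves 40%, and the real figure depends on which steps carry the most context.

```python
# Dollars per million tokens as quoted in the article: (input, output).
PRICES = {
    "sonnet": (3.00, 15.00),
    "haiku": (1.00, 5.00),
}


def step_cost(tier, input_tokens=5_000, output_tokens=500):
    """Cost of one step on the given tier, assuming uniform token counts."""
    input_price, output_price = PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


all_sonnet = 20 * step_cost("sonnet")
mixed = 8 * step_cost("sonnet") + 12 * step_cost("haiku")
saving = 1 - mixed / all_sonnet

print(f"all Sonnet: ${all_sonnet:.2f}")   # all Sonnet: $0.45
print(f"mixed:      ${mixed:.2f}")        # mixed:      $0.27
print(f"saving:     {saving:.0%}")        # saving:     40%
```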
Context pruning is the second high-leverage change. Rather than carrying every tool output and step result forward indefinitely, summarize older steps into a compact intermediate state. A 500-token summary of steps 1 through 5 is cheaper to carry than the 3,000 tokens of raw outputs from those steps. Build a summarization step every few cycles that compresses old context before it becomes a large fraction of your input cost.
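A compaction step along these lines might look like the sketch below. The summarize callable is a placeholder: in a real pipeline it would typically be a call to a cheap model, and both it and compact_history are hypothetical names introduced for illustration.

```python
from typing import Callable, List


def compact_history(
    history: List[str],
    summarize: Callable[[List[str]], str],
    keep_recent: int = 2,
) -> List[str]:
    """Replace everything except the most recent entries with one summary entry."""
    split = max(len(history) - keep_recent, 0)
    if split == 0:
        return list(history)  # nothing old enough to compress
    return [summarize(history[:split])] + history[split:]


def placeholder_summarize(entries: List[str]) -> str:
    # Stand-in only: a real system would call a cheap model here.
    return f"[summary of {len(entries)} earlier steps]"


history = [f"raw output of step {i}" for i in range(1, 8)]
compacted = compact_history(history, placeholder_summarize)
print(compacted)
# ['[summary of 5 earlier steps]', 'raw output of step 6', 'raw output of step 7']
```

Keeping the last few steps verbatim is a design choice: the most recent outputs are usually the ones the next step depends on in detail, while older ones can tolerate lossy compression.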
Prompt caching for the system prompt deserves explicit mention in the agent context. The system prompt is the same on every step of every run. If it is 2,000 tokens, caching it means every step that hits the cache pays 0.10x the base rate for those tokens instead of 1.00x. On Sonnet 4.6, the uncached system prompt cost across 20 steps is 20 times 2,000 divided by 1,000,000 times $3.00, which equals $0.12 per run. With the cache warm, it is $0.012 per run. Over 1,000 runs per day, that is $108 per day in savings from one structural change, slightly less once you account for the premium charged on the requests that write the cache.
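The caching arithmetic can be checked with a short calculation. This is a sketch under stated assumptions: the 0.10x read multiplier comes from the text above, while the 1.25x cache-write multiplier is an assumption based on Anthropic's published short-lived cache pricing and should be verified against the current price list. The warm_start flag is a hypothetical parameter distinguishing a run that finds the prompt already cached from one that has to write it first.

```python
BASE_INPUT_PRICE_PER_MTOK = 3.00
CACHE_READ_MULTIPLIER = 0.10    # from the article
CACHE_WRITE_MULTIPLIER = 1.25   # assumption: verify against current pricing


def system_prompt_cost(steps, prompt_tokens, cached, warm_start=True):
    """Dollar cost of sending the system prompt on every step of one run."""
    per_step = prompt_tokens / 1_000_000 * BASE_INPUT_PRICE_PER_MTOK
    if not cached:
        return steps * per_step
    # A cold run pays one cache write, then reads on every remaining step.
    writes = 0 if warm_start else 1
    reads = steps - writes
    return per_step * (reads * CACHE_READ_MULTIPLIER + writes * CACHE_WRITE_MULTIPLIER)


uncached = system_prompt_cost(20, 2_000, cached=False)
warm = system_prompt_cost(20, 2_000, cached=True, warm_start=True)
cold = system_prompt_cost(20, 2_000, cached=True, warm_start=False)

print(f"uncached: ${uncached:.4f}")   # uncached: $0.1200
print(f"warm:     ${warm:.4f}")       # warm:     $0.0120
print(f"cold:     ${cold:.4f}")       # cold:     $0.0189
print(f"daily saving at 1,000 warm runs: ${(uncached - warm) * 1_000:.0f}")
```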
Setting explicit cost budgets per agent run is an operational discipline that most teams defer until they have been surprised by a bill. If you cap each run at $0.50, an agent that would have escalated to 40 steps terminates at around step 20 and returns a partial result. Partial results with low confidence are often preferable to silent cost overruns, and they prompt the right conversation about task design.
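A per-run budget can be enforced with a small guard that is charged before each step. The sketch below is illustrative only: RunBudget and BudgetExceeded are hypothetical names, the token counts passed to charge are estimates a real system would take from its own accounting, and the flat 5,000 input and 500 output tokens per step are the same simplifying assumption used in the earlier cost estimate.

```python
class BudgetExceeded(Exception):
    """Raised when the next step would push a run over its cost cap."""


class RunBudget:
    def __init__(self, max_dollars, input_price_per_mtok=3.00, output_price_per_mtok=15.00):
        self.max_dollars = max_dollars
        self.input_price = input_price_per_mtok
        self.output_price = output_price_per_mtok
        self.spent = 0.0

    def charge(self, input_tokens, output_tokens):
        """Record the estimated cost of a step, or refuse it if it breaks the cap."""
        cost = (input_tokens * self.input_price + output_tokens * self.output_price) / 1_000_000
        if self.spent + cost > self.max_dollars:
            raise BudgetExceeded(f"cap of ${self.max_dollars:.2f} reached at ${self.spent:.4f}")
        self.spent += cost


budget = RunBudget(max_dollars=0.50)
completed_steps = 0
try:
    for _ in range(40):  # the agent would otherwise run 40 steps
        budget.charge(input_tokens=5_000, output_tokens=500)
        completed_steps += 1  # a real loop would execute the step here
except BudgetExceeded:
    pass  # return the partial result gathered so far

print(completed_steps)          # 22
print(f"${budget.spent:.3f}")   # $0.495
```

With a flat $0.0225 per step the cap is reached after 22 steps; with context that grows each step, the cutoff arrives earlier.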
The fifth pattern is decomposing complex tasks into independent sub-agents that do not share context. A monolithic 20-step agent that keeps all context in one conversation is the most expensive architecture. If the task has logical segments that can be computed independently and then joined, running them as separate agents, each with its own shorter context, is cheaper than one long chain. The inputs to each sub-agent are smaller, the per-step context is bounded, and you can cache shared system context across agents more effectively.
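To see why bounded contexts are cheaper, compare one 20-step chain against four independent 5-step sub-agents plus a join step. The numbers below are assumptions for illustration: every agent starts from 1,000 tokens and grows by 600 tokens per step as in the earlier estimate, and each sub-agent hands a 500-token result to the join step.

```python
def cumulative_input_tokens(initial, growth_per_step, steps):
    """Total input tokens when step k (1-based) sends initial + (k - 1) * growth tokens."""
    return steps * initial + growth_per_step * steps * (steps - 1) // 2


# One monolithic chain that carries all context for 20 steps.
monolithic = cumulative_input_tokens(1_000, 600, 20)

# Four independent sub-agents of 5 steps each, then one step that joins
# their results (assumed: 1,000-token task plus four 500-token results).
sub_agents = 4 * cumulative_input_tokens(1_000, 600, 5)
join_step = 1_000 + 4 * 500
decomposed = sub_agents + join_step

print(monolithic)                         # 134000
print(decomposed)                         # 47000
print(round(decomposed / monolithic, 2))  # 0.35
```

Under these assumptions the decomposed design sends roughly a third of the input tokens, because no sub-agent's context ever grows past its own five steps.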
Connecting This to the Broader Cost Picture
Multi-agent cost management is ultimately an extension of the same principles that govern single-call cost optimization: fewer tokens per call, cheaper models for simpler steps, and caching for repeated content. The difference is that in agent systems, these principles need to be applied at the architectural level, not just the prompt level.
For teams building production agent systems, PromptUnit provides token-level visibility into each agent step, making it possible to identify which steps are the highest cost and whether they correlate with quality outcomes.
For further reading on the cost structure of tool use and function calling in agentic contexts, see the hidden cost of tool use and function calling. For a framework to decide when routing to a cheaper model within an agent step is safe, see the LLM model routing guide.
If your agent infrastructure is costing more than you planned, PromptUnit gives you per-step cost attribution to find out exactly where the tokens are going.