Documentation
Everything you need to integrate PromptUnit into your stack. Two-minute setup, with no code changes beyond swapping the base URL and adding two headers.
Getting Started
Quick Start
Integrate PromptUnit in two steps. No new SDKs, no refactoring: just change your base URL and add two headers.
# Python: send OpenAI SDK traffic through the PromptUnit proxy
from openai import OpenAI
client = OpenAI(
    api_key="your-openai-key",
    base_url="https://api.promptunit.ai/api/proxy/openai",
    default_headers={
        "x-promptunit-key": "pu_live_xxxxxxxxxxxx",
        "x-promptunit-feature": "customer-support",
    },
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
// Node.js: the same setup with the openai package
import OpenAI from "openai"
const openai = new OpenAI({
  apiKey: "your-openai-key",
  baseURL: "https://api.promptunit.ai/api/proxy/openai",
  defaultHeaders: {
    "x-promptunit-key": "pu_live_xxxxxxxxxxxx",
    "x-promptunit-feature": "customer-support",
  },
})
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
})
Authentication
Two headers authenticate your requests. The PromptUnit key identifies your account; the provider key is forwarded to the upstream model provider.
| Header | Required | Description |
|---|---|---|
| x-promptunit-key | Required | Your PromptUnit API key from the dashboard. Format: pu_live_... |
| x-api-key | Optional* | Your Anthropic API key (sk-ant-...). Omit if stored in the dashboard. |
| Authorization | Optional* | Your OpenAI API key as Bearer sk-.... Omit if stored in the dashboard. |
* Provider keys can be configured once in the PromptUnit dashboard instead of passing them per-request. Per-request keys always take precedence.
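The header rules in the table can be sketched as a small helper. This is an illustration only: the function name `build_promptunit_headers` and its parameters are hypothetical and not part of any PromptUnit SDK. The header names and the `Bearer` format come from the table above.

```python
from typing import Dict, Optional


def build_promptunit_headers(
    promptunit_key: str,
    feature: Optional[str] = None,
    openai_key: Optional[str] = None,
    anthropic_key: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble proxy request headers (hypothetical helper, not an SDK function)."""
    # The PromptUnit key is the only header the table marks as required.
    headers = {"x-promptunit-key": promptunit_key}
    if feature is not None:
        headers["x-promptunit-feature"] = feature
    # Provider keys are optional: leave them out when they are stored in the
    # dashboard. When present, the per-request key takes precedence.
    if openai_key is not None:
        headers["Authorization"] = f"Bearer {openai_key}"
    if anthropic_key is not None:
        headers["x-api-key"] = anthropic_key
    return headers
```

Calling `build_promptunit_headers("pu_live_xxxxxxxxxxxx", feature="customer-support")` yields only the two PromptUnit headers, which is the case where the provider key lives in the dashboard.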
Feature Tagging
The x-promptunit-feature header tags each request with the product feature that made the call. This powers per-feature cost breakdowns in your dashboard, the most valuable insight for understanding where your AI spend actually goes.
Pick descriptive, consistent kebab-case names:
You can pass any string up to 64 characters. Tags appear immediately in the dashboard after the first tagged request is received.
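If you want to enforce the naming convention in your own codebase, a small lint-style check is one option. The sketch below is an assumption-laden example: `check_feature_tag` is a hypothetical helper, and the kebab-case rule is a local style check, since the service itself is described as accepting any string up to 64 characters.

```python
import re
from typing import List

MAX_TAG_LENGTH = 64  # limit stated in the text above
# Lowercase words or digits joined by single hyphens, e.g. "customer-support".
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def check_feature_tag(tag: str) -> List[str]:
    """Return a list of problems with a feature tag; an empty list means OK."""
    problems = []
    if len(tag) > MAX_TAG_LENGTH:
        problems.append(f"tag is longer than {MAX_TAG_LENGTH} characters")
    # Style check only: this is a team convention, not a documented API rule.
    if not KEBAB_CASE.fullmatch(tag):
        problems.append("tag is not kebab-case")
    return problems
```

Tags used elsewhere on this page, such as `customer-support` and `summarization`, pass both checks.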
OpenAI Integration
Base URL
Replace the OpenAI base URL with the PromptUnit proxy endpoint. Every other part of your SDK usage stays identical.
| | URL |
|---|---|
| Direct (before) | https://api.openai.com/v1 |
| Via PromptUnit | https://api.promptunit.ai/api/proxy/openai |
Supported endpoints
SDK Wrapper
The @promptunit/sdk package provides a typed wrapper that handles headers automatically. The API surface is identical to the official OpenAI SDK.
import { PromptUnitOpenAI } from "@promptunit/sdk"
const client = new PromptUnitOpenAI({
  promptUnitKey: "pu_live_xxxxxxxxxxxx",
  openAIKey: "sk-...",
  feature: "customer-support",
})
// Same call shape as the OpenAI SDK; no other changes needed
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
})
Anthropic Integration
Base URL
Point your Anthropic SDK at the PromptUnit proxy endpoint to get routing, compression, prompt caching optimization, and cost tracking.
| | URL |
|---|---|
| Direct (before) | https://api.anthropic.com/v1 |
| Via PromptUnit | https://api.promptunit.ai/proxy/anthropic |
Supported endpoints
SDK Wrapper
PromptUnitAnthropic wraps the official Anthropic SDK. Prompt cache optimization (Layer 13) is applied automatically when your system prompt is eligible.
import { PromptUnitAnthropic } from "@promptunit/sdk"
const client = new PromptUnitAnthropic({
  promptUnitKey: "pu_live_xxxxxxxxxxxx",
  anthropicKey: "sk-ant-...",
  feature: "summarization",
})
const response = await client.messages.create({
  model: "claude-opus-4-5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Summarize this document." }],
})
Response Headers
Every proxied response is augmented with x-promptunit-* headers that expose routing decisions, costs, quality scores, and optimization results. Your application can read these headers directly; they are also ingested automatically into your dashboard.
| Header | Example | Description |
|---|---|---|
| x-promptunit-cost | 0.000420 | Actual cost in USD for this call |
| x-promptunit-latency | 340ms | End-to-end latency |
| x-promptunit-model-used | gpt-4o-mini | Model that actually ran |
| x-promptunit-model-requested | gpt-4o | Model the client asked for |
| x-promptunit-task-type | summarization | Detected task classification |
| x-promptunit-saving | 0.003200 | USD saved vs requested model |
| x-promptunit-routing | routed / passthrough / observation | Routing decision |
| x-promptunit-spam-score | 0.12 | 0 = clean, 1 = definite spam |
| x-promptunit-circuit-breaker | ok / anomaly / open | Budget guard status |
| x-promptunit-prompt-cache | hit:3 / miss | Anthropic prompt cache status |
| x-promptunit-compression | 1240 | Tokens saved by compression |
| x-promptunit-dialect | none / oai_to_claude | Dialect rules applied |
| x-promptunit-inflation | clean / token_stuffing | Inflation detection result |
| x-promptunit-output-verified | ok / truncated / refusal | Output quality check |
| x-promptunit-efficiency | 87 | Prompt efficiency score 0–100 |
| x-promptunit-efficiency-issues | verbose_preamble,filler_phrases | Detected prompt issues |
| x-promptunit-context-injected | date,recency | Auto-injected context fields |
| x-promptunit-cache | semantic-hit / miss | Semantic cache result (Layer 24) |
| x-promptunit-complexity | simple / medium / complex / reasoning-heavy | Detected prompt complexity (Layer 23) |
| x-promptunit-complexity-score | 45 | Raw complexity score 0–100 |
| x-promptunit-high-stakes | false / medical / legal / financial | High-stakes detection result (Layer 25) |
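Reading a few of these headers in application code might look like the sketch below. The `CallSummary` type and `summarize_headers` function are hypothetical names for illustration. The value formats (a decimal USD string, a latency with an `ms` suffix) are taken from the example column and are assumed to hold for every response.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CallSummary:
    cost_usd: Optional[float]
    saving_usd: Optional[float]
    latency_ms: Optional[int]
    model_requested: Optional[str]
    model_used: Optional[str]
    was_rerouted: bool


def summarize_headers(headers: Mapping[str, str]) -> CallSummary:
    """Parse a subset of x-promptunit-* headers (formats assumed from the table)."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {name.lower(): value.strip() for name, value in headers.items()}

    def as_float(name: str) -> Optional[float]:
        value = h.get(name)
        return float(value) if value else None

    latency = h.get("x-promptunit-latency")
    latency_ms = None
    if latency:
        # The example value is "340ms"; tolerate a bare number as well.
        latency_ms = int(latency[:-2] if latency.endswith("ms") else latency)

    requested = h.get("x-promptunit-model-requested")
    used = h.get("x-promptunit-model-used")
    return CallSummary(
        cost_usd=as_float("x-promptunit-cost"),
        saving_usd=as_float("x-promptunit-saving"),
        latency_ms=latency_ms,
        model_requested=requested,
        model_used=used,
        was_rerouted=bool(requested and used and requested != used),
    )
```

With the official OpenAI Python SDK, response headers are available through `client.chat.completions.with_raw_response.create(...)`, which returns an object exposing `.headers` alongside `.parse()` for the usual completion. Passing those headers to a function like this one is a reasonable way to log per-call cost next to your own metrics.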
Inferio™ Engine
Inferio™ is the inference optimization engine that processes every request before it reaches the upstream model provider. It operates as a transparent pipeline: your request goes in, an optimized request goes out to the best-fit model, and the original response format is returned to your application unchanged.
How Routing Works
For each request, Inferio™ computes a score across 10 dimensions:
The score vector is matched against a continuously-updated capability matrix of available models. The cheapest model whose capability profile meets your configured quality threshold receives the request. If no cheaper model qualifies, the originally requested model is used as a passthrough.
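The selection rule described above (cheapest qualifying model, otherwise passthrough) can be illustrated with a deliberately simplified sketch. It is not Inferio™'s implementation: the 10-dimension score vector is collapsed into a single made-up `quality` number, and the model names, prices, and the `pick_model` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_1k_tokens: float  # made-up price, for illustration only
    quality: float             # made-up capability score in [0, 1]


def pick_model(
    requested: ModelProfile,
    candidates: List[ModelProfile],
    quality_threshold: float,
) -> ModelProfile:
    """Return the cheapest candidate that meets the threshold, else the requested model."""
    cheaper_and_good_enough = [
        m for m in candidates
        if m.quality >= quality_threshold
        and m.cost_per_1k_tokens < requested.cost_per_1k_tokens
    ]
    if not cheaper_and_good_enough:
        return requested  # passthrough: no cheaper model qualifies
    return min(cheaper_and_good_enough, key=lambda m: m.cost_per_1k_tokens)


large = ModelProfile("large", cost_per_1k_tokens=10.0, quality=0.95)
medium = ModelProfile("medium", cost_per_1k_tokens=2.0, quality=0.85)
small = ModelProfile("small", cost_per_1k_tokens=0.5, quality=0.60)
catalog = [large, medium, small]
```

With this toy catalog, a threshold of 0.8 routes a request for `large` to `medium`, while a threshold of 0.9 leaves it on `large` as a passthrough.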
The 27 Layers
Inferio™ runs each request through up to 27 processing layers. Not all layers fire on every request. The pipeline is conditional based on request characteristics and your configuration. Layers 23-27 are the advanced intelligence algorithms that compound in value as traffic grows.
Multi-dimensional model selection across 10 scoring axes
Shannon entropy + injection pattern detection
Detects identical requests within a rolling window and returns cached responses, preventing duplicate spend on retried or parallel calls.
Anthropic prompt caching: 10x cheaper reads for repeated system prompts
TF-IDF compression for long conversations to reduce token overhead
Detects wasteful prompt patterns, scores prompts 0–100
Checks for refusals, truncation, and format mismatch
Records per-provider, per-model latency on every call. Feeds routing decisions when two models score equally on quality.
Aggregate learnings across anonymized traffic (coming soon)
Strips zero-width characters, detects repetition-based token attacks
Auto-translates between OpenAI and Anthropic prompt formats
Injects date, locale, and recency hints for better output accuracy
Rolling spend windows, auto-downgrade, anomaly detection
Scores every prompt across 8 complexity signals (including reasoning depth, constraint count, word count, and code indicators) and recommends the minimum viable model. Routes simple requests to cheap models without burning tokens on the routing decision itself.
Fingerprints incoming requests using normalized content hashing. Returns a cached response when an equivalent request was seen recently, avoiding redundant API calls entirely. Hit rate grows with traffic volume.
Detects high-stakes requests (medical, legal, financial, infrastructure) and runs dual cheap-model verification before responding. If both models agree, returns the consensus. If they diverge, escalates to a flagship model. The goal is flagship-level quality at cheap-model cost for the majority of high-stakes requests.
Aggregates anonymized quality signals across all traffic to build a per-model, per-task-type performance index. Surfaces real-world quality benchmarks trained on millions of requests. Gets more accurate with every call across the platform.
Watches implicit feedback signals (retry patterns, follow-up corrections, session drop-off) to automatically adjust each organization's quality threshold over time. The system learns your team's real quality bar without any manual configuration.
Layer numbers reflect internal pipeline position. Non-sequential numbers indicate reserved slots for layers in private beta or upcoming release.
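To make "normalized content hashing" from the semantic cache layer more concrete, here is one way such a fingerprint could be computed. The normalization steps (collapsing whitespace, lowercasing) and the `fingerprint` function are assumptions chosen for illustration; the actual normalization Inferio™ applies is not described here.

```python
import hashlib
import json
import re
from typing import Dict, List


def fingerprint(model: str, messages: List[Dict[str, str]]) -> str:
    """Hash a request so that trivially different copies map to the same key."""
    normalized = [
        {
            "role": m["role"],
            # Assumed normalization: collapse whitespace runs, trim, lowercase.
            "content": re.sub(r"\s+", " ", m["content"]).strip().lower(),
        }
        for m in messages
    ]
    # Sorted keys and fixed separators keep the serialization deterministic.
    payload = json.dumps(
        {"model": model, "messages": normalized},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two requests that differ only in spacing or letter case produce the same key and could share a cached response. Lowercasing is an aggressive choice that would be wrong for case-sensitive prompts, which is one reason to treat this as a sketch and not a recipe.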
Logging Modes
PromptUnit stores per-request metadata to power your dashboard analytics. You control how much metadata is retained. Switch modes at any time from your dashboard settings; changes take effect on the next request.
Standard (default)
- Token counts (input + output)
- Cost (actual + would-have-been)
- Model used + model requested
- Task type classification
- Feature tag (x-promptunit-feature)
- Efficiency score
- Latency
Privacy
- Token counts (input + output)
- Cost (actual + would-have-been)
- Model used + model requested
- Task type classification, not stored
- Feature tag, not stored
- Efficiency score, not stored
Routing is unaffected by logging mode. Classification and model selection happen in memory during the request. Logging mode only controls what gets written to the database after the response is returned.
In Privacy mode, the Spend by Feature breakdown in your dashboard will show Unknown for feature names and the task type column in your logs will be empty. Cost and savings data remain fully accurate.
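The difference between the two modes amounts to a field filter applied when the log record is written. The sketch below uses hypothetical field names and a hypothetical `fields_to_store` function. Latency does not appear in the Privacy list above, so whether it is retained is not stated; this sketch keeps it, which is an assumption.

```python
from typing import Any, Dict

# Hypothetical field names mirroring the Standard list above.
STANDARD_FIELDS = {
    "input_tokens", "output_tokens",
    "cost_actual", "cost_would_have_been",
    "model_used", "model_requested",
    "task_type", "feature", "efficiency_score",
    "latency_ms",  # assumption: retained in both modes
}
# Fields the Privacy list marks as "not stored".
PRIVACY_DROPPED = {"task_type", "feature", "efficiency_score"}


def fields_to_store(record: Dict[str, Any], mode: str) -> Dict[str, Any]:
    """Filter a per-request record down to what the logging mode retains."""
    if mode == "standard":
        allowed = STANDARD_FIELDS
    elif mode == "privacy":
        allowed = STANDARD_FIELDS - PRIVACY_DROPPED
    else:
        raise ValueError(f"unknown logging mode: {mode!r}")
    return {key: value for key, value in record.items() if key in allowed}
```

Because the filter runs only at write time, the full record is still available in memory for routing, which matches the statement that routing is unaffected by logging mode.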
GitHub Action
The PromptUnit AI Cost Analyzer is a GitHub Action that scans every pull request for expensive AI model usage and posts a comment with routing savings estimates. It runs automatically on PR open and sync with no build step required.
What it does
- Scans added lines in the PR diff for GPT-4o, Claude Opus, Gemini 2.5 Pro and other expensive models
- Skips PRs that already use PromptUnit (no duplicate comments)
- Posts a cost comparison table showing what routing would save
- Updates the existing comment on re-push rather than duplicating it
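The scanning step can be pictured with a short sketch. The action itself runs on Node.js and its detection rules are not shown here, so everything below is illustrative: the regular expressions, the way an existing PromptUnit integration is detected, and the `scan_diff` function are assumptions, and short names such as o1 and o3 are left out because matching them safely needs more context.

```python
import re
from typing import List, Tuple

# Illustrative patterns only. "gpt-4o-mini" is excluded via a negative lookahead.
EXPENSIVE_MODEL = re.compile(
    r"gpt-4o(?!-mini)|gpt-4-turbo|claude-opus|gemini-2\.5-pro",
    re.IGNORECASE,
)
# Assumed signal that the PR already routes through PromptUnit.
USES_PROMPTUNIT = re.compile(r"api\.promptunit\.ai|@promptunit/sdk", re.IGNORECASE)


def scan_diff(diff_text: str) -> List[Tuple[str, str]]:
    """Return (model, line) pairs for added lines that mention an expensive model."""
    # Added lines start with "+"; "+++" marks the file header, not content.
    added = [
        line[1:]
        for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    if any(USES_PROMPTUNIT.search(line) for line in added):
        return []  # already integrated: nothing to report
    hits = []
    for line in added:
        match = EXPENSIVE_MODEL.search(line)
        if match:
            hits.append((match.group(0).lower(), line.strip()))
    return hits
```

Only added lines are inspected, so a model name that appears in removed or unchanged context lines does not trigger a comment.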
Add to your repo
Create .github/workflows/ai-cost-analyzer.yml in your repository:
name: AI Cost Analyzer
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: promptunit/sdk@main
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
That's the entire setup. The action uses only the standard GITHUB_TOKEN; no extra secrets are required.
| Item | Value | Notes |
|---|---|---|
| Setup time | 2 minutes | Paste the YAML, done |
| Dependencies | Zero | No npm install, no build step |
| Permissions needed | pull-requests: write | Read PR diff, post comment |
The action is self-contained and uses only Node.js built-in modules. It detects usage of GPT-4o, GPT-4 Turbo, o1, o3, Claude Opus, and Gemini 2.5 Pro in PR diffs. If any expensive model is found and PromptUnit is not already integrated, it posts a comment showing the one-line SDK change and a savings projection.
Source and full documentation are available at github.com/promptunit/sdk.
Limits & Billing
PromptUnit charges only on verified savings: the measurable difference between what you would have paid and what you actually paid through optimized routing.
| Item | Value | Notes |
|---|---|---|
| Billing model | 20% of verified savings | Billed monthly. $1 setup fee credited to first invoice. |
| Zero savings | $0 bill | If Inferio™ saves you nothing, you pay nothing |
| Default spend limit | $100 / hr · $500 / day | Configurable per project in the dashboard |
| Rate limits | Provider-parity | Same as your upstream provider's limits |
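For intuition about how an hourly or daily spend limit behaves, the following sketch tracks spend in a rolling time window. It is a generic rolling-window budget check, not PromptUnit's budget guard: the `RollingSpendLimit` class is hypothetical, and what the service does when a limit is reached is not covered here.

```python
from collections import deque
from typing import Deque, Tuple


class RollingSpendLimit:
    """Track spend inside a sliding time window (generic sketch)."""

    def __init__(self, limit_usd: float, window_seconds: float) -> None:
        self.limit_usd = limit_usd
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp, cost)
        self._total = 0.0

    def _expire(self, now: float) -> None:
        # Drop spend that has aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, cost = self._events.popleft()
            self._total -= cost

    def would_exceed(self, cost: float, now: float) -> bool:
        """True if adding this cost would push the window total over the limit."""
        self._expire(now)
        return self._total + cost > self.limit_usd

    def record(self, cost: float, now: float) -> None:
        self._expire(now)
        self._events.append((now, cost))
        self._total += cost


# The default hourly limit from the table: $100 per rolling hour.
hourly = RollingSpendLimit(limit_usd=100.0, window_seconds=3600)
hourly.record(60.0, now=0.0)
```

After recording $60 at time 0, a further $50 call ten seconds later would exceed the hourly limit, while the same call an hour later would not, because the earlier spend has left the window.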
How savings are calculated
1. Each call records the requested model and the model that actually ran, logged in x-promptunit-model-requested and x-promptunit-model-used.
2. The cost delta (what you would have paid minus what you paid) is recorded in x-promptunit-saving per call.
3. Monthly savings are summed and auditable. You can export the full call log at any time from the dashboard.
4. 20% of the verified monthly savings total is invoiced. If the total is $0, no invoice is generated.
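The arithmetic in these steps can be checked against an exported call log with a few lines of code. This is a sketch under stated assumptions: `monthly_invoice` is a hypothetical function, per-call savings are assumed to arrive as decimal strings like the x-promptunit-saving example, rounding to cents is an assumption, and the $1 setup fee credit is not modeled.

```python
from decimal import Decimal
from typing import Iterable

FEE_RATE = Decimal("0.20")  # "20% of verified savings"


def monthly_invoice(per_call_savings: Iterable[str]) -> Decimal:
    """Sum per-call savings and return the fee owed for the month in USD."""
    # Decimal avoids float drift when summing many small amounts.
    total_savings = sum((Decimal(s) for s in per_call_savings), Decimal("0"))
    if total_savings <= 0:
        return Decimal("0")  # zero savings: no invoice
    return (total_savings * FEE_RATE).quantize(Decimal("0.01"))
```

For example, 1,000 calls that each saved $0.003200 add up to $3.20 in savings, so the fee under this sketch is $0.64.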
Ready to start?
Your first optimization in 5 minutes
14-day observation period. If PromptUnit doesn't reduce your AI spend, you pay nothing.