Documentation
Everything you need to integrate PromptUnit into your stack. Two-minute setup: swap your base URL, add two headers, change nothing else.
Getting Started
Quick Start
Integrate PromptUnit in two steps. No new SDKs, no refactoring — just change your base URL and add two headers.
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-openai-key",
    base_url="https://api.promptunit.ai/proxy/openai",
    default_headers={
        "x-promptunit-key": "pu_live_xxxxxxxxxxxx",
        "x-promptunit-feature": "customer-support",
    },
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```

```typescript
import OpenAI from "openai"

const openai = new OpenAI({
  apiKey: "your-openai-key",
  baseURL: "https://api.promptunit.ai/proxy/openai",
  defaultHeaders: {
    "x-promptunit-key": "pu_live_xxxxxxxxxxxx",
    "x-promptunit-feature": "customer-support",
  },
})

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
})
```

Authentication
Two headers authenticate your requests. The PromptUnit key identifies your account; the provider key is forwarded to the upstream model provider.
| Header | Required | Description |
|---|---|---|
| x-promptunit-key | Required | Your PromptUnit API key from the dashboard. Format: pu_live_... |
| x-api-key | Optional* | Your Anthropic API key (sk-ant-...). Omit if stored in the dashboard. |
| Authorization | Optional* | Your OpenAI API key as Bearer sk-.... Omit if stored in the dashboard. |
* Provider keys can be configured once in the PromptUnit dashboard instead of passing them per-request. Per-request keys always take precedence.
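The header rules above can be sketched as a small helper. This is a hypothetical utility, not part of the PromptUnit SDK: it assembles the per-request headers, with a provider key passed here taking precedence over one stored in the dashboard.

```typescript
// Hypothetical helper: builds the proxy headers described in the table above.
// Provider keys are optional; omit them if they are stored in the dashboard.
function buildProxyHeaders(opts: {
  promptUnitKey: string
  feature?: string
  openAIKey?: string // sent as Authorization: Bearer sk-...
  anthropicKey?: string // sent as x-api-key
}): Record<string, string> {
  const headers: Record<string, string> = {
    "x-promptunit-key": opts.promptUnitKey,
  }
  if (opts.feature) headers["x-promptunit-feature"] = opts.feature
  if (opts.openAIKey) headers["Authorization"] = `Bearer ${opts.openAIKey}`
  if (opts.anthropicKey) headers["x-api-key"] = opts.anthropicKey
  return headers
}
```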
Feature Tagging
The x-promptunit-feature header tags each request with the product feature that made the call. This powers per-feature cost breakdowns in your dashboard — the most valuable insight for understanding where your AI spend actually goes.
Pick descriptive, consistent kebab-case names, e.g. customer-support or summarization.
You can pass any string up to 64 characters. Tags appear immediately in the dashboard after the first tagged request is received.
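The tag constraints above can be checked client-side before a request is sent. These validators are illustrative, not part of the SDK; the 64-character limit comes from the text, while kebab-case is the suggested convention rather than a server requirement.

```typescript
// Hard limit from the docs: any non-empty string up to 64 characters.
function isValidFeatureTag(tag: string): boolean {
  return tag.length > 0 && tag.length <= 64
}

// Suggested convention: lowercase words separated by single hyphens.
function isKebabCase(tag: string): boolean {
  return /^[a-z0-9]+(-[a-z0-9]+)*$/.test(tag)
}
```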
OpenAI Integration
Base URL
Replace the OpenAI base URL with the PromptUnit proxy endpoint. Every other part of your SDK usage stays identical.
| Route | URL |
|---|---|
| Direct (before) | https://api.openai.com/v1 |
| Via PromptUnit | https://api.promptunit.ai/proxy/openai |
Supported endpoints
SDK Wrapper
The @promptunit/sdk package provides a typed wrapper that handles headers automatically. The API surface is identical to the official OpenAI SDK.
```typescript
import { PromptUnitOpenAI } from "@promptunit/sdk"

const client = new PromptUnitOpenAI({
  promptUnitKey: "pu_live_xxxxxxxxxxxx",
  openAIKey: "sk-...",
  feature: "customer-support",
})

// Exactly like the OpenAI SDK — no other changes needed
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
})
```

Anthropic Integration
Base URL
Point your Anthropic SDK at the PromptUnit proxy endpoint to get routing, compression, prompt caching optimization, and cost tracking.
| Route | URL |
|---|---|
| Direct (before) | https://api.anthropic.com/v1 |
| Via PromptUnit | https://api.promptunit.ai/proxy/anthropic |
Supported endpoints
SDK Wrapper
PromptUnitAnthropic wraps the official Anthropic SDK. Prompt cache optimization (Layer 13) is applied automatically when your system prompt is eligible.
```typescript
import { PromptUnitAnthropic } from "@promptunit/sdk"

const client = new PromptUnitAnthropic({
  promptUnitKey: "pu_live_xxxxxxxxxxxx",
  anthropicKey: "sk-ant-...",
  feature: "summarization",
})

const response = await client.messages.create({
  model: "claude-opus-4-5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Summarize this document." }],
})
```

Response Headers
Every proxied response is augmented with x-promptunit-* headers that expose routing decisions, costs, quality scores, and optimization results. Your application can read these headers directly; they are also ingested automatically into your dashboard.
| Header | Example | Description |
|---|---|---|
| x-promptunit-cost | 0.000420 | Actual cost in USD for this call |
| x-promptunit-latency | 340ms | End-to-end latency |
| x-promptunit-model-used | gpt-4o-mini | Model that actually ran |
| x-promptunit-model-requested | gpt-4o | Model the client asked for |
| x-promptunit-task-type | summarization | Detected task classification |
| x-promptunit-saving | 0.003200 | USD saved vs requested model |
| x-promptunit-routing | routed / passthrough / observation | Routing decision |
| x-promptunit-spam-score | 0.12 | 0 = clean, 1 = definite spam |
| x-promptunit-circuit-breaker | ok / anomaly / open | Budget guard status |
| x-promptunit-prompt-cache | hit:3 / miss | Anthropic prompt cache status |
| x-promptunit-compression | 1240 | Tokens saved by compression |
| x-promptunit-dialect | none / oai_to_claude | Dialect rules applied |
| x-promptunit-inflation | clean / token_stuffing | Inflation detection result |
| x-promptunit-output-verified | ok / truncated / refusal | Output quality check |
| x-promptunit-efficiency | 87 | Prompt efficiency score 0–100 |
| x-promptunit-efficiency-issues | verbose_preamble,filler_phrases | Detected prompt issues |
| x-promptunit-context-injected | date,recency | Auto-injected context fields |
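As a sketch of reading these headers in application code, the helper below pulls the cost-related fields off any object with a Headers-style get method (such as a fetch Response). The ProxyCallStats shape is an assumption for illustration; only the header names come from the table above.

```typescript
// Cost-related fields extracted from a proxied response.
interface ProxyCallStats {
  costUsd: number
  savingUsd: number
  modelUsed: string | null
  routing: string | null
}

// Works with anything exposing Headers#get, e.g. a fetch Response's headers.
function readProxyStats(headers: { get(name: string): string | null }): ProxyCallStats {
  return {
    costUsd: Number(headers.get("x-promptunit-cost") ?? "0"),
    savingUsd: Number(headers.get("x-promptunit-saving") ?? "0"),
    modelUsed: headers.get("x-promptunit-model-used"),
    routing: headers.get("x-promptunit-routing"),
  }
}
```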
Inferio™ Engine
Inferio™ is the inference optimization engine that processes every request before it reaches the upstream model provider. It operates as a transparent pipeline: your request goes in, an optimized request goes out to the best-fit model, and the original response format is returned to your application unchanged.
How Routing Works
For each request, Inferio™ computes a score across 10 dimensions.
The score vector is matched against a continuously-updated capability matrix of available models. The cheapest model whose capability profile meets your configured quality threshold receives the request. If no cheaper model qualifies, the originally requested model is used as a passthrough.
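The selection rule above can be sketched in a few lines. This is an illustrative model, not the engine itself: the 10-dimension score vector is collapsed into a single capabilityScore, and per-token pricing stands in for the real cost model.

```typescript
// Simplified stand-in for a model's entry in the capability matrix.
interface ModelProfile {
  name: string
  costPer1kTokens: number
  capabilityScore: number // stand-in for the 10-dimension vector match
}

// Pick the cheapest model meeting the quality threshold; if no cheaper model
// qualifies, fall back to the requested model as a passthrough.
function selectModel(
  requested: string,
  candidates: ModelProfile[],
  qualityThreshold: number,
): { model: string; routing: "routed" | "passthrough" } {
  const requestedProfile = candidates.find((m) => m.name === requested)
  const qualifying = candidates
    .filter((m) => m.capabilityScore >= qualityThreshold)
    .sort((a, b) => a.costPer1kTokens - b.costPer1kTokens)
  const best = qualifying[0]
  if (best && requestedProfile && best.costPer1kTokens < requestedProfile.costPer1kTokens) {
    return { model: best.name, routing: "routed" }
  }
  return { model: requested, routing: "passthrough" }
}
```

With a strict threshold no cheaper model qualifies and the request passes through unchanged, mirroring the routed / passthrough values of the x-promptunit-routing header.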
The 22 Layers
Inferio™ runs each request through up to 22 processing layers. Not all layers fire on every request — the pipeline is conditional based on request characteristics and your configuration.
- Multi-dimensional model selection across 10 scoring axes
- Shannon entropy + injection pattern detection
- Anthropic prompt caching — 10x cheaper reads for repeated system prompts
- TF-IDF compression for long conversations to reduce token overhead
- Detects wasteful prompt patterns, scores prompts 0–100
- Checks for refusals, truncation, and format mismatch
- Aggregate learnings across anonymized traffic (coming soon)
- Strips zero-width characters, detects repetition-based token attacks
- Auto-translates between OpenAI and Anthropic prompt formats
- Injects date, locale, and recency hints for better output accuracy
- Rolling spend windows, auto-downgrade, anomaly detection
Layer numbers reflect internal pipeline position. Non-sequential numbers indicate reserved slots for layers in private beta or an upcoming release.
Limits & Billing
PromptUnit charges only on verified savings — the measurable difference between what you would have paid and what you actually paid through optimized routing.
| | Value | Notes |
|---|---|---|
| Billing model | 20% of verified savings | Billed on the 1st of each month |
| Zero savings | $0 bill | If Inferio™ saves you nothing, you pay nothing |
| Default spend limit | $100 / hr · $500 / day | Configurable per project in the dashboard |
| Rate limits | Provider-parity | Same as your upstream provider's limits |
How savings are calculated
1. Each call records the requested model and the model that actually ran, logged in x-promptunit-model-requested and x-promptunit-model-used.
2. The cost delta (what you would have paid minus what you paid) is recorded in x-promptunit-saving per call.
3. Monthly savings are summed and auditable. You can export the full call log at any time from the dashboard.
4. 20% of the verified monthly savings total is invoiced. If the total is $0, no invoice is generated.
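The steps above reduce to simple arithmetic. A minimal sketch, assuming the per-call x-promptunit-saving values have already been collected for the month:

```typescript
// Sum the month's per-call savings and invoice 20% of the total.
// A total of zero (or less) produces no invoice, per the billing model above.
function monthlyInvoiceUsd(perCallSavingsUsd: number[]): number {
  const totalSavings = perCallSavingsUsd.reduce((sum, s) => sum + s, 0)
  if (totalSavings <= 0) return 0 // zero savings, zero bill
  return totalSavings * 0.2
}
```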
Ready to start?
Your first optimization in 5 minutes
No credit card required. If PromptUnit doesn't reduce your AI spend, you pay nothing.