
Documentation

Everything you need to integrate PromptUnit into your stack. Two-minute setup, no code changes beyond a base URL swap and two headers.

Getting Started

Quick Start

Integrate PromptUnit in two steps. No new SDKs, no refactoring — just change your base URL and add two headers.

1. Change your base URL and add your PromptUnit headers
python
from openai import OpenAI

client = OpenAI(
    api_key="your-openai-key",
    base_url="https://api.promptunit.ai/proxy/openai",
    default_headers={
        "x-promptunit-key": "pu_live_xxxxxxxxxxxx",
        "x-promptunit-feature": "customer-support",
    }
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
javascript / node.js
import OpenAI from "openai"

const openai = new OpenAI({
  apiKey: "your-openai-key",
  baseURL: "https://api.promptunit.ai/proxy/openai",
  defaultHeaders: {
    "x-promptunit-key": "pu_live_xxxxxxxxxxxx",
    "x-promptunit-feature": "customer-support",
  },
})

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
})
2. Every call now flows through Inferio™ — no other changes needed
During the first 14 days, Inferio™ runs in observation mode — it records what it would have done without changing your routing. You see the potential savings before any optimization is applied.
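During observation, each call's decision is still reported in the x-promptunit-* response headers (see Response Headers below). A minimal sketch of interpreting them, assuming the headers arrive as a plain dict of strings:

```python
# Sketch: read the routing mode and projected saving from PromptUnit
# response headers. Header names are from the Response Headers table;
# the sample dict below is illustrative data, not a live response.
def summarize_call(headers: dict) -> str:
    mode = headers.get("x-promptunit-routing", "passthrough")
    saving = float(headers.get("x-promptunit-saving", "0"))
    if mode == "observation":
        # Shadow mode: the requested model actually ran, so the
        # saving is projected rather than realized.
        return f"observation: would have saved ${saving:.6f}"
    return f"{mode}: saved ${saving:.6f}"

sample = {
    "x-promptunit-routing": "observation",
    "x-promptunit-saving": "0.003200",
}
print(summarize_call(sample))  # observation: would have saved $0.003200
```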

Authentication

Two headers authenticate your requests. The PromptUnit key identifies your account; the provider key is forwarded to the upstream model provider.

Header | Required | Description
x-promptunit-key | Required | Your PromptUnit API key from the dashboard. Format: pu_live_...
x-api-key | Optional* | Your Anthropic API key (sk-ant-...). Omit if stored in the dashboard.
Authorization | Optional* | Your OpenAI API key as Bearer sk-.... Omit if stored in the dashboard.

* Provider keys can be configured once in the PromptUnit dashboard instead of passing them per-request. Per-request keys always take precedence.
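The precedence rule amounts to a simple lookup: a provider key sent on the request wins over one stored in the dashboard. A sketch of that logic, with illustrative names (this is not the real server code):

```python
# Sketch of the documented precedence: per-request provider keys always
# take precedence over dashboard-stored keys. Names are illustrative.
def resolve_provider_key(dashboard_keys: dict, request_headers: dict, header: str):
    # Prefer the key on the request; fall back to the stored key.
    return request_headers.get(header) or dashboard_keys.get(header)

dashboard = {"Authorization": "Bearer sk-stored"}
request = {"Authorization": "Bearer sk-per-request"}
print(resolve_provider_key(dashboard, request, "Authorization"))
# Bearer sk-per-request
```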

Feature Tagging

The x-promptunit-feature header tags each request with the product feature that made the call. This powers per-feature cost breakdowns in your dashboard — the most valuable insight for understanding where your AI spend actually goes.

Feature tagging is optional but strongly recommended. Without it, all your traffic appears as a single unlabeled bucket in analytics.

Pick descriptive, consistent kebab-case names:

customer-support
summarization
code-review
onboarding
search
content-gen

You can pass any string up to 64 characters. Tags appear immediately in the dashboard after the first tagged request is received.
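A client-side check matching these constraints can be as small as a few lines. This sketch enforces the documented 64-character limit and warns on non-kebab-case names (a recommendation, not a requirement):

```python
import re

# Sketch of a client-side feature-tag check: any non-empty string up to
# 64 characters is accepted; kebab-case is recommended for consistency.
KEBAB = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def check_feature_tag(tag: str) -> bool:
    if not tag or len(tag) > 64:
        return False
    if not KEBAB.match(tag):
        print(f"warning: '{tag}' is not kebab-case; consider a consistent naming scheme")
    return True

assert check_feature_tag("customer-support")
assert not check_feature_tag("x" * 65)
```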

OpenAI Integration

Base URL

Replace the OpenAI base URL with the PromptUnit proxy endpoint. Every other part of your SDK usage stays identical.

Direct (before): https://api.openai.com/v1
Via PromptUnit: https://api.promptunit.ai/proxy/openai

Supported endpoints

POST /chat/completions — fully compatible, streaming supported

SDK Wrapper

The @promptunit/sdk package provides a typed wrapper that handles headers automatically. The API surface is identical to the official OpenAI SDK.

typescript
import { PromptUnitOpenAI } from "@promptunit/sdk"

const client = new PromptUnitOpenAI({
  promptUnitKey: "pu_live_xxxxxxxxxxxx",
  openAIKey: "sk-...",
  feature: "customer-support",
})

// Exactly like the OpenAI SDK — no other changes needed
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
})
The SDK wrapper is a thin layer over the standard OpenAI SDK. All parameters, return types, and streaming behavior are identical — only the base URL and authentication headers are managed for you.

Anthropic Integration

Base URL

Point your Anthropic SDK at the PromptUnit proxy endpoint to get routing, compression, prompt caching optimization, and cost tracking.

Direct (before): https://api.anthropic.com/v1
Via PromptUnit: https://api.promptunit.ai/proxy/anthropic

Supported endpoints

POST /messages — fully compatible, streaming supported

SDK Wrapper

PromptUnitAnthropic wraps the official Anthropic SDK. Prompt cache optimization (Layer 13) is applied automatically when your system prompt is eligible.

typescript
import { PromptUnitAnthropic } from "@promptunit/sdk"

const client = new PromptUnitAnthropic({
  promptUnitKey: "pu_live_xxxxxxxxxxxx",
  anthropicKey: "sk-ant-...",
  feature: "summarization",
})

const response = await client.messages.create({
  model: "claude-opus-4-5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Summarize this document." }],
})

Response Headers

Every proxied response is augmented with x-promptunit-* headers that expose routing decisions, costs, quality scores, and optimization results. Your application can read these headers directly, or they are automatically ingested into your dashboard.

Header | Example | Description
x-promptunit-cost | 0.000420 | Actual cost in USD for this call
x-promptunit-latency | 340ms | End-to-end latency
x-promptunit-model-used | gpt-4o-mini | Model that actually ran
x-promptunit-model-requested | gpt-4o | Model the client asked for
x-promptunit-task-type | summarization | Detected task classification
x-promptunit-saving | 0.003200 | USD saved vs requested model
x-promptunit-routing | routed / passthrough / observation | Routing decision
x-promptunit-spam-score | 0.12 | 0 = clean, 1 = definite spam
x-promptunit-circuit-breaker | ok / anomaly / open | Budget guard status
x-promptunit-prompt-cache | hit:3 / miss | Anthropic prompt cache status
x-promptunit-compression | 1240 | Tokens saved by compression
x-promptunit-dialect | none / oai_to_claude | Dialect rules applied
x-promptunit-inflation | clean / token_stuffing | Inflation detection result
x-promptunit-output-verified | ok / truncated / refusal | Output quality check
x-promptunit-efficiency | 87 | Prompt efficiency score 0–100
x-promptunit-efficiency-issues | verbose_preamble,filler_phrases | Detected prompt issues
x-promptunit-context-injected | date,recency | Auto-injected context fields
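Since every header value arrives as a string, a small parsing helper keeps downstream code typed. A sketch, assuming a plain dict of headers (the field names below mirror the table; the sample values are illustrative):

```python
# Sketch: coerce string-valued x-promptunit-* headers into typed fields.
# Header names come from the table above; parsing choices are assumptions.
def parse_promptunit_headers(h: dict) -> dict:
    return {
        "cost_usd": float(h.get("x-promptunit-cost", "0")),
        "saving_usd": float(h.get("x-promptunit-saving", "0")),
        "model_used": h.get("x-promptunit-model-used"),
        "routing": h.get("x-promptunit-routing", "passthrough"),
        "compression_tokens": int(h.get("x-promptunit-compression", "0")),
    }

sample = {
    "x-promptunit-cost": "0.000420",
    "x-promptunit-saving": "0.003200",
    "x-promptunit-model-used": "gpt-4o-mini",
    "x-promptunit-routing": "routed",
    "x-promptunit-compression": "1240",
}
parsed = parse_promptunit_headers(sample)
print(parsed["model_used"], parsed["saving_usd"])  # gpt-4o-mini 0.0032
```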

Inferio™ Engine

Inferio™ is the inference optimization engine that processes every request before it reaches the upstream model provider. It operates as a transparent pipeline: your request goes in, an optimized request goes out to the best-fit model, and the original response format is returned to your application unchanged.

How Routing Works

For each request, Inferio™ computes a score across 10 dimensions:

01. Task type
02. Complexity
03. Token count
04. Conversation depth
05. Domain
06. Output format
07. Stakes level
08. Language
09. Context length
10. Prior performance

The score vector is matched against a continuously-updated capability matrix of available models. The cheapest model whose capability profile meets your configured quality threshold receives the request. If no cheaper model qualifies, the originally requested model is used as a passthrough.
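The selection rule above can be sketched in a few lines: among models whose capability score meets the threshold, take the cheapest; if none qualify, pass through to the requested model. The capability values and costs below are made-up illustrations, not PromptUnit's actual matrix:

```python
# Sketch of the documented selection rule. Scores and prices are
# illustrative, not real capability-matrix data.
CAPABILITY = {  # model -> (capability score 0-1, cost per 1K tokens USD)
    "gpt-4o": (0.95, 0.0050),
    "gpt-4o-mini": (0.82, 0.0003),
}

def select_model(requested: str, required_quality: float) -> str:
    qualifying = [
        (cost, model)
        for model, (cap, cost) in CAPABILITY.items()
        if cap >= required_quality
    ]
    if not qualifying:
        return requested  # passthrough: no cheaper model meets the bar
    return min(qualifying)[1]  # cheapest qualifying model

print(select_model("gpt-4o", 0.80))  # gpt-4o-mini
print(select_model("gpt-4o", 0.90))  # gpt-4o
```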

During the 14-day observation period, routing runs in shadow mode — we record what we would have routed to without actually changing anything. You see projected savings in your dashboard before any live routing is enabled.

The 22 Layers

Inferio™ runs each request through up to 22 processing layers. Not all layers fire on every request — the pipeline is conditional based on request characteristics and your configuration.

1–10 · Smart Routing (Routing): Multi-dimensional model selection across 10 scoring axes
11 · Spam Filter (Security): Shannon entropy + injection pattern detection
13 · System Prompt Cache (Cost): Anthropic prompt caching — 10x cheaper reads for repeated system prompts
14 · Conversation Compression (Cost): TF-IDF compression for long conversations to reduce token overhead
15 · Prompt Efficiency Advisor (Quality): Detects wasteful prompt patterns, scores prompts 0–100
16 · Output Verification (Quality): Checks for refusals, truncation, and format mismatch
18 · Cross-Customer Pattern Mining (Intelligence): Aggregate learnings across anonymized traffic (coming soon)
19 · Token Inflation Defense (Security): Strips zero-width characters, detects repetition-based token attacks
20 · Dialect Translation (Compatibility): Auto-translates between OpenAI and Anthropic prompt formats
21 · Context Grounding (Quality): Injects date, locale, and recency hints for better output accuracy
22 · Circuit Breaker (Safety): Rolling spend windows, auto-downgrade, anomaly detection
Layer numbers reflect internal pipeline position. Non-sequential numbers indicate reserved slots for layers in private beta or upcoming release.
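To make the spam filter (Layer 11) concrete, here is a sketch of the Shannon-entropy half of that check. Degenerate repetition scores near zero bits per character, while organic text lands in a mid range; the interpretation thresholds would be an implementation detail and are not specified here:

```python
import math
from collections import Counter

# Sketch of the Shannon-entropy signal used by the spam filter:
# bits of information per character of the prompt text.
def shannon_entropy(text: str) -> float:
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(shannon_entropy("aaaaaaaa"))  # 0.0 (pure repetition)
print(round(shannon_entropy("Hello, how do I reset my password?"), 2))
```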

Limits & Billing

PromptUnit charges only on verified savings — the measurable difference between what you would have paid and what you actually paid through optimized routing.

Billing model: 20% of verified savings, billed on the 1st of each month.

Zero savings: $0 bill. If Inferio™ saves you nothing, you pay nothing.

Default spend limit: $100 / hr · $500 / day, configurable per project in the dashboard.

Rate limits: provider parity; same as your upstream provider's limits.

How savings are calculated

  1. Each call records the requested model and the model that actually ran, logged in x-promptunit-model-requested and x-promptunit-model-used.
  2. The cost delta — what you would have paid minus what you paid — is recorded in x-promptunit-saving per call.
  3. Monthly savings are summed and auditable. You can export the full call log at any time from the dashboard.
  4. 20% of the verified monthly savings total is invoiced. If the total is $0, no invoice is generated.
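The invoice math above reduces to a one-line aggregation over the exported call log. A sketch with sample data (field names mirror the response headers; the numbers are illustrative):

```python
# Sketch of the monthly invoice calculation over an exported call log.
# Field names mirror the x-promptunit-* headers; values are sample data.
calls = [
    {"requested": "gpt-4o", "used": "gpt-4o-mini", "saving": 0.0032},
    {"requested": "gpt-4o", "used": "gpt-4o", "saving": 0.0},
    {"requested": "gpt-4o", "used": "gpt-4o-mini", "saving": 0.0028},
]

verified_savings = sum(c["saving"] for c in calls)
# 20% of verified savings; a $0 total produces no invoice.
invoice = round(0.20 * verified_savings, 6) if verified_savings > 0 else 0.0

print(f"savings=${verified_savings:.4f} invoice=${invoice:.4f}")
# savings=$0.0060 invoice=$0.0012
```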
Spend limits are enforced at the proxy layer. When a rolling window limit is breached, the circuit breaker (Layer 22) activates and may auto-downgrade requests to a cheaper model rather than rejecting them, depending on your configuration.

Ready to start?

Your first optimization in 5 minutes

No credit card required. If PromptUnit doesn't reduce your AI spend, you pay nothing.