Documentation

Everything you need to integrate PromptUnit into your stack. Setup takes about two minutes and requires no code changes beyond swapping the base URL and adding two headers.

Getting Started

Quick Start

Integrate PromptUnit in two steps. No new SDKs, no refactoring: just change your base URL and add two headers.

1. Change your base URL and add your PromptUnit headers

python

from openai import OpenAI

client = OpenAI(
    api_key="your-openai-key",
    base_url="https://api.promptunit.ai/api/proxy/openai",
    default_headers={
        "x-promptunit-key": "pu_live_xxxxxxxxxxxx",
        "x-promptunit-feature": "customer-support",
    }
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)

javascript / node.js

import OpenAI from "openai"

const openai = new OpenAI({
  apiKey: "your-openai-key",
  baseURL: "https://api.promptunit.ai/api/proxy/openai",
  defaultHeaders: {
    "x-promptunit-key": "pu_live_xxxxxxxxxxxx",
    "x-promptunit-feature": "customer-support",
  },
})

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
})

2. Every call now flows through Inferio™; no other changes are needed

During the first 14 days, Inferio™ runs in observation mode: it records what it would have done without changing your routing. You see the potential savings before any optimization is applied.

Authentication

Two headers authenticate your requests. The PromptUnit key identifies your account; the provider key is forwarded to the upstream model provider.

Header	Required	Description
x-promptunit-key	Required	Your PromptUnit API key from the dashboard. Format: `pu_live_...`
x-api-key	Optional*	Your Anthropic API key (`sk-ant-...`). Omit if stored in the dashboard.
Authorization	Optional*	Your OpenAI API key as `Bearer sk-...`. Omit if stored in the dashboard.

* Provider keys can be configured once in the PromptUnit dashboard instead of passing them per-request. Per-request keys always take precedence.
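
As a minimal sketch of the header rules in the table above, the helper below assembles a header dictionary for one request. The function name and its parameters are hypothetical; only the header names and the `Bearer` prefix come from the table.

```python
from typing import Optional


def build_promptunit_headers(
    promptunit_key: str,
    feature: Optional[str] = None,
    anthropic_key: Optional[str] = None,
    openai_key: Optional[str] = None,
) -> dict[str, str]:
    """Assemble request headers (hypothetical helper, not part of any SDK)."""
    # The PromptUnit key is the only header marked Required.
    headers = {"x-promptunit-key": promptunit_key}
    if feature is not None:
        headers["x-promptunit-feature"] = feature
    # Provider keys are optional per request; omit them when they are
    # stored in the dashboard.
    if anthropic_key is not None:
        headers["x-api-key"] = anthropic_key
    if openai_key is not None:
        headers["Authorization"] = f"Bearer {openai_key}"
    return headers
```

Leaving both provider arguments unset produces a request that relies on the keys stored in the dashboard; passing one sends that key with the request, and per-request keys take precedence over stored ones.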

Feature Tagging

The x-promptunit-feature header tags each request with the product feature that made the call. This powers per-feature cost breakdowns in your dashboard, the most valuable insight for understanding where your AI spend actually goes.

Feature tagging is optional but strongly recommended. Without it, all your traffic appears as a single unlabeled bucket in analytics.

Pick descriptive, consistent kebab-case names:

customer-support

summarization

code-review

onboarding

content-gen

You can pass any string up to 64 characters. Tags appear immediately in the dashboard after the first tagged request is received.
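
To keep tags consistent across a codebase, a small client-side check can run before a tag is sent. The sketch below is an illustration only: the 64-character limit comes from the text above, while the function name and the exact kebab-case pattern are assumptions, and kebab-case is a recommendation rather than a documented requirement.

```python
import re

MAX_FEATURE_TAG_LENGTH = 64  # limit stated in the text above
# Assumed definition of kebab-case: lowercase alphanumeric words joined by "-".
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def check_feature_tag(tag: str) -> list[str]:
    """Return a list of problems with a feature tag (empty list means OK)."""
    problems = []
    if not tag:
        problems.append("tag is empty")
    if len(tag) > MAX_FEATURE_TAG_LENGTH:
        problems.append(f"tag is longer than {MAX_FEATURE_TAG_LENGTH} characters")
    if tag and not KEBAB_CASE.fullmatch(tag):
        problems.append("tag is not kebab-case (recommended, not required)")
    return problems
```

For example, `check_feature_tag("customer-support")` returns an empty list, while `check_feature_tag("Customer Support")` reports the kebab-case problem.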

Supported Providers

PromptUnit routes across 10 providers. All use the OpenAI-compatible wire format, so you connect once and the system routes automatically. Add provider keys in the dashboard under API Keys.

Provider	Models	Price range	Key format
OpenAI	gpt-4o, gpt-4o-mini, gpt-5.4 series	$0.15–$180/MTok	sk-proj-...
Anthropic	claude-opus-4, claude-sonnet-4, claude-haiku-4	$1–$75/MTok	sk-ant-...
Google	gemini-2.5-pro/flash, gemini-2.0-flash	$0.075–$10/MTok	AIza...
Groq	llama-3.3-70b, qwen3-32b, llama-3.1-8b	$0.05–$0.79/MTok	gsk_...
DeepSeek	deepseek-v4-pro, deepseek-v4-flash	$0.35–$3.48/MTok	sk-...
Mistral	mistral-large-latest, mistral-small-latest, codestral	$0.10–$6/MTok	any string
Together AI	Llama-3.3-70B-Instruct-Turbo, Llama-3.1-8B-Instruct-Turbo	$0.18–$0.88/MTok	any string
Perplexity	sonar-pro, sonar (web-augmented)	$1–$15/MTok	pplx-...
xAI	grok-3, grok-3-mini	$0.50–$15/MTok	xai-...
Cohere	command-r-plus, command-r, command-r7b	$0.15–$10/MTok	any string

CPR (Cross-Provider Routing) evaluates all connected providers on every call and routes to the globally cheapest model that clears your quality threshold. You get routing across your entire provider stack with zero extra code.

OpenAI Integration

Base URL

Replace the OpenAI base URL with the PromptUnit proxy endpoint. Every other part of your SDK usage stays identical.

	URL
Direct (before)	https://api.openai.com/v1
Via PromptUnit	https://api.promptunit.ai/api/proxy/openai

Supported endpoints

POST /chat/completions: fully compatible, streaming supported

SDK Wrapper

The @promptunit/sdk package provides a typed wrapper that handles headers automatically. The API surface is identical to the official OpenAI SDK.

typescript

import { PromptUnitOpenAI } from "@promptunit/sdk"

const client = new PromptUnitOpenAI({
  promptUnitKey: "pu_live_xxxxxxxxxxxx",
  openAIKey: "sk-...",
  feature: "customer-support",
})

// Exactly like the OpenAI SDK; no other changes needed
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
})

The SDK wrapper is a thin layer over the standard OpenAI SDK. All parameters, return types, and streaming behavior are identical; only the base URL and authentication headers are managed for you.

Anthropic Integration

Base URL

Point your Anthropic SDK at the PromptUnit proxy endpoint to get routing, compression, prompt caching optimization, and cost tracking.

	URL
Direct (before)	https://api.anthropic.com/v1
Via PromptUnit	https://api.promptunit.ai/proxy/anthropic

Supported endpoints

POST /messages: fully compatible, streaming supported

SDK Wrapper

PromptUnitAnthropic wraps the official Anthropic SDK. Prompt cache optimization (Layer 13) is applied automatically when your system prompt is eligible.

typescript

import { PromptUnitAnthropic } from "@promptunit/sdk"

const client = new PromptUnitAnthropic({
  promptUnitKey: "pu_live_xxxxxxxxxxxx",
  anthropicKey: "sk-ant-...",
  feature: "summarization",
})

const response = await client.messages.create({
  model: "claude-opus-4-5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Summarize this document." }],
})

Response Headers

Every proxied response is augmented with x-promptunit-* headers that expose routing decisions, costs, quality scores, and optimization results. Your application can read these headers directly; they are also ingested automatically into your dashboard.

Header	Example	Description
x-promptunit-cost	0.000420	Actual cost in USD for this call
x-promptunit-latency	340ms	End-to-end latency
x-promptunit-model-used	gpt-4o-mini	Model that actually ran
x-promptunit-model-requested	gpt-4o	Model the client asked for
x-promptunit-task-type	summarization	Detected task classification
x-promptunit-saving	0.003200	USD saved vs requested model
x-promptunit-routing	routed / passthrough / observation	Routing decision
x-promptunit-spam-score	0.12	0 = clean, 1 = definite spam
x-promptunit-circuit-breaker	ok / anomaly / open	Budget guard status
x-promptunit-prompt-cache	hit:3 / miss	Anthropic prompt cache status
x-promptunit-compression	1240	Tokens saved by compression
x-promptunit-dialect	none / oai_to_claude	Dialect rules applied
x-promptunit-inflation	clean / token_stuffing	Inflation detection result
x-promptunit-output-verified	ok / truncated / refusal	Output quality check
x-promptunit-efficiency	87	Prompt efficiency score 0–100
x-promptunit-efficiency-issues	verbose_preamble,filler_phrases	Detected prompt issues
x-promptunit-context-injected	date,recency	Auto-injected context fields
x-promptunit-cache	semantic-hit / miss	Semantic cache result (Layer 24)
x-promptunit-complexity	simple / medium / complex / reasoning-heavy	Detected prompt complexity (Layer 23)
x-promptunit-complexity-score	45	Raw complexity score 0–100
x-promptunit-high-stakes	false / medical / legal / financial	High-stakes detection result (Layer 25)
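
A minimal sketch of reading a few of these headers in application code follows. The `CallSummary` type and `summarize_call` function are hypothetical names; the header names and example value formats (such as `340ms`) are taken from the table above, and the sketch assumes those formats hold for every response.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CallSummary:
    cost_usd: Optional[float]
    saving_usd: Optional[float]
    latency_ms: Optional[int]
    model_requested: Optional[str]
    model_used: Optional[str]
    routing: Optional[str]

    @property
    def was_rerouted(self) -> bool:
        """True when both model headers are present and differ."""
        return (
            self.model_requested is not None
            and self.model_used is not None
            and self.model_requested != self.model_used
        )


def _as_float(value: Optional[str]) -> Optional[float]:
    return float(value) if value is not None else None


def summarize_call(headers: Mapping[str, str]) -> CallSummary:
    """Extract selected x-promptunit-* values from a header mapping."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    latency = h.get("x-promptunit-latency")  # documented example: "340ms"
    return CallSummary(
        cost_usd=_as_float(h.get("x-promptunit-cost")),
        saving_usd=_as_float(h.get("x-promptunit-saving")),
        latency_ms=int(latency.removesuffix("ms")) if latency else None,
        model_requested=h.get("x-promptunit-model-requested"),
        model_used=h.get("x-promptunit-model-used"),
        routing=h.get("x-promptunit-routing"),
    )
```

With the official OpenAI Python SDK, response headers are reachable through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` and a `.parse()` method that returns the usual completion object; the header mapping can be passed to `summarize_call` as shown.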

Inferio™ Engine

Inferio™ is the inference optimization engine that processes every request before it reaches the upstream model provider. It operates as a transparent pipeline: your request goes in, an optimized request goes out to the best-fit model, and the original response format is returned to your application unchanged.

How Routing Works

For each request, Inferio™ computes a score across 10 dimensions:

01 Task type
02 Complexity
03 Token count
04 Conversation depth
05 Domain
06 Output format
07 Stakes level
08 Language
09 Context length
10 Prior performance

The score vector is matched against a continuously updated capability matrix of available models. The cheapest model whose capability profile meets your configured quality threshold receives the request. If no cheaper model qualifies, the originally requested model is used as a passthrough.
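
The selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not Inferio™'s implementation: the real engine matches a 10-dimensional score vector against a capability matrix, whereas this sketch collapses capability into one hypothetical `quality` number per model, and the prices in the usage note are made-up values.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    price_per_mtok: float  # illustrative price, USD per million tokens
    quality: float         # hypothetical 0-1 capability score for this request


def choose_model(
    requested: str, candidates: Sequence[Candidate], threshold: float
) -> tuple[str, str]:
    """Return (model_name, decision), where decision is 'routed' or 'passthrough'."""
    by_name = {c.name: c for c in candidates}
    # If the requested model is unknown, treat it as infinitely expensive.
    requested_price = (
        by_name[requested].price_per_mtok if requested in by_name else float("inf")
    )
    # Keep only models that clear the quality bar and are cheaper than requested.
    qualifying = [
        c
        for c in candidates
        if c.quality >= threshold and c.price_per_mtok < requested_price
    ]
    if not qualifying:
        return requested, "passthrough"
    cheapest = min(qualifying, key=lambda c: c.price_per_mtok)
    return cheapest.name, "routed"
```

With candidates `gpt-4o` (price 2.50, quality 0.95), `gpt-4o-mini` (0.15, 0.80), and `llama-3.1-8b` (0.05, 0.60), a request for `gpt-4o` at threshold 0.75 is routed to `gpt-4o-mini`; at threshold 0.90 nothing cheaper qualifies and the request passes through unchanged.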

During the 14-day observation period, routing runs in shadow mode: we record what we would have routed to without actually changing anything. You see projected savings in your dashboard before any live routing is enabled.

The 27 Layers

Inferio™ runs each request through up to 27 processing layers. Not all layers fire on every request. The pipeline is conditional based on request characteristics and your configuration. Layers 23-27 are the advanced intelligence algorithms that compound in value as traffic grows.

Smart Routing (Routing)

Multi-dimensional model selection across 10 scoring axes

Spam Filter (Security)

Shannon entropy + injection pattern detection

Request Deduplication (Cost)

Detects identical requests within a rolling window and returns cached responses, preventing duplicate spend on retried or parallel calls.

System Prompt Cache (Cost)

Anthropic prompt caching: 10x cheaper reads for repeated system prompts

Conversation Compression (Cost)

TF-IDF compression for long conversations to reduce token overhead

Prompt Efficiency Advisor (Quality)

Detects wasteful prompt patterns, scores prompts 0–100

Output Verification (Quality)

Checks for refusals, truncation, and format mismatch

Latency Profiler (Intelligence)

Records per-provider, per-model latency on every call. Feeds routing decisions when two models score equally on quality.

Cross-Customer Pattern Mining (Intelligence)

Aggregate learnings across anonymized traffic (coming soon)

Token Inflation Defense (Security)

Strips zero-width characters, detects repetition-based token attacks

Dialect Translation (Compatibility)

Auto-translates between OpenAI and Anthropic prompt formats

Context Grounding (Quality)

Injects date, locale, and recency hints for better output accuracy

Circuit Breaker (Safety)

Rolling spend windows, auto-downgrade, anomaly detection

Prompt Complexity Classifier (Intelligence)

Scores every prompt across 8 complexity signals (including reasoning depth, constraint count, word count, and code indicators) and recommends the minimum viable model. Routes simple requests to cheap models without burning tokens on the routing decision itself.

Semantic Request Cache (Cost)

Fingerprints incoming requests using normalized content hashing. Returns a cached response when an equivalent request was seen recently, avoiding redundant API calls entirely. Hit rate grows with traffic volume.

Multi-Model Consensus (Quality)

Detects high-stakes requests (medical, legal, financial, infrastructure) and runs dual cheap-model verification before responding. If both models agree, returns consensus. If they diverge, escalates to a flagship model. The aim is flagship quality at cheap-model prices on the majority of high-stakes requests.

Cross-Customer Quality Oracle (Intelligence)

Aggregates anonymized quality signals across all traffic to build a per-model, per-task-type performance index. Surfaces real-world quality benchmarks trained on millions of requests. Gets more accurate with every call across the platform.

Adaptive Threshold Learning (Intelligence)

Watches implicit feedback signals (retry patterns, follow-up corrections, session drop-off) to automatically adjust each organization's quality threshold over time. The system learns your team's real quality bar without any manual configuration.

Layer numbers reflect internal pipeline position. Non-sequential numbers indicate reserved slots for layers in private beta or upcoming release.
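
To make the Semantic Request Cache idea concrete, here is a minimal sketch of normalized content hashing with a time-limited lookup. It is not PromptUnit's implementation: the normalization rules (collapse whitespace, lowercase), the five-minute default lifetime, and all names are assumptions chosen for illustration, and message content is assumed to be a plain string.

```python
import hashlib
import json
import re
import time
from typing import Optional


def fingerprint(model: str, messages: list[dict]) -> str:
    """Hash a request after normalizing whitespace and case (assumed rules)."""
    normalized = [
        {
            "role": m["role"],
            "content": re.sub(r"\s+", " ", m["content"]).strip().lower(),
        }
        for m in messages
    ]
    payload = json.dumps({"model": model, "messages": normalized}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class FingerprintCache:
    """In-memory cache keyed by request fingerprint, with a fixed lifetime."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, str]] = {}

    def put(self, key: str, response: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._entries[key] = (now, response)

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.monotonic() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if now - stored_at > self._ttl:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return response
```

Two requests that differ only in spacing or capitalization map to the same fingerprint, so the second one can be served from the cache. Lowercasing is a deliberately aggressive choice here; a production system would need to decide which differences are safe to ignore.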

Logging Modes

PromptUnit stores per-request metadata to power your dashboard analytics. You control how much metadata is retained. Switch modes any time from your dashboard settings; changes take effect on the next request.

Standard (default)

Token counts (input + output)
Cost (actual + would-have-been)
Model used + model requested
Task type classification
Feature tag (x-promptunit-feature)
Efficiency score
Latency

Privacy

Token counts (input + output)
Cost (actual + would-have-been)
Model used + model requested

Task type classification: not stored
Feature tag: not stored
Efficiency score: not stored

Routing is unaffected by logging mode. Classification and model selection happen in memory during the request. Logging mode only controls what gets written to the database after the response is returned.

In Privacy mode, the Spend by Feature breakdown in your dashboard will show Unknown for feature names and the task type column in your logs will be empty. Cost and savings data remain fully accurate.
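
The difference between the two modes can be pictured as a filter applied to each record just before it is written. The sketch below is hypothetical: the field names are invented for illustration, and because the Privacy list above does not mention latency either way, the sketch leaves that field untouched.

```python
# Fields the Privacy list above marks as "not stored" (names are hypothetical).
PRIVACY_DROPPED_FIELDS = frozenset({"task_type", "feature", "efficiency_score"})


def record_to_store(record: dict, mode: str) -> dict:
    """Return the subset of a per-request record that would be persisted."""
    if mode == "standard":
        return dict(record)
    if mode == "privacy":
        return {
            key: value
            for key, value in record.items()
            if key not in PRIVACY_DROPPED_FIELDS
        }
    raise ValueError(f"unknown logging mode: {mode!r}")
```

Because the filter runs only at write time, the classification values still exist in memory while the request is being routed, which matches the statement that routing is unaffected by logging mode.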

GitHub Action

The PromptUnit AI Cost Analyzer is a GitHub Action that scans every pull request for expensive AI model usage and posts a comment with routing savings estimates. It runs automatically on PR open and sync with no build step required.

What it does

Scans added lines in the PR diff for GPT-4o, Claude Opus, Gemini 2.5 Pro and other expensive models
Skips PRs that already use PromptUnit (no duplicate comments)
Posts a cost comparison table showing what routing would save
Updates the existing comment on re-push rather than duplicating it

Add to your repo

Create .github/workflows/ai-cost-analyzer.yml in your repository:

yaml

name: AI Cost Analyzer

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: promptunit/sdk@main
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

That's the entire setup. The action uses only the standard GITHUB_TOKEN; no extra secrets are required.

Setup time: 2 minutes. Paste the YAML, done.
Dependencies: zero. No npm install, no build step.
Permissions needed: pull-requests: write. Read PR diff, post comment.

The action is self-contained and uses only Node.js built-in modules. It detects usage of GPT-4o, GPT-4 Turbo, o1, o3, Claude Opus, and Gemini 2.5 Pro in PR diffs. If any expensive model is found and PromptUnit is not already integrated, it posts a comment showing the one-line SDK change and a savings projection.

Source and full documentation are available at github.com/promptunit/sdk.
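
The scanning step can be illustrated with a short sketch. The action itself runs on Node.js; the Python below is only a model of the idea, and both the identifier patterns for the listed models and the way an existing PromptUnit integration is detected (any added line mentioning "promptunit") are assumptions.

```python
import re

# Assumed identifier spellings for the models named in the text above.
# gpt-4o-mini is excluded on the assumption that it is not "expensive".
EXPENSIVE_MODEL = re.compile(
    r"gpt-4o(?!-mini)|gpt-4-turbo|\bo1\b|\bo3\b|claude-opus|gemini-2\.5-pro",
    re.IGNORECASE,
)


def added_lines(diff_text: str) -> list[str]:
    """Return lines added by a unified diff, without the leading '+'."""
    return [
        line[1:]
        for line in diff_text.splitlines()
        # '+++' marks a file header, not added content.
        if line.startswith("+") and not line.startswith("+++")
    ]


def find_expensive_models(diff_text: str) -> list[str]:
    """List expensive model names found in added lines, in first-seen order."""
    lines = added_lines(diff_text)
    # Assumed skip rule: the PR already references PromptUnit somewhere.
    if any("promptunit" in line.lower() for line in lines):
        return []
    found: list[str] = []
    for line in lines:
        for match in EXPENSIVE_MODEL.finditer(line):
            name = match.group(0).lower()
            if name not in found:
                found.append(name)
    return found
```

An empty result means there is nothing to comment on, either because no expensive model was added or because the PR already routes through PromptUnit.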

Limits & Billing

PromptUnit charges only on verified savings: the measurable difference between what you would have paid and what you actually paid through optimized routing.

Billing model: 20% of verified savings. Billed monthly. $1 setup fee credited to first invoice.
Zero savings: $0 bill. If Inferio™ saves you nothing, you pay nothing.
Default spend limit: $100 / hr · $500 / day. Configurable per project in the dashboard.
Rate limits: provider-parity. Same as your upstream provider's limits.

How savings are calculated

01 Each call records the requested model and the model that actually ran, logged in x-promptunit-model-requested and x-promptunit-model-used.
02 The cost delta (what you would have paid minus what you paid) is recorded in x-promptunit-saving per call.
03 Monthly savings are summed and auditable. You can export the full call log at any time from the dashboard.
04 20% of the verified monthly savings total is invoiced. If the total is $0, no invoice is generated.
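
The arithmetic in the steps above can be checked against an exported call log with a few lines of code. This is a sketch, not the billing system: the function name is hypothetical, savings are taken as the string values of x-promptunit-saving, and rounding, the setup-fee credit, and any handling of negative per-call values are left out because the text does not specify them.

```python
from decimal import Decimal
from typing import Iterable

FEE_RATE = Decimal("0.20")  # 20% of verified savings, per the steps above


def monthly_fee(per_call_savings_usd: Iterable[str]) -> Decimal:
    """Sum per-call savings for a month and apply the 20% rate."""
    # Decimal avoids the drift that float addition shows on many tiny amounts.
    total = sum((Decimal(s) for s in per_call_savings_usd), Decimal("0"))
    if total <= 0:
        return Decimal("0")  # zero savings means nothing is invoiced
    return total * FEE_RATE
```

For a month whose calls saved $120.00 and $30.00, the verified savings total is $150.00 and the fee is $30.00.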

Spend limits are enforced at the proxy layer. When a rolling window limit is breached, the circuit breaker (Layer 22) activates and may auto-downgrade requests to a cheaper model rather than rejecting them, depending on your configuration.
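
A rolling spend window of the kind described above can be sketched as follows. The class and method names are hypothetical, treating "breached" as spend strictly above the limit is an assumption, and the sketch only reports whether a limit is exceeded; what happens next (downgrade or rejection) depends on your configuration and is not modeled here.

```python
from collections import deque


class RollingSpendWindow:
    """Track spend over the most recent `window_seconds` seconds."""

    def __init__(self, limit_usd: float, window_seconds: float) -> None:
        self.limit_usd = limit_usd
        self.window_seconds = window_seconds
        self._events: deque[tuple[float, float]] = deque()  # (timestamp, cost)

    def record(self, cost_usd: float, now: float) -> None:
        self._events.append((now, cost_usd))

    def spent(self, now: float) -> float:
        # Discard events that have aged out of the window, oldest first.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        return sum(cost for _, cost in self._events)

    def breached(self, now: float) -> bool:
        return self.spent(now) > self.limit_usd


# Default limits from the table above: $100 per hour, $500 per day.
hourly = RollingSpendWindow(limit_usd=100.0, window_seconds=3600.0)
daily = RollingSpendWindow(limit_usd=500.0, window_seconds=86400.0)
```

Each call's cost would be recorded in both windows, and a request would be checked against both before being forwarded.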

