
Documentation

Everything you need to integrate PromptUnit into your stack. Two-minute setup, with no code changes beyond swapping the base URL and adding two headers.

Getting Started

Quick Start

Integrate PromptUnit in two steps. No new SDKs, no refactoring: just change your base URL and add two headers.

1. Change your base URL and add your PromptUnit headers
python
from openai import OpenAI

client = OpenAI(
    api_key="your-openai-key",
    base_url="https://api.promptunit.ai/api/proxy/openai",
    default_headers={
        "x-promptunit-key": "pu_live_xxxxxxxxxxxx",
        "x-promptunit-feature": "customer-support",
    }
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
javascript / node.js
import OpenAI from "openai"

const openai = new OpenAI({
  apiKey: "your-openai-key",
  baseURL: "https://api.promptunit.ai/api/proxy/openai",
  defaultHeaders: {
    "x-promptunit-key": "pu_live_xxxxxxxxxxxx",
    "x-promptunit-feature": "customer-support",
  },
})

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
})
2. Every call now flows through Inferio™; no other changes are needed
During the first 14 days, Inferio™ runs in observation mode: it records what it would have done without changing your routing. You see the potential savings before any optimization is applied.

Authentication

Two headers authenticate your requests. The PromptUnit key identifies your account; the provider key is forwarded to the upstream model provider.

| Header | Required | Description |
| --- | --- | --- |
| x-promptunit-key | Required | Your PromptUnit API key from the dashboard. Format: pu_live_... |
| x-api-key | Optional* | Your Anthropic API key (sk-ant-...). Omit if stored in the dashboard. |
| Authorization | Optional* | Your OpenAI API key as Bearer sk-.... Omit if stored in the dashboard. |

* Provider keys can be configured once in the PromptUnit dashboard instead of passing them per-request. Per-request keys always take precedence.
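
As a rough illustration of the table above, the headers can be assembled in one place. The helper name `build_promptunit_headers` and its parameters are hypothetical; only the header names and the `Bearer` format come from this page.

```python
from typing import Optional


def build_promptunit_headers(
    promptunit_key: str,
    feature: Optional[str] = None,
    openai_key: Optional[str] = None,
    anthropic_key: Optional[str] = None,
) -> dict:
    """Assemble request headers (hypothetical helper, not part of any SDK)."""
    headers = {"x-promptunit-key": promptunit_key}
    if feature:
        headers["x-promptunit-feature"] = feature
    # Provider keys are optional: leave them out when they are stored
    # in the dashboard.
    if openai_key:
        headers["Authorization"] = f"Bearer {openai_key}"
    if anthropic_key:
        headers["x-api-key"] = anthropic_key
    return headers
```

Calling `build_promptunit_headers("pu_live_xxxxxxxxxxxx", feature="customer-support")` yields only the two PromptUnit headers, which matches the case where provider keys live in the dashboard.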

Feature Tagging

The x-promptunit-feature header tags each request with the product feature that made the call. This powers per-feature cost breakdowns in your dashboard, the most valuable insight for understanding where your AI spend actually goes.

Feature tagging is optional but strongly recommended. Without it, all your traffic appears as a single unlabeled bucket in analytics.

Pick descriptive, consistent kebab-case names:

customer-support
summarization
code-review
onboarding
search
content-gen

You can pass any string up to 64 characters. Tags appear immediately in the dashboard after the first tagged request is received.
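
To keep tags consistent across a codebase, it can help to derive them through one function. The sketch below assumes you want to enforce the recommended kebab-case style yourself; PromptUnit accepts any string up to 64 characters, and `to_feature_tag` is a hypothetical helper.

```python
import re

MAX_TAG_LENGTH = 64  # limit stated on this page
KEBAB_CASE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def to_feature_tag(name: str) -> str:
    """Convert a free-form feature name into a kebab-case tag."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    tag = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    # Truncate to the documented limit without leaving a trailing hyphen.
    tag = tag[:MAX_TAG_LENGTH].rstrip("-")
    if not KEBAB_CASE.match(tag):
        raise ValueError(f"cannot derive a feature tag from {name!r}")
    return tag
```

For example, `to_feature_tag("Customer Support")` gives `customer-support`, so differently spelled call sites end up in the same dashboard bucket.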

OpenAI Integration

Base URL

Replace the OpenAI base URL with the PromptUnit proxy endpoint. Every other part of your SDK usage stays identical.

| | URL |
| --- | --- |
| Direct (before) | https://api.openai.com/v1 |
| Via PromptUnit | https://api.promptunit.ai/api/proxy/openai |

Supported endpoints

POST /chat/completions: fully compatible, streaming supported

SDK Wrapper

The @promptunit/sdk package provides a typed wrapper that handles headers automatically. The API surface is identical to the official OpenAI SDK.

typescript
import { PromptUnitOpenAI } from "@promptunit/sdk"

const client = new PromptUnitOpenAI({
  promptUnitKey: "pu_live_xxxxxxxxxxxx",
  openAIKey: "sk-...",
  feature: "customer-support",
})

// Exactly like the OpenAI SDK; no other changes needed
const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
})
The SDK wrapper is a thin layer over the standard OpenAI SDK. All parameters, return types, and streaming behavior are identical; only the base URL and authentication headers are managed for you.

Anthropic Integration

Base URL

Point your Anthropic SDK at the PromptUnit proxy endpoint to get routing, compression, prompt caching optimization, and cost tracking.

| | URL |
| --- | --- |
| Direct (before) | https://api.anthropic.com/v1 |
| Via PromptUnit | https://api.promptunit.ai/proxy/anthropic |

Supported endpoints

POST /messages: fully compatible, streaming supported

SDK Wrapper

PromptUnitAnthropic wraps the official Anthropic SDK. Prompt cache optimization (Layer 13) is applied automatically when your system prompt is eligible.

typescript
import { PromptUnitAnthropic } from "@promptunit/sdk"

const client = new PromptUnitAnthropic({
  promptUnitKey: "pu_live_xxxxxxxxxxxx",
  anthropicKey: "sk-ant-...",
  feature: "summarization",
})

const response = await client.messages.create({
  model: "claude-opus-4-5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Summarize this document." }],
})

Response Headers

Every proxied response is augmented with x-promptunit-* headers that expose routing decisions, costs, quality scores, and optimization results. Your application can read these headers directly; they are also ingested into your dashboard automatically.

| Header | Example | Description |
| --- | --- | --- |
| x-promptunit-cost | 0.000420 | Actual cost in USD for this call |
| x-promptunit-latency | 340ms | End-to-end latency |
| x-promptunit-model-used | gpt-4o-mini | Model that actually ran |
| x-promptunit-model-requested | gpt-4o | Model the client asked for |
| x-promptunit-task-type | summarization | Detected task classification |
| x-promptunit-saving | 0.003200 | USD saved vs requested model |
| x-promptunit-routing | routed / passthrough / observation | Routing decision |
| x-promptunit-spam-score | 0.12 | 0 = clean, 1 = definite spam |
| x-promptunit-circuit-breaker | ok / anomaly / open | Budget guard status |
| x-promptunit-prompt-cache | hit:3 / miss | Anthropic prompt cache status |
| x-promptunit-compression | 1240 | Tokens saved by compression |
| x-promptunit-dialect | none / oai_to_claude | Dialect rules applied |
| x-promptunit-inflation | clean / token_stuffing | Inflation detection result |
| x-promptunit-output-verified | ok / truncated / refusal | Output quality check |
| x-promptunit-efficiency | 87 | Prompt efficiency score 0–100 |
| x-promptunit-efficiency-issues | verbose_preamble,filler_phrases | Detected prompt issues |
| x-promptunit-context-injected | date,recency | Auto-injected context fields |
| x-promptunit-cache | semantic-hit / miss | Semantic cache result (Layer 24) |
| x-promptunit-complexity | simple / medium / complex / reasoning-heavy | Detected prompt complexity (Layer 23) |
| x-promptunit-complexity-score | 45 | Raw complexity score 0–100 |
| x-promptunit-high-stakes | false / medical / legal / financial | High-stakes detection result (Layer 25) |
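
If you want to act on these headers in your own code, a small parser keeps the string handling in one place. This is a minimal sketch: the function name and the output field names are hypothetical, and the value formats are assumed to match the examples in the table (for instance, latency ending in `ms`).

```python
from typing import Mapping, Optional


def parse_promptunit_headers(headers: Mapping[str, str]) -> dict:
    """Extract a few x-promptunit-* values into typed Python fields."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {name.lower(): value for name, value in headers.items()}

    def as_float(name: str) -> Optional[float]:
        value = h.get(name)
        return float(value) if value is not None else None

    latency = h.get("x-promptunit-latency")
    return {
        "cost_usd": as_float("x-promptunit-cost"),
        "saving_usd": as_float("x-promptunit-saving"),
        # Assumes the "340ms" format shown in the table above.
        "latency_ms": int(latency.removesuffix("ms")) if latency else None,
        "model_used": h.get("x-promptunit-model-used"),
        "model_requested": h.get("x-promptunit-model-requested"),
        "routing": h.get("x-promptunit-routing"),
    }
```

With the OpenAI Python SDK, one way to reach the raw headers is `client.chat.completions.with_raw_response.create(...)`; the returned object exposes `.headers`, which can be passed to this function, and `.parse()` for the usual completion object.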

Inferio™ Engine

Inferio™ is the inference optimization engine that processes every request before it reaches the upstream model provider. It operates as a transparent pipeline: your request goes in, an optimized request goes out to the best-fit model, and the original response format is returned to your application unchanged.

How Routing Works

For each request, Inferio™ computes a score across 10 dimensions:

01 Task type
02 Complexity
03 Token count
04 Conversation depth
05 Domain
06 Output format
07 Stakes level
08 Language
09 Context length
10 Prior performance

The score vector is matched against a continuously-updated capability matrix of available models. The cheapest model whose capability profile meets your configured quality threshold receives the request. If no cheaper model qualifies, the originally requested model is used as a passthrough.
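
The selection rule in the paragraph above can be sketched in a few lines. This is a deliberate simplification and not Inferio™'s implementation: the 10-dimension score vector and capability matrix are collapsed into a single hypothetical `quality` number per model, and the model names and costs are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_1k_tokens: float  # illustrative numbers, not real pricing
    quality: float             # hypothetical capability score in [0, 1]


def pick_model(requested: str, candidates: list, threshold: float) -> tuple:
    """Return (model_name, decision); decision is 'routed' or 'passthrough'."""
    by_name = {m.name: m for m in candidates}
    requested_cost = by_name[requested].cost_per_1k_tokens
    # Keep only models that are cheaper than the requested one
    # and still clear the configured quality threshold.
    eligible = [
        m for m in candidates
        if m.quality >= threshold and m.cost_per_1k_tokens < requested_cost
    ]
    if not eligible:
        return requested, "passthrough"
    cheapest = min(eligible, key=lambda m: m.cost_per_1k_tokens)
    return cheapest.name, "routed"


models = [
    ModelProfile("big", 10.0, 0.95),
    ModelProfile("mid", 2.0, 0.85),
    ModelProfile("small", 0.5, 0.70),
]
print(pick_model("big", models, threshold=0.8))  # ('mid', 'routed')
print(pick_model("big", models, threshold=0.9))  # ('big', 'passthrough')
```

Raising the threshold shrinks the eligible set, which is why a stricter quality bar produces more passthrough decisions.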

During the 14-day observation period, routing runs in shadow mode: we record what we would have routed to without actually changing anything. You see projected savings in your dashboard before any live routing is enabled.

The 27 Layers

Inferio™ runs each request through up to 27 processing layers. Not all layers fire on every request. The pipeline is conditional based on request characteristics and your configuration. Layers 23-27 are the advanced intelligence algorithms that compound in value as traffic grows.

Layers 1–10: Smart Routing (Routing)
Multi-dimensional model selection across 10 scoring axes

Layer 11: Spam Filter (Security)
Shannon entropy + injection pattern detection

Layer 12: Request Deduplication (Cost)
Detects identical requests within a rolling window and returns cached responses, preventing duplicate spend on retried or parallel calls.

Layer 13: System Prompt Cache (Cost)
Anthropic prompt caching: 10x cheaper reads for repeated system prompts

Layer 14: Conversation Compression (Cost)
TF-IDF compression for long conversations to reduce token overhead

Layer 15: Prompt Efficiency Advisor (Quality)
Detects wasteful prompt patterns, scores prompts 0–100

Layer 16: Output Verification (Quality)
Checks for refusals, truncation, and format mismatch

Layer 17: Latency Profiler (Intelligence)
Records per-provider, per-model latency on every call. Feeds routing decisions when two models score equally on quality.

Layer 18: Cross-Customer Pattern Mining (Intelligence)
Aggregate learnings across anonymized traffic (coming soon)

Layer 19: Token Inflation Defense (Security)
Strips zero-width characters, detects repetition-based token attacks

Layer 20: Dialect Translation (Compatibility)
Auto-translates between OpenAI and Anthropic prompt formats

Layer 21: Context Grounding (Quality)
Injects date, locale, and recency hints for better output accuracy

Layer 22: Circuit Breaker (Safety)
Rolling spend windows, auto-downgrade, anomaly detection

Layer 23: Prompt Complexity Classifier (Intelligence)
Scores every prompt across 8 complexity signals (reasoning depth, constraint count, word count, code indicators) and recommends the minimum viable model. Routes simple requests to cheap models without burning tokens on the routing decision itself.

Layer 24: Semantic Request Cache (Cost)
Fingerprints incoming requests using normalized content hashing. Returns a cached response when an equivalent request was seen recently, avoiding redundant API calls entirely. Hit rate grows with traffic volume.

Layer 25: Multi-Model Consensus (Quality)
Detects high-stakes requests (medical, legal, financial, infrastructure) and runs dual cheap-model verification before responding. If both models agree, returns consensus. If they diverge, escalates to a flagship model. Flagship quality at cheap-model price on the majority of high-stakes requests.

Layer 26: Cross-Customer Quality Oracle (Intelligence)
Aggregates anonymized quality signals across all traffic to build a per-model, per-task-type performance index. Surfaces real-world quality benchmarks trained on millions of requests. Gets more accurate with every call across the platform.

Layer 27: Adaptive Threshold Learning (Intelligence)
Watches implicit feedback signals (retry patterns, follow-up corrections, session drop-off) to automatically adjust each organization's quality threshold over time. The system learns your team's real quality bar without any manual configuration.
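
To make the "normalized content hashing" of Layer 24 more concrete, here is one way such a fingerprint could be computed. The normalization steps (lowercasing, whitespace collapsing) and the SHA-256 choice are assumptions for illustration; the actual normalization Inferio™ applies is not documented here, and this sketch omits the "seen recently" expiry.

```python
import hashlib
import json
import re


def request_fingerprint(model: str, messages: list) -> str:
    """Hash a chat request after normalising case and whitespace."""
    normalised = [
        {
            "role": m["role"],
            # Collapse whitespace runs and ignore case so trivially
            # different spellings of the same prompt hash identically.
            "content": re.sub(r"\s+", " ", m["content"]).strip().lower(),
        }
        for m in messages
    ]
    payload = json.dumps({"model": model, "messages": normalised}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class FingerprintCache:
    """In-memory cache keyed by fingerprint (no expiry in this sketch)."""

    def __init__(self) -> None:
        self._store = {}

    def get(self, model: str, messages: list):
        return self._store.get(request_fingerprint(model, messages))

    def put(self, model: str, messages: list, response) -> None:
        self._store[request_fingerprint(model, messages)] = response
```

Two requests that differ only in spacing or capitalization map to the same key, so the second one can be answered from the cache instead of reaching the provider.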

Layer numbers reflect internal pipeline position. Non-sequential numbers indicate reserved slots for layers in private beta or upcoming release.
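
Two of the security layers name techniques that are easy to show in isolation: Shannon entropy (Layer 11) and zero-width character stripping (Layer 19). The sketch below covers only those building blocks. The set of characters treated as zero-width is an assumption, and how Inferio™ turns the entropy value into a spam score is not documented here.

```python
import math
from collections import Counter

# Assumed set of invisible characters; a real filter may cover more.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text: str) -> str:
    """Remove zero-width characters that add tokens but no visible text."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def shannon_entropy(text: str) -> float:
    """Entropy in bits per character; 0.0 for empty text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(shannon_entropy("abab"))  # 1.0
print(shannon_entropy("abcd"))  # 2.0
print(strip_zero_width("he\u200bllo"))  # hello
```

Very low entropy points at repetitive filler, while unusually high entropy can indicate random or encoded payloads; either extreme is a plausible input signal for a spam filter.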

Logging Modes

PromptUnit stores per-request metadata to power your dashboard analytics. You control how much metadata is retained. Switch modes any time from your dashboard settings; changes take effect on the next request.

Standard (default)

  • Token counts (input + output)
  • Cost (actual + would-have-been)
  • Model used + model requested
  • Task type classification
  • Feature tag (x-promptunit-feature)
  • Efficiency score
  • Latency

Privacy

  • Token counts (input + output)
  • Cost (actual + would-have-been)
  • Model used + model requested
  • Task type classification (not stored)
  • Feature tag (not stored)
  • Efficiency score (not stored)

Routing is unaffected by logging mode. Classification and model selection happen in memory during the request. Logging mode only controls what gets written to the database after the response is returned.

In Privacy mode, the Spend by Feature breakdown in your dashboard will show Unknown for feature names and the task type column in your logs will be empty. Cost and savings data remain fully accurate.
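
The difference between the two modes amounts to a field filter applied just before a record is written. The sketch below illustrates that idea only: the field names are hypothetical, and latency is left out because this page does not say how Privacy mode treats it.

```python
# Hypothetical field names; the real storage schema is not documented here.
ALWAYS_STORED = {
    "input_tokens", "output_tokens",
    "cost_actual", "cost_would_have_been",
    "model_used", "model_requested",
}
STANDARD_ONLY = {"task_type", "feature", "efficiency_score"}


def record_for_storage(record: dict, mode: str) -> dict:
    """Drop the fields that the given logging mode does not retain."""
    if mode == "standard":
        allowed = ALWAYS_STORED | STANDARD_ONLY
    elif mode == "privacy":
        allowed = ALWAYS_STORED
    else:
        raise ValueError(f"unknown logging mode: {mode!r}")
    return {key: value for key, value in record.items() if key in allowed}
```

Because the filter runs on the finished record, routing has already used the task type and feature in memory, which is consistent with routing being unaffected by the logging mode.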

GitHub Action

The PromptUnit AI Cost Analyzer is a GitHub Action that scans every pull request for expensive AI model usage and posts a comment with routing savings estimates. It runs automatically on PR open and sync with no build step required.

What it does

  • Scans added lines in the PR diff for GPT-4o, Claude Opus, Gemini 2.5 Pro and other expensive models
  • Skips PRs that already use PromptUnit (no duplicate comments)
  • Posts a cost comparison table showing what routing would save
  • Updates the existing comment on re-push rather than duplicating it

Add to your repo

Create .github/workflows/ai-cost-analyzer.yml in your repository:

yaml
name: AI Cost Analyzer

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: promptunit/sdk@main
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

That's the entire setup. The action uses only the standard GITHUB_TOKEN; no extra secrets are required.

  • Setup time: 2 minutes. Paste the YAML, done.
  • Dependencies: zero. No npm install, no build step.
  • Permissions needed: pull-requests: write (read PR diff, post comment).

The action is self-contained and uses only Node.js built-in modules. It detects usage of GPT-4o, GPT-4 Turbo, o1, o3, Claude Opus, and Gemini 2.5 Pro in PR diffs. If any expensive model is found and PromptUnit is not already integrated, it posts a comment showing the one-line SDK change and a savings projection.
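
For a sense of what scanning added lines involves, here is a minimal sketch of the idea. It is not the action's source: the regular expressions, the function name, and the rule of treating any added line that mentions "promptunit" as "already integrated" are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for the model families listed above.
EXPENSIVE_MODEL = re.compile(
    r"gpt-4o(?!-mini)|gpt-4-turbo|\bo1\b|\bo3\b|claude-opus|gemini-2\.5-pro",
    re.IGNORECASE,
)
PROMPTUNIT_MARKER = re.compile(r"promptunit", re.IGNORECASE)


def scan_diff(diff_text: str) -> list:
    """Return expensive model names found on added lines of a unified diff."""
    # Added lines start with '+'; skip the '+++ b/file' header lines.
    added = [
        line[1:] for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    # Assumed rule: stay silent when the PR already references PromptUnit.
    if any(PROMPTUNIT_MARKER.search(line) for line in added):
        return []
    found = []
    for line in added:
        for match in EXPENSIVE_MODEL.finditer(line):
            name = match.group(0).lower()
            if name not in found:
                found.append(name)
    return found
```

Removed lines and unchanged context are ignored, so a PR only triggers a comment for model usage it introduces.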

Source and full documentation are available at github.com/promptunit/sdk.

Limits & Billing

PromptUnit charges only on verified savings: the measurable difference between what you would have paid and what you actually paid through optimized routing.

  • Billing model: 20% of verified savings. Billed monthly. $1 setup fee credited to first invoice.
  • Zero savings: $0 bill. If Inferio™ saves you nothing, you pay nothing.
  • Default spend limit: $100 / hr · $500 / day. Configurable per project in the dashboard.
  • Rate limits: provider-parity. Same as your upstream provider's limits.

How savings are calculated

  1. Each call records the requested model and the model that actually ran, logged in x-promptunit-model-requested and x-promptunit-model-used.
  2. The cost delta (what you would have paid minus what you paid) is recorded in x-promptunit-saving per call.
  3. Monthly savings are summed and auditable. You can export the full call log at any time from the dashboard.
  4. 20% of the verified monthly savings total is invoiced. If the total is $0, no invoice is generated.
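
The arithmetic in the steps above is simple enough to check by hand against an exported call log. The sketch below assumes per-call savings arrive as decimal strings like the `x-promptunit-saving` example; rounding to whole cents is an assumption, and the $1 setup fee credit is not modelled.

```python
from decimal import Decimal

FEE_RATE = Decimal("0.20")  # 20% of verified savings, as stated above


def monthly_invoice(savings_per_call: list) -> Decimal:
    """Sum per-call savings (USD strings) and apply the 20% fee."""
    total = sum((Decimal(s) for s in savings_per_call), Decimal("0"))
    # Rounding to cents is an assumption of this sketch.
    return (total * FEE_RATE).quantize(Decimal("0.01"))


# 1,000 calls that each saved $0.003200 add up to $3.20 in savings,
# so the fee is 20% of $3.20.
print(monthly_invoice(["0.003200"] * 1000))  # 0.64
```

`Decimal` is used instead of `float` so that summing many small amounts does not accumulate binary rounding error.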
Spend limits are enforced at the proxy layer. When a rolling window limit is breached, the circuit breaker (Layer 22) activates and may auto-downgrade requests to a cheaper model rather than rejecting them, depending on your configuration.
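
A rolling spend window of the kind described above can be sketched with a queue of timestamped costs. This is an illustration under stated assumptions, not the proxy's implementation: the class name is hypothetical, only the `ok` and `open` states are modelled (not `anomaly`), and "breached" is taken to mean the window total exceeds the limit.

```python
from collections import deque


class RollingSpendWindow:
    """Track spend over the last `window_seconds` and flag a breach."""

    def __init__(self, limit_usd: float, window_seconds: float) -> None:
        self.limit_usd = limit_usd
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, cost_usd), oldest first
        self._total = 0.0

    def _evict(self, now: float) -> None:
        # Drop costs that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, cost = self._events.popleft()
            self._total -= cost

    def record(self, cost_usd: float, now: float) -> None:
        self._evict(now)
        self._events.append((now, cost_usd))
        self._total += cost_usd

    def status(self, now: float) -> str:
        self._evict(now)
        return "open" if self._total > self.limit_usd else "ok"


# Default hourly limit from this page: $100 per hour.
window = RollingSpendWindow(limit_usd=100.0, window_seconds=3600)
window.record(60.0, now=0)
window.record(50.0, now=1800)
print(window.status(now=1800))  # open
print(window.status(now=3601))  # ok
```

Once the status is `open`, the caller decides whether to downgrade to a cheaper model or reject the request, mirroring the configurable behavior described above.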

Ready to start?

Your first optimization in 5 minutes

14-day observation period. If PromptUnit doesn't reduce your AI spend, you pay nothing.