How to Build LLM Provider Failover That Actually Works

OpenAI had a major outage in April 2026 that took down API access for several hours. Teams with automatic failover were unaffected. Teams without it lost revenue, had degraded products during peak hours, and in several cases burned engineering time on manual interventions that should never have been necessary. The gap between those two outcomes was not architectural sophistication. It was a handful of decisions about how to handle provider failures, made in advance.

This is a guide to making those decisions correctly.

What Actually Breaks During a Provider Outage

Understanding what fails helps you prioritize what to protect. Any product feature that makes a synchronous LLM call, where a user action triggers the call and the user waits for the response, fails immediately and visibly when the provider is unreachable. Chat interfaces, code completions, document analysis, any user-facing feature in the request-response loop goes down with the provider.

Asynchronous and batch pipelines are more resilient by design. If your document processing queue writes jobs to a queue and a worker picks them up asynchronously, a provider outage means jobs sit in queue rather than failing. Users see delays, not errors. This is meaningfully different from a user asking a question and getting a 500 error. If you have latency-tolerant workloads, batch architecture using the batch API gives you natural resilience at no extra engineering cost.

The real exposure for most teams is the synchronous path. That is where failover matters most.

Three Failover Options, Ranked by How Well They Actually Work

Manual failover is the approach most teams have by default. When the provider is down, an engineer notices alerts, pulls up the configuration, changes the API key or base URL to a secondary provider, and deploys or restarts the service. In theory this works. In practice, it takes 10 to 60 minutes from detection to resolution. By then, the outage is over or users have already churned to your competitor. Manual failover is better than nothing, but it is not a reliability strategy.

A circuit breaker pattern is a significant improvement. The core idea: track your recent call outcomes. When the error rate or timeout rate exceeds a threshold, for example more than 5 errors in the last 10 calls, open the circuit. In the open state, route requests to a fallback provider instead of attempting the failing primary. After a configured interval, send a test request to the primary. If it succeeds, close the circuit and resume normal routing. Libraries like resilience4j (Java), polly (.NET), and tenacity (Python) implement this pattern. You can also implement it in a proxy layer, which is where it tends to work best in multi-service architectures.

Active-active routing is the most robust option and the architecture that made teams genuinely immune to the April 2026 outage. In this model, all requests flow through a routing proxy that can send any request to any configured provider. The primary provider receives requests by default. When the proxy detects provider failures, it switches automatically with no deployment required and no human in the loop. Failover time is measured in milliseconds, not minutes.

The Real Problem: API Format Differences

Teams that have implemented failover often discover a problem they underestimated: different providers have different API formats, and switching providers is not just a matter of changing a URL.

The OpenAI chat completions format uses a messages array with role and content fields, a model parameter, and returns choices with message objects. Anthropic's API uses a similar messages structure but handles system prompts differently, requires a max_tokens parameter, and returns content as an array of blocks rather than a single string. The response schemas are meaningfully different. A naive failover that points your OpenAI client at Anthropic will fail immediately.

Handling this properly requires dialect translation: a layer that accepts requests in one format, translates them to the target provider's format, makes the call, and translates the response back. This is solvable engineering, but it is not a 30-minute job. You need to handle system prompt format differences, map model names across providers, normalize stop sequences, and deal with provider-specific parameters that have no equivalent on the other side.

This is why failover implementations that live in the application layer, where you are writing OpenAI SDK calls directly, are hard to maintain across multiple providers. Dialect translation belongs in a proxy or middleware layer that your application code doesn't need to think about. Once you have an LLM inference proxy handling translation, adding a new failover target is a configuration change rather than a code change.

Testing Your Failover Before You Need It

A failover you have never tested is a failover you do not have. You have a failover hypothesis.

The standard approach is chaos engineering: deliberately inject failures into your system to verify that the failover path activates correctly. For teams running on Kubernetes, chaos-mesh can inject network partitions and pod failures. For application-layer testing, a custom middleware that randomly returns 503 responses to a configurable percentage of calls is sufficient. Run this during a load test and verify that your fallback provider receives traffic and that response quality is acceptable.

The specific things to verify: that the circuit breaker actually opens when it should, that dialect translation produces valid requests on the fallback provider, that response format normalization works correctly so your application code doesn't break, and that the circuit closes again when the primary provider recovers.

A less disruptive option is shadow testing: route a small percentage of production traffic to your fallback provider in parallel with the primary, compare responses, and verify that the translation layer is working correctly. This gives you confidence in the failover path without requiring a synthetic failure event.

Health Checks and Alerting

Don't rely on production traffic to detect provider failures. By the time your error rate climbs, users have already seen errors. A dedicated health check service that pings provider health endpoints every 30 seconds gives you early warning. OpenAI provides a status API. Anthropic maintains a public status page. Both can be polled programmatically. When the health check detects degradation, you want to know before your production error rate moves.

The alert hierarchy should be layered. First alert: provider health check degraded, no production impact yet. Second alert: error rate on primary provider above threshold, circuit breaker has opened, failover active. Third alert: failover provider also degraded, now you have a real problem that requires a human.

Most teams only have the second alert, which means they learn about problems from users. Adding the first level costs almost nothing and gives you a meaningful head start.

Combining Failover with Cost Optimization

Provider failover and cost routing are not separate problems. The same proxy layer that handles automatic failover can also route requests across providers based on cost, sending simple classification tasks to cheaper models while reserving expensive models for tasks that need them. The difference between a failover configuration and a routing configuration is the trigger: failover routes on provider health, cost routing routes on task characteristics.

Teams that build the infrastructure for one tend to get the other relatively cheaply, because the hard parts, dialect translation, response normalization, and provider configuration management, are shared.

PromptUnit's circuit breaker, part of the Inferio engine's 27-layer request processing pipeline, detects provider failures and automatically routes to the next available provider while handling API format translation between OpenAI and Anthropic formats transparently. Your application code doesn't change when the failover activates.

The April 2026 outage was not the last provider outage. The teams that built failover after it will be better prepared for the next one than the teams that are reading about this for the first time right now.

See our Multi-Provider Failover solution for how the circuit breaker, dialect translation, and SDK-level fallback described above work as a packaged integration.

Set up automatic provider failover with PromptUnit at promptunit.ai.