
Most applications that use large language models are coupled to a single provider. The application calls OpenAI’s API directly, the prompt is written for GPT-4’s specific behaviors, and the response parsing assumes GPT-4’s output format. Switching to a different provider means rewriting the integration, rewriting the prompts, and rewriting the response handling.

This is the problem CalliopeAI solves. It provides a unified orchestration layer that manages multiple model providers, handles prompt adaptation across models, runs cross-model evaluations, and gives application developers a single interface that abstracts away the complexity of the multi-model landscape.

This article is a technical deep dive into how it works. It is written for engineers who are evaluating multi-model architectures or building their own orchestration layers and want to understand the design decisions and tradeoffs involved.

Architecture Overview

CalliopeAI sits between your application and the model providers. The application sends a request to CalliopeAI describing what it needs: the task type, the input data, any constraints, and the quality requirements. CalliopeAI determines which model to use, formats the request for that model’s API, sends it, processes the response, and returns a normalized result to the application.

The system has five core components:

The Provider Registry manages connections to model providers. It handles authentication, connection pooling, rate limiting, and health checking for each provider. When a provider goes down or starts returning errors, the registry marks it as unhealthy and stops routing requests to it until it recovers.

The Router decides which model handles each request. Routing decisions are based on task type, cost constraints, latency requirements, quality requirements, and provider health. The router can be configured with static rules, like “always use Claude for analysis tasks,” or with dynamic routing that selects the model based on real-time performance data.

The Prompt Manager stores, versions, and adapts prompts. A single logical prompt can have multiple model-specific adaptations. When a request is routed to a specific model, the prompt manager retrieves the appropriate version of the prompt for that model.

The Response Normalizer converts model-specific responses into a consistent format. Different models return different structures, use different token counting methods, and format their outputs differently. The normalizer handles these differences so that the application receives a consistent response regardless of which model produced it.

The Evaluation Engine measures model performance against defined criteria. It supports automated evaluation using rubrics, comparative evaluation across models, and quality monitoring over time.
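
To make the flow concrete, here is a minimal sketch of a request passing through these components. The class and method names are illustrative placeholders, not CalliopeAI's actual API.

from dataclasses import dataclass

@dataclass
class Request:
    task: str          # e.g. "analysis"
    prompt_name: str   # a logical prompt, resolved by the Prompt Manager
    variables: dict    # inputs interpolated into the prompt
    constraints: dict  # cost, latency, and quality requirements

def handle(request: Request, registry, router, prompts, normalizer):
    # 1. The registry knows which providers and models are currently healthy.
    candidates = registry.healthy_models(task=request.task)
    # 2. The router scores the candidates against the request's constraints.
    model = router.select(candidates, request.constraints)
    # 3. The prompt manager returns the adaptation written for the chosen model.
    prompt = prompts.render(request.prompt_name, model, request.variables)
    # 4. Call the provider, then normalize the response into one consistent shape.
    raw = registry.client(model.provider).complete(model.name, prompt)
    return normalizer.normalize(raw, model)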

The Provider Registry

The Provider Registry is responsible for maintaining a live picture of what models are available and what state they are in.

At startup, the registry initializes connections to all configured providers. For each provider, it queries the available models and their capabilities: context window size, supported input types (text, images, audio), pricing, and rate limits. This information is cached and refreshed periodically.

Each provider connection includes:

Health checking. A lightweight health check runs every 30 seconds. If a provider fails three consecutive health checks, it is marked as degraded. If it fails ten consecutive checks, it is marked as unavailable. The registry continues checking unavailable providers and restores them when they recover.

Rate limit tracking. The registry tracks API rate limits per provider and per model. When a model is approaching its rate limit, the registry signals the router to reduce traffic to that model before requests start failing. This is proactive rate management, not reactive error handling.

Connection pooling. HTTP connections to provider APIs are pooled and reused. For high-throughput workloads, this reduces connection overhead significantly. The pool size is configurable per provider based on expected traffic volume.

Circuit breaking. When a provider starts returning errors at a rate above a configured threshold, the circuit breaker opens and stops sending requests to that provider. After a configurable cooldown period, the circuit breaker half-opens and sends a single test request. If the test succeeds, traffic resumes. If it fails, the circuit stays open.
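
The health-check thresholds and circuit-breaker behavior above can be sketched as a small state tracker per provider. The thresholds follow the numbers in this section; the cooldown length and the single-failure trip rule are simplified assumptions.

import enum
import time

class Health(enum.Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"        # three consecutive failed health checks
    UNAVAILABLE = "unavailable"  # ten consecutive failed health checks

class ProviderState:
    """Health tracking plus a simple circuit breaker for one provider."""

    def __init__(self, cooldown_s: float = 60.0):  # cooldown length is illustrative
        self.failures = 0
        self.health = Health.HEALTHY
        self.open_until = 0.0                      # epoch seconds; 0 means circuit closed
        self.cooldown_s = cooldown_s

    def record_health_check(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= 10:
            self.health = Health.UNAVAILABLE
        elif self.failures >= 3:
            self.health = Health.DEGRADED
        elif ok:
            self.health = Health.HEALTHY

    def allow_request(self) -> bool:
        # Unavailable providers get no traffic; an open circuit blocks until the
        # cooldown passes, after which one test request is let through.
        return self.health is not Health.UNAVAILABLE and time.time() >= self.open_until

    def record_result(self, ok: bool) -> None:
        # Real error-rate thresholds are simplified here: any failure opens the circuit.
        if not ok:
            self.open_until = time.time() + self.cooldown_s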

The registry exposes a real-time view of provider health, available models, current rate limit utilization, and circuit breaker state. This data feeds into the router’s decision-making process.

The Router

The Router is the decision engine. For each incoming request, it selects the model that best satisfies the request’s requirements given the current state of available providers.

Routing decisions are made using a scoring function that evaluates each candidate model against the request’s requirements:

Task compatibility. The model must support the task type. A vision task cannot be routed to a text-only model. An audio transcription task can only go to models with audio capability. This is a hard filter, not a scoring factor.

Quality score. Based on evaluation data, each model has a quality score for each task type. If Claude scores 0.92 on analysis tasks and GPT-4 scores 0.87, Claude is preferred for analysis when quality is the priority. These scores are updated continuously as the evaluation engine processes new data.

Cost score. The estimated cost of the request on each model, based on input size, expected output size, and the model’s pricing. When a request has a cost constraint, models that exceed the constraint are filtered out. Otherwise, cost is a scoring factor.

Latency score. The expected latency for the request on each model, based on recent performance data. Some applications prioritize speed over quality. The router accounts for this by weighting latency more heavily when the request specifies a latency constraint.

Availability score. The current health and rate limit headroom of the provider. A model running at 90 percent of its rate limit scores lower than a model at 30 percent, even if the first model has a higher quality score. This prevents traffic from concentrating on a single provider and hitting rate limits.

The scoring function produces a weighted sum of these factors, with weights configurable per request and per task type. The model with the highest score is selected.
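
Here is a sketch of that scoring step, with hypothetical candidate fields and default weights (in the real system both are configurable per request and per task type):

DEFAULT_WEIGHTS = {"quality": 0.4, "cost": 0.2, "latency": 0.2, "availability": 0.2}

def score_model(model, request, weights=DEFAULT_WEIGHTS):
    """Weighted sum over normalized [0, 1] factor scores; None means 'filtered out'."""
    # Hard filter: task compatibility is not a scoring factor.
    if request["task"] not in model["supported_tasks"]:
        return None
    # Hard filter: cost constraint, when the request specifies one.
    if "max_cost_usd" in request and model["estimated_cost_usd"] > request["max_cost_usd"]:
        return None
    return sum(weights[factor] * model["scores"][factor] for factor in weights)

def route(candidates, request):
    """Return the highest-scoring candidate, or None if every model was filtered out."""
    scored = [(s, m) for m in candidates if (s := score_model(m, request)) is not None]
    return max(scored, key=lambda pair: pair[0])[1] if scored else None

Each candidate carries pre-computed, normalized factor scores (quality from evaluation data, availability from rate-limit headroom), so the routing decision itself stays cheap.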

For requests where quality is critical, the router supports a confirmation pattern: send the request to two models, compare the responses, and return the one that scores higher against the evaluation criteria. This doubles the cost but gives high-stakes requests a quality safeguard that no single call can provide.

The router also supports fallback chains: if the primary model fails, automatically retry with the next model in the chain. Fallback chains are configured per task type and handle both hard failures (API errors) and soft failures (response quality below threshold).
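
A fallback chain reduces to an ordered list per task type. In the sketch below, an exception counts as a hard failure and a score below an assumed quality floor counts as a soft failure; the model names and the floor value are illustrative.

FALLBACK_CHAINS = {
    # Per-task ordered preference; entries are illustrative model identifiers.
    "analysis": ["claude-3.5-sonnet", "gpt-4", "gemini-pro"],
}

def complete_with_fallback(task, prompt, call_model, evaluate, quality_floor=0.7):
    """Try each model in the chain; fall through on API errors or low quality."""
    last_error = None
    for model in FALLBACK_CHAINS[task]:
        try:
            response = call_model(model, prompt)   # may raise on a hard failure
        except Exception as exc:                   # hard failure: try the next model
            last_error = exc
            continue
        if evaluate(response) >= quality_floor:    # soft-failure check
            return response
    raise RuntimeError(f"all models in the {task!r} fallback chain failed") from last_error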

The Prompt Manager

Prompts are the most underappreciated component of a multi-model system. A prompt that produces excellent results on GPT-4 might produce mediocre results on Claude, and garbage on Gemini. The differences are not just about model capability. They are about how each model interprets instructions, how it handles system messages, and what output formats it defaults to.

The Prompt Manager addresses this with a three-layer architecture:

Layer 1: Logical prompts. These are the prompts as the application understands them. A logical prompt has a name, a description, a set of input variables, and a definition of the expected output. It does not contain model-specific instructions.

Layer 2: Model adaptations. Each logical prompt has one or more model-specific adaptations. The adaptation for Claude might structure the instructions differently than the adaptation for GPT-4. System messages might be worded differently. Output format instructions might use different examples. These adaptations are maintained alongside the logical prompt and versioned together.

Layer 3: Version management. Every prompt change creates a new version. Versions are immutable. When a prompt is updated, the new version is deployed gradually: first to 5 percent of traffic, then 25, then 50, then 100. At each stage, the evaluation engine compares the new version’s quality against the previous version. If quality degrades, the rollout is paused and the previous version is restored.
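
One way to picture the three layers is as a small data model. The fields below are hypothetical; they mirror the description above rather than CalliopeAI's internal schema.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class PromptVersion:
    """Layer 3: an immutable version of one model adaptation."""
    version: str                  # e.g. "v2.3.1"
    template: str                 # model-specific wording and examples

@dataclass
class ModelAdaptation:
    """Layer 2: the prompt as written for one specific model."""
    model: str                    # e.g. "claude-3.5-sonnet"
    versions: list[PromptVersion] = field(default_factory=list)
    rollout_percent: int = 100    # gradual rollout of the newest version

@dataclass
class LogicalPrompt:
    """Layer 1: the prompt as the application understands it."""
    name: str                     # e.g. "summarize_document"
    variables: list[str]          # expected input variables
    expected_output: str          # description of the output contract
    adaptations: dict[str, ModelAdaptation] = field(default_factory=dict)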

The Prompt Manager also supports prompt composition: building complex prompts from reusable components. A common instruction set, like “respond in JSON format with the following schema,” can be defined once and included in multiple prompts. When the instruction set is updated, all prompts that include it are updated automatically.

Variable interpolation supports both simple substitution and conditional logic. A prompt can include different sections based on the input data: additional context when the input is long, simplified instructions when the input is straightforward. This reduces token usage for simple requests while maintaining quality for complex ones.
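
Conditional sections need nothing more exotic than template logic. A minimal sketch, assuming a character-count cutoff for "long" inputs (the threshold and the extra instructions are illustrative):

LONG_DOC_THRESHOLD = 4000  # characters; an illustrative cutoff, not CalliopeAI's

def render_summarize_prompt(document: str) -> str:
    """Simple substitution plus one conditional section, as described above."""
    parts = ["Summarize the document below in three to five sentences."]
    if len(document) > LONG_DOC_THRESHOLD:
        # Extra guidance only for long inputs, so short requests stay cheap.
        parts.append(
            "The document is long. Focus on the main argument and skip "
            "examples, footnotes, and repeated points."
        )
    parts.append("Document:\n" + document)
    return "\n\n".join(parts)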

The Response Normalizer

Different models return responses in different formats. Some return structured JSON. Some return markdown. Some include metadata in the response. Some include it in headers. Token counts use different tokenization schemes. Finish reasons use different terminology.

The Response Normalizer converts all of this into a consistent response object:

{
  "content": "The normalized response text",
  "model": "claude-3.5-sonnet",
  "provider": "anthropic",
  "usage": {
    "input_tokens": 1250,
    "output_tokens": 340,
    "estimated_cost_usd": 0.0089
  },
  "latency_ms": 2340,
  "quality_score": null,
  "metadata": {
    "prompt_version": "v2.3.1",
    "route_reason": "highest_quality_score",
    "fallback_used": false
  }
}

The normalization includes:

Token count normalization. Token counts from different providers are converted to a common base using provider-specific tokenizer mappings. This allows accurate cost comparison across providers.

Content extraction. If the response includes preamble, caveats, or formatting that the application did not request, the normalizer strips it. This is configurable: some applications want the raw response, others want only the relevant content.

Structured output parsing. When the prompt requests structured output like JSON, the normalizer validates and parses the response. If the model returns malformed JSON, which happens more often than anyone wants to admit, the normalizer attempts repair before falling back to an error. Repair strategies include removing trailing commas, fixing unclosed strings, and extracting JSON from markdown code blocks.

Quality annotation. If the evaluation engine is configured for inline evaluation, the quality score is included in the response. This lets the application make decisions based on the assessed quality of each response.
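
The repair step for malformed structured output can be sketched as a chain of progressively more forgiving parsers. The ordering of strategies below is an assumption; only the strategies named above are shown.

import json
import re

def parse_structured_output(text: str):
    """Try strict JSON first, then the repair strategies described above."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strategy: extract JSON from a markdown code block, if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            text = fenced.group(1)
    # Strategy: remove trailing commas before closing braces and brackets.
    repaired = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        return json.loads(repaired)
    except json.JSONDecodeError as exc:
        # Unclosed-string repair and other strategies would go here.
        raise ValueError("structured output could not be repaired") from exc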

The Evaluation Engine

The Evaluation Engine is what turns a multi-model system from a theoretical cost optimization into a practical quality improvement. Without evaluation, you do not know which model is better for which task. You are guessing.

The engine supports three evaluation modes:

Rubric-based evaluation. Define a rubric with criteria and scoring levels. The engine evaluates each response against the rubric, either using a judge model (typically a different model than the one being evaluated) or using deterministic rules for measurable criteria like format compliance, length, and factual accuracy against a reference.

Comparative evaluation. Send the same request to multiple models and compare the responses. This can be automated, using a judge model to select the better response, or manual, using a review interface where human evaluators make the selection. Comparative evaluation produces relative quality scores that feed into the router’s scoring function.

Regression testing. A suite of test cases with expected outputs. When a prompt is updated or a new model is added, the regression suite verifies that quality has not degraded. This is the safety net that prevents changes from silently degrading the system.

Evaluation results are stored in a time-series database and used to update the quality scores that the router uses for routing decisions. The system continuously improves its routing as it accumulates more evaluation data.
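
As a concrete illustration of the rubric-based mode described above, here is a minimal deterministic scorer. The criteria and weights are hypothetical; a judge model would plug into the same scoring shape.

import json

def evaluate_response(response_text: str, reference: str, max_words: int = 200) -> float:
    """Score a response against a simple deterministic rubric, 0.0 to 1.0."""
    scores = {}
    # Criterion 1: format compliance (is the output valid JSON?).
    try:
        json.loads(response_text)
        scores["format"] = 1.0
    except json.JSONDecodeError:
        scores["format"] = 0.0
    # Criterion 2: length (within the requested word budget?).
    scores["length"] = 1.0 if len(response_text.split()) <= max_words else 0.5
    # Criterion 3: coverage against a reference (crude keyword overlap).
    ref_terms = set(reference.lower().split())
    hits = sum(1 for term in ref_terms if term in response_text.lower())
    scores["coverage"] = hits / len(ref_terms) if ref_terms else 1.0
    weights = {"format": 0.4, "length": 0.2, "coverage": 0.4}
    return sum(weights[criterion] * scores[criterion] for criterion in weights)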

Integration Patterns

Applications integrate with CalliopeAI through a REST API or a Python SDK. The simplest integration is a single API call:

from calliope import Client

client = Client()
response = client.complete(
    task="analysis",
    prompt="summarize_document",
    variables={"document": document_text},
    constraints={"max_cost_usd": 0.05, "max_latency_ms": 5000}
)

The client handles provider selection, prompt retrieval, request formatting, response normalization, and error handling. The application does not need to know which model was used or how the prompt was formatted.

For applications that need more control, the SDK exposes the router, prompt manager, and evaluation engine directly. You can override routing decisions, use specific prompt versions, or run evaluations inline.
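
For illustration, an advanced call might look like the sketch below; the parameter names are hypothetical and the actual SDK surface may differ.

# Continuing from the client created above; parameter names are hypothetical.
response = client.complete(
    task="analysis",
    prompt="summarize_document",
    prompt_version="v2.3.1",      # pin a specific prompt version
    model="claude-3.5-sonnet",    # override the router's selection
    variables={"document": document_text},
    evaluate=True,                # run inline evaluation and return quality_score
)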

Tradeoffs and Limitations

Multi-model orchestration adds latency. The routing decision, prompt retrieval, and response normalization each add a few milliseconds. For most applications, the total overhead is 10 to 30 milliseconds, which is negligible compared to model inference time. For latency-critical applications, the overhead matters, and the router can be configured to skip optimization and use a static route.

The system adds complexity. Instead of one model provider to manage, you manage multiple providers with different APIs, different pricing, and different behaviors. CalliopeAI abstracts most of this complexity, but it does not eliminate it. When something goes wrong, debugging requires understanding the full chain: application to CalliopeAI to provider and back.

Prompt adaptation across models is not automatic. Each model adaptation must be written and validated. For a system with ten prompts and three model providers, that is thirty prompt adaptations to maintain. The Prompt Manager reduces the burden with composition and shared components, but the work of adapting prompts to different models remains a human task.

Despite these tradeoffs, the architecture provides something that no single-provider integration can: resilience, cost optimization, quality improvement through evaluation, and the ability to adopt new models as they appear without rewriting your application. For any system that will run for more than six months and where AI quality matters, the investment is justified.

CalliopeAI is the system we wished existed when we started building multi-model applications for clients. We built it because the alternative, coupling every application to a single provider and hoping that provider stays the best option forever, is a bet that has never paid off in the history of technology platforms. The models will change. The providers will change. The pricing will change. Your orchestration layer is what lets you adapt when they do.