Chaos engineering has a simple thesis: inject controlled failures into a system to discover weaknesses before they become incidents. Netflix popularized this with Chaos Monkey, which randomly terminates production instances to ensure the system handles failure gracefully. The practice has since matured into a discipline with established tools, methodologies, and a growing community of practitioners.

But AI systems have failure modes that traditional chaos engineering does not address. Servers crash and networks partition – those are binary failures that existing tools test well. AI systems fail in subtler, more insidious ways: model quality degrades gradually, prompts drift as models are updated, embeddings become stale as data changes, providers have latency spikes that cascade through multi-step pipelines, and hallucination rates increase without any infrastructure failure.

These are the failures that wake you up at 3 AM. Not because the system is down, but because it is up and producing wrong answers at scale. This is why we built Firedrill – to bring chaos engineering discipline to AI-specific failure modes.

Why AI Systems Fail Differently

Traditional software systems have a useful property: they either work or they do not. A database connection either succeeds or throws an error. An API either returns a valid response or returns an error code. Binary states are easy to monitor and easy to test.

AI systems operate on a continuum. A language model does not error when it hallucinates – it returns a confident, well-formatted response that happens to be wrong. An embedding model does not crash when its quality degrades – it returns vectors that are slightly less useful than they were last month. A provider does not go down when it gets slow – it returns responses in 8 seconds instead of 2 seconds, and your multi-step pipeline that assumed 2-second responses now takes 40 seconds.

These degradation modes compound. A 10% increase in hallucination rate across a 5-step agentic pipeline does not produce a 10% worse outcome – it produces a combinatorial explosion of potential errors. If each step's error rate rises from 2% to 12%, the chance that all five steps come out correct falls from roughly 90% to about 53%. A slight increase in provider latency does not slow one request – it creates a backlog that triggers rate limits that cause retries that amplify the original problem.

The traditional SRE toolkit – health checks, error rate monitoring, uptime tracking – catches the binary failures. It misses the degradation failures entirely. Your dashboards show green while your AI system quietly produces garbage.

AI-Specific Failure Modes

Through building and operating AI systems for our clients, we have cataloged the failure modes that matter most:

Model Degradation

Foundation model providers update their models continuously. These updates usually improve overall performance, but they can degrade performance on specific tasks, domains, or prompt patterns. A prompt that worked perfectly with GPT-4-0613 might produce different (worse) results with a subsequent update. The provider does not notify you of these changes, and your system has no way to detect them without explicit evaluation.

What to test: Run your evaluation suite against your production prompts on a regular schedule. Compare results against baseline metrics. Alert when accuracy, format compliance, or other quality metrics drop below thresholds.
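
A rough sketch of that scheduled check in Python. The baseline numbers, the tolerance, and the run_eval_suite callable are placeholders for whatever your own evaluation harness provides:

from typing import Callable, Dict, List

# Hypothetical baselines recorded against a known-good model version.
BASELINE: Dict[str, float] = {"accuracy": 0.91, "format_compliance": 0.99}
MAX_DROP = 0.03  # alert if any metric falls more than 3 points below its baseline

def regressed_metrics(run_eval_suite: Callable[[], Dict[str, float]]) -> List[str]:
    """Run the evaluation suite and return the metrics that dropped past the tolerance."""
    current = run_eval_suite()
    return [
        name for name, base in BASELINE.items()
        if current.get(name, 0.0) < base - MAX_DROP
    ]

Wire the returned list into whatever alerting you already use; the important part is that the comparison runs on a schedule, not only when someone remembers to check.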

Prompt Drift

As models change, the effectiveness of specific prompting strategies shifts. Chain-of-thought prompting, few-shot examples, system prompt instructions – all of these interact with the model’s training, and changes in the model can change how these techniques perform. A carefully tuned system prompt might become less effective after a provider update without any change on your side.

What to test: Track output quality metrics continuously. When quality metrics degrade without any changes to your system, the cause is likely prompt drift triggered by a model update.
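
One way to track this continuously is to compare a short rolling window of recent quality scores against a longer reference window. A sketch, with the window sizes and tolerance chosen arbitrarily:

from collections import deque
from statistics import mean

class DriftMonitor:
    """Flags when the recent mean of a quality score sits well below the longer-term mean."""

    def __init__(self, reference_size: int = 500, recent_size: int = 50, tolerance: float = 0.05):
        self.reference = deque(maxlen=reference_size)
        self.recent = deque(maxlen=recent_size)
        self.tolerance = tolerance

    def observe(self, score: float) -> bool:
        """Record one per-response quality score; return True when drift is suspected."""
        self.reference.append(score)
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent data to compare yet
        return mean(self.recent) < mean(self.reference) - self.tolerance

If a monitor like this fires when you have not shipped a prompt or code change, a provider-side model update is the first suspect.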

Embedding Staleness

Embedding models map text to vectors. When you update your embedding model (or when your data changes significantly), old embeddings no longer align with new queries. A document embedded six months ago with a previous model version does not live in the same vector space as a query embedded today with the current version. Retrieval quality degrades silently.

What to test: Run a set of known-good queries against your retrieval system on a schedule. Measure retrieval precision and recall. When metrics degrade, check whether the embedding model has changed or whether new data has been added that was not properly embedded.
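
A sketch of that scheduled retrieval check. The golden queries and the search callable are stand-ins for your own known-good query set and retrieval client:

from typing import Callable, Dict, List, Set

# Hypothetical golden set: query -> ids of documents a healthy index should return.
GOLDEN_QUERIES: Dict[str, Set[str]] = {
    "how do I rotate an API key": {"doc-114", "doc-208"},
    "refund policy for annual plans": {"doc-031"},
}

def mean_recall_at_k(search: Callable[[str, int], List[str]], k: int = 5) -> float:
    """Average recall@k over the golden queries; compare the result against a recorded baseline."""
    recalls = []
    for query, expected in GOLDEN_QUERIES.items():
        returned = set(search(query, k))
        recalls.append(len(returned & expected) / len(expected))
    return sum(recalls) / len(recalls)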

Provider Outages and Latency Spikes

AI provider outages are common. Partial outages – where the API responds but with elevated latency or reduced quality – are even more common and harder to detect. A provider responding in 5 seconds instead of 1 second does not trigger an error, but it cascades through your pipeline.

What to test: Inject artificial latency into provider calls. Simulate complete provider outages. Verify that your fallback chains activate correctly and that degraded mode provides acceptable service.
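
The injection itself can be a thin wrapper around the provider call. The sketch below is generic Python rather than Firedrill's API, and the delay and failure rate are illustrative:

import random
import time
from typing import Callable

def with_injected_chaos(call: Callable[..., str], delay_s: float = 3.0, fail_rate: float = 0.0) -> Callable[..., str]:
    """Wrap a provider call with artificial latency and, optionally, injected hard failures."""
    def wrapped(*args, **kwargs):
        time.sleep(delay_s + random.uniform(0.0, 1.0))  # simulated latency spike with jitter
        if random.random() < fail_rate:
            raise TimeoutError("injected provider outage")  # simulated complete outage
        return call(*args, **kwargs)
    return wrapped

Routing even a small slice of traffic through a wrapper like this in a staging environment tells you whether circuit breakers and fallback chains actually fire.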

Hallucination Spikes

Hallucination rates are not constant. They vary by input type, context length, model state, and factors that are not well understood. A system that hallucinates at a 2% rate on Monday might hallucinate at a 10% rate on Wednesday because the distribution of user inputs shifted slightly.

What to test: Monitor hallucination indicators (factual accuracy, source grounding, self-consistency) continuously. Inject adversarial inputs designed to trigger hallucination and verify that your detection mechanisms catch them.
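
Source grounding is one of the cheaper indicators to compute. The sketch below is a deliberately crude lexical-overlap proxy; a production system would more likely use an NLI model or an LLM judge:

from typing import List

def grounding_score(answer: str, sources: List[str], min_overlap: int = 3) -> float:
    """Crude proxy: fraction of answer sentences sharing at least min_overlap words with the sources."""
    source_words = set(" ".join(sources).lower().split())
    sentences = [s for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    grounded = sum(
        1 for s in sentences
        if len(set(s.lower().split()) & source_words) >= min_overlap
    )
    return grounded / len(sentences)

Track the distribution of a score like this over time; a sustained drop is the spike you are looking for.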

Context Window Overflow

As conversations grow or retrieval returns more results, the context window fills up. When context is truncated or summarized to fit the window, important information can be lost. The system continues to respond but without critical context, and the responses degrade.

What to test: Simulate long conversations and large retrieval results that push against context limits. Verify that your context management strategy (truncation, summarization, prioritization) maintains response quality.
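
A sketch of the kind of test this implies, using a word count as a crude stand-in for a tokenizer: build a conversation far larger than the budget and assert that the packing strategy keeps the system prompt and the most recent turns.

from typing import List

def pack_context(system_prompt: str, turns: List[str], budget_words: int = 3000) -> List[str]:
    """Keep the system prompt plus as many of the most recent turns as fit the word budget."""
    kept, used = [], len(system_prompt.split())
    for turn in reversed(turns):
        cost = len(turn.split())
        if used + cost > budget_words:
            break
        kept.append(turn)
        used += cost
    return [system_prompt] + list(reversed(kept))

# Chaos test: an oversized conversation must still keep the system prompt and the latest turn.
oversized = [f"turn {i}: " + "word " * 200 for i in range(100)]
packed = pack_context("You are a support assistant.", oversized)
assert packed[0].startswith("You are")
assert packed[-1] == oversized[-1]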

The Firedrill Approach

Firedrill is our tool for testing AI system resilience against these failure modes. The design philosophy follows three principles:

Controlled injection. Every chaos experiment is defined declaratively, with clear parameters for what is being injected, how severe the injection is, and what the blast radius looks like. No surprise failures – every experiment is intentional and bounded.

Measurable outcomes. Every experiment tracks specific metrics before, during, and after the injection. The goal is not to cause failures but to observe how the system behaves under stress and whether existing safeguards work.

Gradual escalation. Start with mild injections and increase severity. Inject 100ms of extra latency before injecting 5 seconds. Degrade model quality slightly before simulating a full outage. Gradual escalation reveals the point at which the system transitions from healthy to degraded to failed.
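
In practice the escalation can be automated: run the same experiment at increasing severities and record the first level at which the success criteria stop holding. A sketch (none of these names come from Firedrill):

from typing import Callable, List, Optional

# Example escalation ladder for injected latency, in milliseconds.
LATENCY_LADDER_MS: List[int] = [100, 500, 1000, 3000, 5000]

def find_breaking_point(run_at: Callable[[int], bool], ladder: List[int]) -> Optional[int]:
    """Run the experiment at each severity; return the first one where the success criteria fail."""
    for severity in ladder:
        if not run_at(severity):  # run_at returns True while the system stays within its criteria
            return severity
    return None  # the system survived every level tested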

Experiment Types

Firedrill supports several categories of chaos experiments for AI systems:

Provider chaos. Simulate provider failures: complete outages, elevated latency, rate limit errors, partial degradation (responses that are technically valid but lower quality). Verify that fallback chains activate, that circuit breakers trip at the right thresholds, and that the system degrades gracefully.

experiment:
  name: "anthropic-latency-spike"
  target: provider.anthropic
  injection:
    type: latency
    delay_ms: 3000
    jitter_ms: 1000
    duration_minutes: 15
  metrics:
    - pipeline_latency_p99
    - fallback_activation_count
    - user_facing_error_rate
  success_criteria:
    - pipeline_latency_p99 < 5000
    - user_facing_error_rate < 0.01

Retrieval chaos. Degrade retrieval quality by injecting irrelevant results, removing relevant results, or shuffling result rankings. Verify that the system’s response quality degrades gracefully rather than catastrophically, and that low-confidence retrieval triggers appropriate fallback behavior.
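
One way to implement this is to post-process the retriever's output before it reaches the generation step. A sketch, with the drop rate and noise injection purely illustrative:

import random
from typing import List, Optional

def degrade_retrieval(results: List[str], drop_rate: float = 0.3,
                      noise_docs: Optional[List[str]] = None, shuffle: bool = True) -> List[str]:
    """Simulate retrieval degradation: drop some results, inject irrelevant ones, shuffle the ranking."""
    kept = [r for r in results if random.random() > drop_rate]
    if noise_docs:
        kept += random.sample(noise_docs, k=min(2, len(noise_docs)))  # a couple of irrelevant documents
    if shuffle:
        random.shuffle(kept)
    return kept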

Model quality chaos. Replace the production model with a weaker model (or inject systematic errors into model responses) to simulate model degradation. Verify that quality monitoring detects the degradation and that automated responses (alerting, threshold tightening, traffic rerouting) activate.
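
A minimal sketch of the substitution, assuming (hypothetically) that your pipeline resolves model names through a single routing function rather than hard-coding them:

# Hypothetical override table: production model -> deliberately weaker substitute.
MODEL_OVERRIDES = {"prod-large-model": "small-cheap-model"}

def resolve_model(requested: str, chaos_enabled: bool = False) -> str:
    """Return the weaker substitute while the experiment runs; otherwise pass the request through."""
    if chaos_enabled and requested in MODEL_OVERRIDES:
        return MODEL_OVERRIDES[requested]
    return requested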

Context chaos. Inject oversized contexts, empty contexts, contradictory contexts, or stale contexts to test context management robustness. Verify that the system handles edge cases without producing confidently wrong outputs.

Load chaos. Spike request volume to test auto-scaling, rate limit management, and queuing behavior under load. AI systems have unique load characteristics because each request consumes significant compute, and concurrent requests compete for provider rate limits.

Running Experiments

A Firedrill experiment follows a structured workflow:

  1. Baseline measurement. Run the target system under normal conditions and record baseline metrics. These baselines are the reference point for evaluating experiment impact.

  2. Injection. Apply the chaos injection with defined parameters. The injection runs for a specified duration or until manual termination.

  3. Observation. Monitor system metrics in real time during the injection. Track the metrics defined in the experiment specification.

  4. Analysis. Compare injection-period metrics against baseline metrics. Evaluate against success criteria. Identify any unexpected behaviors.

  5. Remediation. If the experiment reveals weaknesses, document them and prioritize fixes. Re-run the experiment after fixes to verify improvement.

class FiredrillExperiment:
    def __init__(self, config: dict):
        self.config = config
        self.metrics_collector = MetricsCollector()
        self.injector = ChaosInjector(config["injection"])

    def run(self):
        # Phase 1: Baseline
        baseline = self.metrics_collector.collect(
            duration=self.config.get("baseline_minutes", 10)
        )

        # Phase 2: Injection
        self.injector.start()
        injection_metrics = self.metrics_collector.collect(
            duration=self.config["injection"]["duration_minutes"]
        )
        self.injector.stop()

        # Phase 3: Recovery
        recovery_metrics = self.metrics_collector.collect(
            duration=self.config.get("recovery_minutes", 10)
        )

        # Phase 4: Analysis
        report = self.analyze(baseline, injection_metrics, recovery_metrics)
        return report

    def analyze(self, baseline, injection, recovery):
        results = {}
        for criterion in self.config["success_criteria"]:
            results[criterion] = self.evaluate_criterion(
                criterion, baseline, injection, recovery
            )
        return FiredrillReport(results)
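
A hypothetical invocation, assuming the YAML experiment above has been saved to a file and that FiredrillReport exposes the per-criterion results as a mapping:

import yaml  # PyYAML

with open("experiments/anthropic-latency-spike.yaml") as f:  # hypothetical path
    config = yaml.safe_load(f)["experiment"]

report = FiredrillExperiment(config).run()
for criterion, passed in report.results.items():  # attribute name assumed for illustration
    print(f"{'PASS' if passed else 'FAIL'}: {criterion}")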

Building a Chaos Practice for AI

If you are new to chaos engineering for AI systems, here is how to get started:

Phase 1: Observability First

You cannot test resilience if you cannot measure it. Before running any chaos experiments, ensure you have:

  • Response quality metrics (accuracy, hallucination rate, format compliance) tracked continuously
  • Provider-level metrics (latency, error rate, token usage) per model and per provider
  • Pipeline-level metrics (end-to-end latency, cost per request, step success rates)
  • Retrieval quality metrics (precision, recall, relevance scores)

Without these metrics, chaos experiments produce anecdotes (“the system seemed slower”) rather than data (“p99 latency increased from 2.1s to 7.3s and fallback activated after 45 seconds”).

Phase 2: Start With Provider Chaos

Provider failures are the most common and the easiest to simulate. Start here:

  • Simulate a complete outage of your primary AI provider. Does your fallback chain activate? Does it produce acceptable quality?
  • Inject 3x latency on your primary provider. Does your circuit breaker trip? At what threshold?
  • Simulate rate limit responses (429 errors) at increasing frequencies. How does your rate limit management handle the escalation?

These experiments reveal whether your basic resilience infrastructure works. Most teams discover that their fallback chains have never actually been triggered in production and contain bugs that only surface under real failure conditions.

Phase 3: Test Quality Degradation

Once provider chaos is handled, test the subtler failure modes:

  • Replace your production model with a significantly weaker model. How quickly do your quality metrics detect the degradation?
  • Inject stale embeddings into your retrieval system. Does retrieval quality monitoring catch the degradation?
  • Run adversarial inputs through your system. Do your hallucination detection mechanisms trigger?

These experiments test whether your monitoring catches the failures that do not produce errors – the silent degradation that is uniquely dangerous in AI systems.

Phase 4: Regular Practice

Chaos engineering is not a one-time audit. It is an ongoing practice:

  • Run provider chaos experiments monthly. Provider behavior changes over time, and your resilience mechanisms need to keep pace.
  • Run quality degradation experiments after every model update (yours or the provider’s).
  • Run load chaos experiments before expected traffic increases.
  • Review experiment results in team retrospectives. Each experiment should either confirm that resilience works or produce a remediation ticket.

What We Have Learned

Building Firedrill and using it on our own systems and client systems has taught us several things:

Fallback chains are almost always broken the first time you test them. Teams configure fallback models and assume they work. The first chaos experiment reveals misconfigurations, compatibility issues, and quality gaps that only surface under actual failover.

Quality monitoring has too much latency. Most teams monitor AI quality through periodic batch evaluation (daily or weekly). When a model degrades, it can produce garbage for hours before the evaluation catches it. Real-time quality sampling – evaluating a percentage of live responses continuously – catches degradation much faster.
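
The sampling does not have to be elaborate. A sketch, with the sample rate and the scoring hook left as placeholders:

import random
from typing import Callable

SAMPLE_RATE = 0.02  # score roughly 2% of live responses

def maybe_score(prompt: str, response: str, enqueue_for_scoring: Callable[[str, str], None]) -> None:
    """Randomly hand a small fraction of live traffic to an asynchronous quality scorer."""
    if random.random() < SAMPLE_RATE:
        enqueue_for_scoring(prompt, response)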

Recovery is as important as resilience. How quickly does the system return to normal after a provider recovers? If your circuit breaker trips during a provider latency spike, does it automatically reset when latency returns to normal? Or does it stay tripped, routing all traffic to the fallback, until someone manually resets it? Recovery automation is as important as failover automation.

Compound failures are the real risk. A single provider outage is manageable. A provider outage combined with a retrieval quality degradation combined with a traffic spike is not. Multi-failure chaos experiments reveal the compound scenarios that overwhelm resilience mechanisms designed for single failures.

AI systems need chaos engineering more than traditional systems, not less. The failure modes are subtler, harder to detect, and more damaging because they produce wrong answers rather than error messages. If you are running AI in production and you have not tested what happens when things degrade, you are trusting luck. Luck is not an engineering strategy.

Firedrill exists because we needed to stop trusting luck with our own systems. It is the tool we wish we had before our first production AI failure, not the tool we built after it.