
Most engineering organizations approach observability like this: deploy Datadog (or Grafana, or New Relic), instrument services, create some dashboards, set up some alerts. Done. Observability is “solved.”
It is not solved. It is started. And the gap between “started” and “useful” is where most observability investments stall.
The problem is not the tooling. Modern observability tools are capable. The problem is the approach. Observability is treated as an infrastructure concern – something the platform or SRE team sets up and developers use when things break. This produces a collection of dashboards that are useful during incidents but invisible the rest of the time. The default state of observability in most organizations is ignored.
Observability should be treated as an internal product. It has users (engineers, managers, on-call responders). It has a user experience (how easy is it to go from “something is wrong” to “here is what is wrong and here is what to do about it”). It has a product lifecycle (discovery, adoption, engagement, iteration). When you treat observability as a product, you build something that engineers use every day, not just during outages.
Dashboards are the default deliverable of observability efforts, and they are the wrong abstraction for most use cases.
A dashboard shows you what is happening. It does not tell you what is wrong, why it is wrong, or what to do about it. A dashboard with 30 panels showing CPU utilization, memory usage, request rates, error rates, latency percentiles, and queue depths is comprehensive but not useful. An engineer looking at this dashboard during an incident needs to work out which panels are relevant, decide which readings are actually abnormal, correlate those anomalies with recent changes, and form a hypothesis about the cause.
This process takes 5-30 minutes for experienced engineers and longer for less experienced ones. During an incident, those minutes matter. And this is the optimistic case where the right dashboard exists and the right panels are on it.
The alternative is to build observability that answers questions, not just displays data.
The shift from dashboards to observability-as-product follows a maturity curve:
Level 1: Data collection. You are collecting logs, metrics, and traces from your services. You have a centralized observability platform. This is the baseline, and most organizations achieve it.
Level 2: Dashboards and alerts. You have dashboards for key services and alerts for critical thresholds. Engineers use dashboards during incidents and ignore them otherwise. Alerts fire frequently enough that some are ignored (alert fatigue). This is where most organizations plateau.
Level 3: Contextual alerting. Alerts include context: what changed, what is affected, who owns it, what the recent deployment history looks like. Instead of “Error rate on service X exceeded 5%,” the alert says “Error rate on service X exceeded 5% starting at 14:23. Deploy abc123 went out at 14:20 affecting the /checkout endpoint. Service owner: payments-team. Runbook: [link].” This level requires correlation between observability data and change management data.
Level 4: Automated investigation. When an alert fires, an automated investigation runs immediately. The system checks recent deployments, upstream dependency health, infrastructure status, and historical patterns. By the time a human responds to the alert, a preliminary investigation report is ready. This does not replace human judgment – it accelerates it.
Level 5: Proactive insights. The system identifies problems before they become incidents. Traffic patterns suggest an upcoming capacity issue. Error rates are trending upward but have not hit the alert threshold yet. Latency is gradually increasing in a way that correlates with a known memory leak pattern. The system surfaces these observations proactively, not reactively.
Most organizations are at Level 2. The gap between Level 2 and Level 4 is not tooling – it is product thinking. Building an observability product means intentionally designing the experience at each level and investing in the transitions.
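To make the jump from Level 2 to Level 3 concrete, here is a minimal sketch of alert enrichment. The alert, deploy_history, and service_catalog objects are hypothetical stand-ins for whatever alerting, change-management, and ownership data you already have; the field names are illustrative, not any particular vendor's API.

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class EnrichedAlert:
    message: str          # the original alert text
    service: str
    started_at: datetime
    recent_deploys: list  # deploys in the window before the alert fired
    owner: str
    runbook_url: str

def enrich_alert(alert, deploy_history, service_catalog, window_minutes=30):
    # Attach change and ownership context to a raw threshold alert.
    window_start = alert.started_at - timedelta(minutes=window_minutes)
    recent = [
        d for d in deploy_history.for_service(alert.service)  # hypothetical lookup
        if d.deployed_at >= window_start
    ]
    return EnrichedAlert(
        message=alert.message,
        service=alert.service,
        started_at=alert.started_at,
        recent_deploys=recent,
        owner=service_catalog.owner_of(alert.service),           # hypothetical lookup
        runbook_url=service_catalog.runbook_for(alert.service),  # hypothetical lookup
    )

The shape matters more than the code: the notification that reaches a human already carries the deploy list, the owner, and the runbook, which is exactly the Level 3 alert described above.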
Observability has distinct user personas with different needs:
On-call responders need a fast path from alert to root cause. They want: what is broken, since when, what changed, and what should I do. They are working under time pressure during incidents.
Service owners need health visibility for their services during normal operations. They want: is my service healthy, are there any trends I should worry about, how did the last deployment affect performance.
Engineering managers need aggregate health metrics and trend data. They want: how reliable are our systems, are we improving or degrading, where should we invest engineering effort.
Platform engineers need infrastructure-level visibility and capacity planning data. They want: how are shared resources performing, where are the bottlenecks, what capacity do we need next quarter.
Each persona needs a different interface, different data, and different level of detail. A single dashboard does not serve any of them well.
Instead of building dashboards for services, build workflows for tasks:
Incident investigation workflow. Starting from an alert, guide the responder through a structured investigation: what changed, what is affected, what is the blast radius, what are the likely causes, what are the remediation options. Each step in the workflow presents relevant data without requiring the responder to know which dashboard to check.
Deployment verification workflow. After a deployment, automatically compare key metrics (error rate, latency, throughput) between the pre-deployment and post-deployment periods. Surface any statistically significant changes with context about what was deployed; a minimal sketch of this comparison appears below. This replaces the manual process of checking dashboards after each deploy.
Service health review workflow. A weekly or monthly view of service health trends: reliability metrics, latency trends, error budget consumption, dependency health. This is the interface for service owners doing routine health checks rather than incident response.
Capacity planning workflow. Historical resource utilization with trend projection. Identify services approaching capacity limits before they become performance issues. This is the interface for platform engineers and managers.
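For the deployment verification workflow, the core comparison can be small. This is a rough sketch under the assumption that you can pull equal-length windows of per-minute samples before and after the deploy; it uses a crude Welch-style z check rather than a proper statistical test, and the threshold is illustrative.

import statistics
from math import sqrt

def deployment_regressed(pre_samples, post_samples, z_threshold=3.0):
    # pre_samples / post_samples: per-minute values (e.g. error rate) for
    # equal-length windows before and after the deploy. Assumes higher is worse.
    pre_mean, post_mean = statistics.mean(pre_samples), statistics.mean(post_samples)
    stderr = sqrt(
        statistics.variance(pre_samples) / len(pre_samples)
        + statistics.variance(post_samples) / len(post_samples)
    ) or 1e-9
    z = (post_mean - pre_mean) / stderr
    return z > z_threshold

Run it for error rate and latency (and throughput with the comparison inverted, since lower throughput is worse) and post the verdict to the deploy channel; that is the whole workflow.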
If observability is a product, it should have product metrics:
Time to detection (TTD). How long between a problem starting and someone being alerted? Lower is better. This measures the effectiveness of your monitoring and alerting.
Time to investigation (TTI). How long between an alert firing and an engineer having a preliminary hypothesis? This measures the quality of your contextual alerting and investigation tooling.
Time to resolution (TTR). How long between detection and resolution? This is the end-to-end metric, influenced by TTD, TTI, and the actual fix time.
Alert signal-to-noise ratio. What percentage of alerts correspond to real problems? Low signal-to-noise drives alert fatigue, which undermines the entire observability system.
Daily active users. How many engineers interact with the observability system on non-incident days? High daily usage indicates the system provides value beyond incident response – it is part of the engineering workflow.
Track these metrics and optimize for them the same way you would optimize a customer-facing product.
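None of these metrics require special tooling to start tracking. A minimal sketch, assuming incident records carry started_at, detected_at, and resolved_at timestamps and alert records carry an actionable flag (field names are illustrative):

from statistics import median

def observability_product_metrics(incidents, alerts):
    # Median TTD (start to detection) and TTR (detection to resolution) in
    # seconds, plus the fraction of alerts that were actionable.
    ttd = median((i.detected_at - i.started_at).total_seconds() for i in incidents)
    ttr = median((i.resolved_at - i.detected_at).total_seconds() for i in incidents)
    actionable = sum(1 for a in alerts if a.actionable)
    return {
        "median_ttd_seconds": ttd,
        "median_ttr_seconds": ttr,
        "alert_signal_to_noise": actionable / len(alerts) if alerts else 0.0,
    }

Review the numbers monthly, the way you would review activation or retention for a customer-facing product.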
The intersection of AI and observability is one of the most practical applications of AI in engineering operations. The data volumes in modern observability are too large for human processing, and the patterns are too complex for static rules.
Traditional alerting uses static thresholds: alert if error rate exceeds 5%. This produces false positives when normal traffic patterns cause expected metric variations (weekend dips, nightly batch processing spikes) and false negatives when a problem slowly pushes metrics toward the threshold without crossing it.
AI-powered anomaly detection learns normal patterns for each metric and alerts on deviations from the learned baseline. A 3% error rate at 2 AM might be normal (low traffic, higher per-request variance). The same 3% error rate at 2 PM might be anomalous (high traffic, normally below 1%).
The implementation does not require exotic ML. Seasonal decomposition, isolation forests, and simple statistical methods (rolling z-scores) handle the majority of anomaly detection use cases. The key is training per-metric baselines that account for time-of-day, day-of-week, and seasonal patterns.
from datetime import datetime, timedelta

class AnomalyDetector:
    """Flags values that deviate from a learned seasonal baseline."""

    def __init__(self, metric_name: str, lookback_days: int = 30):
        self.metric_name = metric_name
        self.baseline = self._build_baseline(lookback_days)

    def is_anomalous(self, current_value: float, timestamp: datetime) -> bool:
        # Compare the observed value against what the baseline expects
        # for this time of day and day of week.
        expected = self.baseline.expected_value(timestamp)
        std_dev = self.baseline.expected_std(timestamp)
        z_score = abs(current_value - expected) / max(std_dev, 0.001)
        return z_score > 3.0  # Configurable threshold

    def _build_baseline(self, lookback_days):
        # metrics_store is the platform's query client; SeasonalBaseline is
        # sketched below.
        historical = metrics_store.query(
            self.metric_name,
            start=datetime.utcnow() - timedelta(days=lookback_days),
        )
        return SeasonalBaseline.fit(historical)
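The SeasonalBaseline used above does the real work and is not shown. A minimal version, assuming the metrics store returns (timestamp, value) pairs, buckets history by hour-of-week:

from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

class SeasonalBaseline:
    """One (mean, std) pair per hour-of-week bucket; 30 days of history covers all 168 buckets."""

    def __init__(self, buckets):
        self._buckets = buckets

    @classmethod
    def fit(cls, historical):
        grouped = defaultdict(list)
        for ts, value in historical:
            grouped[ts.weekday() * 24 + ts.hour].append(value)
        return cls({k: (mean(v), pstdev(v)) for k, v in grouped.items()})

    def expected_value(self, timestamp: datetime) -> float:
        return self._buckets[timestamp.weekday() * 24 + timestamp.hour][0]

    def expected_std(self, timestamp: datetime) -> float:
        return self._buckets[timestamp.weekday() * 24 + timestamp.hour][1]

This is deliberately unsophisticated. Reach for seasonal decomposition or isolation forests when the hour-of-week buckets start producing false positives, not before.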
When an anomaly is detected, AI can accelerate root cause identification by:
Correlation analysis. Automatically identify other metrics that changed at the same time. If service A’s error rate spiked at the same time as service B’s latency increased, there is likely a causal relationship. Correlating across hundreds of metrics in seconds is trivial for a machine and impossible for a human.
Change correlation. Cross-reference anomaly timing with change events: deployments, configuration changes, infrastructure modifications, provider updates. The change most proximate to the anomaly onset is the most likely cause (sketched below).
Historical pattern matching. Compare the current anomaly signature (which metrics are affected, how they are affected, what the temporal pattern looks like) against historical incidents. If this pattern has been seen before, surface the previous root cause and resolution as a starting point.
LLM-powered synthesis. Feed the correlation analysis, change data, and historical patterns into a language model to produce a human-readable investigation summary. The LLM does not diagnose the problem – it synthesizes the available evidence into a coherent narrative that the on-call engineer can act on quickly.
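As a sketch of the change correlation step, assuming each change event exposes an occurred_at timestamp and a description (the field names are illustrative):

from datetime import datetime, timedelta

def rank_suspect_changes(anomaly_start: datetime, change_events, window_hours=6):
    # Keep only changes that landed in the window before the anomaly began,
    # and rank the nearest preceding change first.
    window_start = anomaly_start - timedelta(hours=window_hours)
    candidates = [
        c for c in change_events
        if window_start <= c.occurred_at <= anomaly_start
    ]
    return sorted(candidates, key=lambda c: anomaly_start - c.occurred_at)

The ranked candidates, the correlated metrics, and any matching historical incidents become the structured evidence handed to the LLM synthesis step.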
AI systems add observability requirements beyond traditional services:
Model performance monitoring. Track model-specific metrics: inference latency, token usage, output quality scores, hallucination indicators, confidence distributions. These metrics are not standard in observability platforms and require custom instrumentation.
Prompt and context quality. Monitor the quality of inputs to your AI components: retrieval relevance scores, context window utilization, prompt token counts. Degradation in input quality predicts degradation in output quality, often before the output degradation is measurable.
Cost observability. AI API costs are a first-class operational concern. Track cost per request, cost per user, cost per feature. Alert on cost anomalies (a bug that causes infinite retry loops will show as a cost spike before it shows as an error rate increase). A minimal tracking sketch appears below.
Provider performance. Track per-provider metrics: availability, latency, quality, cost. This data feeds into multi-model routing decisions and capacity planning.
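Here is a minimal sketch of the cost tracking side, with illustrative model names and per-1K-token prices (real prices vary by provider and model, and the baseline source is assumed to exist elsewhere):

from collections import defaultdict

PRICE_PER_1K_TOKENS = {"model-a": 0.01, "model-b": 0.002}  # illustrative prices

class CostTracker:
    def __init__(self, spike_factor: float = 3.0):
        self.spend_by_feature = defaultdict(float)
        self.baseline_by_feature = {}  # trailing daily average, loaded elsewhere
        self.spike_factor = spike_factor

    def record_call(self, feature: str, model: str, total_tokens: int) -> None:
        cost = (total_tokens / 1000.0) * PRICE_PER_1K_TOKENS[model]
        self.spend_by_feature[feature] += cost

    def cost_anomalies(self) -> list:
        # Features spending more than spike_factor times their baseline;
        # the retry-loop bug shows up here first.
        return [
            feature
            for feature, spent in self.spend_by_feature.items()
            if spent > self.spike_factor * self.baseline_by_feature.get(feature, float("inf"))
        ]

The same per-call records, aggregated by provider instead of feature, feed the provider performance view.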
At CONFLICT, we treat AI observability as a core component of every system we build. CalliopeAI, our AI workbench, includes built-in observability for model calls, token usage, and quality metrics. This is not an afterthought – it is part of the architecture because we have learned that AI systems without observability are AI systems that fail silently.
If you are starting an observability-as-product initiative, here is a practical roadmap:
Quarter 1: Contextual alerts. Enrich existing alerts with context: recent deployments, dependency status, service ownership, runbook links. This is the highest-impact, lowest-effort improvement. Engineers spend less time investigating because the alert itself contains the starting context.
Quarter 2: Automated investigation. Build automated investigation workflows that run when alerts fire. Start with the most common alert types and build investigation logic that checks the usual suspects (deployments, dependency health, traffic patterns). Deliver the investigation results alongside the alert.
Quarter 3: Anomaly detection. Replace static threshold alerts with learned baselines for key metrics. This reduces false positives (alert fatigue) and catches slow-moving degradation that static thresholds miss.
Quarter 4: Proactive insights. Build trend detection and capacity forecasting. Surface problems that are developing but have not triggered alerts yet. This shifts the observability posture from reactive to proactive.
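A minimal sketch of the capacity forecasting piece, using a linear fit over daily utilization averages (statistics.linear_regression needs Python 3.10+; the 80% limit is illustrative):

from statistics import linear_regression  # Python 3.10+

def days_until_capacity(daily_utilization, limit=0.8):
    # daily_utilization: daily averages between 0.0 and 1.0, oldest first.
    # Returns days from today until the trend crosses `limit`,
    # or None if the trend is flat or declining.
    days = list(range(len(daily_utilization)))
    slope, intercept = linear_regression(days, daily_utilization)
    if slope <= 0:
        return None
    crossing_day = (limit - intercept) / slope
    return max(0.0, crossing_day - (len(daily_utilization) - 1))

Surfacing “service X crosses 80% utilization in roughly three weeks” in a weekly report is the shift from reactive to proactive that this quarter is about.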
Each quarter delivers user-facing value. Each quarter builds on the previous one. This is how product development works, and it is how observability development should work too.
When observability is treated as a product, the outcomes are measurable: detection and resolution times come down, alert signal-to-noise improves, and engineers use the system on normal days, not just during incidents.
These outcomes compound. Engineers who trust their observability system catch problems earlier, resolve them faster, and prevent more incidents. Engineers who do not trust their observability system ignore alerts, rely on user reports to detect problems, and spend incident time hunting through dashboards.
Observability is not a dashboard. It is not a tool you deploy and forget. It is an internal product that deserves the same product thinking, user research, and iteration cycles that you give to your customer-facing products. Build it like a product, and your engineers will use it like one.