Are you really “fine” at a 1.2% error rate?

Rethinking video streaming observability with agentic investigation

Have you ever sat in front of a beautiful dashboard, staring at an error rate of 1.2%, and wondered what it actually means for your service?

Is this a good number? Should it be lower? Why is it not at zero? And more importantly, if it is too high, what can you realistically do about it?

The reality of video streaming observability

Observing a video streaming service is inherently complex, and that complexity goes far beyond simply having access to dashboards.

Even with modern observability tooling, you are dealing with multiple dashboards that each show only a partial view of reality. Data comes from across the entire delivery chain, from encoder to packager to CDN and finally the client. Bringing all of these signals together into a coherent understanding still requires a significant amount of manual correlation.

The real challenge, however, goes even deeper.

Streaming errors are not consistent. The same underlying issue can be reported differently depending on the platform, whether it is iOS, Android, a Smart TV, or a browser. On top of that, different players describe issues from their own perspective, often resulting in multiple symptoms for the same root cause. In many cases, these signals are further obscured by cryptic or misleading error codes, as anyone who has worked with AVKit will immediately recognize.

As a result, you constantly find yourself asking whether you are looking at three separate problems or at one issue that simply manifests itself in different ways.

To make sense of all this, you need a deep understanding of video streaming architectures, network behavior and ISP characteristics, as well as the quirks of different devices and client implementations. In practice, this knowledge often resides with a single engineer who is expected to interpret dashboards and decide whether that 1.2% error rate is actually something to worry about.

To avoid spending the entire day staring at dashboards, engineers typically rely on alerts to highlight potential issues. However, these alerts are often based on arbitrary thresholds that fail to capture the nuances of real-world streaming behavior. As a result, long-tail delivery issues remain invisible, while both dashboard fatigue and alert fatigue continue to grow. In this situation, the last thing anyone needs is yet another tool that presents the same data from a slightly different angle.
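A quick back-of-the-envelope calculation shows why thresholds on aggregates miss these issues. The numbers below are purely illustrative, not taken from any real service: a small cohort that fails completely barely moves the aggregate.

```python
# Illustrative numbers only: a fully failing cohort hides in the aggregate.
baseline_rate = 0.009   # 0.9% background error rate for healthy traffic
cohort_share = 0.003    # 0.3% of sessions hit a broken delivery path
cohort_rate = 1.0       # every session in that cohort fails

aggregate = (1 - cohort_share) * baseline_rate + cohort_share * cohort_rate
print(f"aggregate error rate: {aggregate:.2%}")  # ~1.2%, looks "normal"
```

An aggregate alert threshold set anywhere above 1.2% would never fire, even though every single viewer on the broken path is affected.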

There must be a better way

While working with our customers on detecting and debugging streaming incidents, we kept coming back to a simple but fundamental question.

Why do humans have to actively search for issues in the first place? Why should it not be the other way around, where the issue finds you?

Meet Focus: an agentic approach to observability

This thinking led us to build something fundamentally different.

Instead of following the traditional flow from dashboards to human interpretation to insights, we built:

Focus — an ML- and AI-based agentic observability framework for video streaming

We believe you should not have to find the problems. The problems should find you. Focus enables exactly this.

Data is the foundation, but not the solution

Capturing detailed streaming telemetry is the foundation of any observability system.

Standards such as CMCDv2 will play an important role in bringing more structure and consistency to this data, enabling a more unified view across the ecosystem.
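As a rough illustration of what that structure looks like, here is a minimal parser for a CMCD-style key/value payload. The keys shown are from CMCD v1 (CTA-5004); v2 extends the set, and a production parser would handle quoted commas and the full key registry.

```python
from urllib.parse import unquote

def parse_cmcd(value: str) -> dict:
    """Parse a CMCD key/value string into a dict.

    Valueless keys (e.g. 'bs' for buffer starvation) become True,
    quoted values become strings, bare numbers become ints.
    Sketch only: commas inside quoted strings are not handled.
    """
    out = {}
    for part in unquote(value).split(","):
        if "=" not in part:
            out[part] = True  # boolean key, true when present
            continue
        key, raw = part.split("=", 1)
        if raw.startswith('"') and raw.endswith('"'):
            out[key] = raw[1:-1]  # quoted string value
        else:
            try:
                out[key] = int(raw)  # integer values such as bitrate (kbps)
            except ValueError:
                out[key] = raw  # token values stay as strings
    return out

# A CMCD v1-style payload: bitrate, buffer length, buffer starvation, session id
sample = 'br=3200,bl=11500,bs,sid="6e2fb550-c457-11e9-bb97-0800200c9a66"'
parsed = parse_cmcd(sample)
```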

However, even with perfectly structured and complete data, the core problem remains.

Data alone does not solve the problem.

Finding the invisible through ML driven clustering

To move beyond dashboards, we built a set of machine learning algorithms that identify clusters of interest across all streaming sessions.

Instead of focusing on predefined thresholds, these algorithms look at the data from multiple perspectives and uncover patterns that would otherwise remain hidden. This can include degraded ISP performance in a specific region, a failing media item buried deep in the catalog with only a handful of daily views, DRM subtitle issues affecting a very specific device configuration, or delivery inconsistencies in a multi-CDN setup.

These are real-world issues that impact actual viewers, yet they remain hidden behind an aggregated error rate such as 1.2%. In practice, they are almost never discovered manually, simply because no engineer will spend hours slicing and filtering dashboards at that level of granularity.
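As a toy illustration of the idea (not the actual Focus algorithms), sessions can be grouped along a few dimensions and each cohort's failure rate compared against the global baseline, so that small but badly broken cohorts stand out without any predefined threshold:

```python
from collections import defaultdict
from math import sqrt

def flag_cohorts(sessions, min_size=50, z_cutoff=4.0):
    """Group sessions by (isp, region, device) and flag cohorts whose
    failure rate sits far above the global rate (a two-proportion
    z-style check). Toy sketch; real pipelines cluster on many more axes."""
    global_rate = sum(s["failed"] for s in sessions) / len(sessions)
    groups = defaultdict(list)
    for s in sessions:
        groups[(s["isp"], s["region"], s["device"])].append(s["failed"])
    flagged = []
    for key, fails in groups.items():
        n = len(fails)
        if n < min_size:
            continue  # too small to judge reliably
        rate = sum(fails) / n
        se = sqrt(global_rate * (1 - global_rate) / n)
        if se > 0 and (rate - global_rate) / se > z_cutoff:
            flagged.append((key, rate))
    return flagged

# 1000 healthy sessions (1% failures) plus a small cohort that always fails
sessions = (
    [{"isp": "IspA", "region": "EU", "device": "web", "failed": i % 100 == 0}
     for i in range(1000)]
    + [{"isp": "IspB", "region": "EU", "device": "stb", "failed": True}
       for _ in range(60)]
)
print(flag_cohorts(sessions))  # only the always-failing cohort is flagged
```

The broken cohort here is 60 sessions out of 1060, far too small to move any global metric, yet it is the only group the check surfaces.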

Grounded in domain knowledge

Detecting patterns is only useful if they can be interpreted correctly. Every investigation in Focus is enriched with two layers of knowledge.

The first is a general understanding of video streaming, covering how protocols behave, which codecs are supported on which platforms, and what typical failure patterns look like. The second, and arguably more important, layer is deeply integrated customer-specific context.

A good example is geo delivery. For one of our customers, the knowledge base defines exactly which countries streams are licensed to be delivered to. Sessions from outside these regions are intentionally blocked. Without this context, playback failures from a blocked country would look like a real incident. With it, the agents immediately recognize the behavior as expected and move on, avoiding a false alarm that would otherwise waste engineering time.

The same principle applies across the board, whether it involves known client quirks on a specific device, framework behaviors that produce misleading error codes, or platform-specific edge cases in a multi-CDN setup. By embedding this operational knowledge directly into the system, we ensure that each investigation is interpreted in the correct context.

In essence, we codified the knowledge that a senior video streaming engineer would need in order to correctly interpret all of this data.

FOCUS: Customer Knowledge Base

From signals to real issues through agentic investigation

Not every detected anomaly is relevant. Some signals are simply noise, such as users on unstable Wi-Fi connections, bots crawling the catalog, or short-lived CDN delivery issues.

This is where the agentic investigation framework becomes critical.

Armed with this context, a set of specialized AI agents investigates each detected cluster in depth, rather than immediately triggering alerts. Each agent focuses on a specific aspect of the problem:

    • A session analysis agent that inspects playback sessions in detail and reconstructs what actually happened from the client perspective
    • A web intelligence agent that looks beyond the system and checks whether similar issues have been reported elsewhere
    • A video analysis agent that validates manifests, codecs, DRM behavior, and media integrity
    • A playback agent that attempts to reproduce the issue on real devices, for example using BrowserStack or reference players
    • Additional agents that focus on specific aspects such as CDN behavior, timing patterns, or asset-level anomalies

FOCUS: Agentic Incident Investigation
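The orchestration behind this can be sketched as follows. The names and structure here are hypothetical, purely to illustrate the pattern of validating a cluster through multiple specialists before alerting:

```python
from dataclasses import dataclass

# Hypothetical structure; the real Focus agents are not public.
@dataclass
class Finding:
    agent: str
    relevant: bool
    summary: str

def investigate(cluster: dict, agents: list) -> list:
    """Run each specialist agent on a detected cluster and keep only the
    findings judged relevant. A cluster with no relevant findings is
    treated as noise and never raises an alert."""
    return [f for agent in agents if (f := agent(cluster)).relevant]

# Toy stand-ins for the session-analysis and playback agents
def session_agent(cluster) -> Finding:
    return Finding("session", cluster["error_rate"] > 0.05, "elevated failures")

def playback_agent(cluster) -> Finding:
    return Finding("playback", cluster.get("reproduced", False), "reproduced on device")

findings = investigate({"error_rate": 0.08, "reproduced": True},
                       [session_agent, playback_agent])
print([f.agent for f in findings])  # ['session', 'playback']
```

Only clusters that survive this multi-agent scrutiny move on to the alerting stage, which is what keeps noise out of the alert stream.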

From detection to action

Once a signal has been validated as a relevant issue, the system moves beyond detection.

It automatically creates a fully enriched ticket that includes a clear description of the problem, its impact on users, and the most likely root cause. This information is then delivered directly into the tools that teams already use, whether that is Jira, Slack, Teams, email, or other systems.

There is no need to switch environments, as the system integrates directly into existing workflows.
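A minimal sketch of this step, with illustrative field names rather than the actual Focus schema, could look like this:

```python
import json
import urllib.request

def build_ticket(issue: dict) -> dict:
    """Assemble an enriched ticket payload from a validated issue.
    The field names are illustrative, not an actual Focus schema."""
    return {
        "title": issue["title"],
        "description": issue["description"],
        "impact": issue["impact"],
        "likely_root_cause": issue["root_cause"],
        "evidence": issue.get("evidence", []),
    }

def deliver(ticket: dict, webhook_url: str) -> None:
    """POST the ticket as JSON to a chat or ticketing webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(ticket).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example payload loosely based on one of the issues described below;
# the root cause shown here is invented for illustration.
ticket = build_ticket({
    "title": "Periodic HTTP 504 spikes",
    "description": "4-9% of sessions fail at regular 30-minute intervals.",
    "impact": "Recurring playback failures for affected viewers",
    "root_cause": "Suspected origin timeout during a scheduled job",
})
```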

FOCUS: Agentic Incident Ticket Creation

The impact

With this approach, teams are able to detect issues before users report them and significantly reduce detection and response times. In real-world scenarios, we have seen reductions of up to 82 percent.

Real issues found in the last two weeks

The following examples illustrate the kind of issues that were identified recently. None of them were major outages, but all of them had a measurable impact on user experience.

    • A scheduled Origin Shield update at 4:05 in the morning that caused session drops of up to 64 percent on a daily basis
    • DRM-encrypted subtitles that broke playback on a specific set-top box when subtitles were enabled
    • A missing media cohort on one CDN in a multi-CDN delivery setup
    • Subtle HTTP 504 error spikes that caused 4 to 9 percent of sessions to fail at regular 30-minute intervals

None of these issues were obvious from dashboards alone, and in many cases they would have gone unnoticed for a long time.
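Patterns like the last one are easy to confirm once suspected: bin the error counts per minute and check the autocorrelation at the 30-minute lag. A sketch with synthetic data:

```python
from statistics import mean

def autocorr(series, lag):
    """Sample autocorrelation of a count series at a given lag."""
    m = mean(series)
    num = sum((series[i] - m) * (series[i - lag] - m)
              for i in range(lag, len(series)))
    den = sum((x - m) ** 2 for x in series)
    return num / den if den else 0.0

# Synthetic per-minute 504 counts: baseline of 1, a spike of 20 every 30 min
counts = [20 if i % 30 == 0 else 1 for i in range(240)]
print(round(autocorr(counts, 30), 2))  # strong correlation at the 30-min lag
```

The hard part, of course, is suspecting the pattern in the first place, which is exactly what the clustering stage is for.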

Why not just use AI?

A common question is why not simply apply AI to the data and let it figure things out. The reality is that without the right structure, context, and domain-specific knowledge, this does not work in practice.

👉 Ralf explains this in more detail here.

See it in action

If you are interested in seeing how this works in practice, we will be showcasing the solution at NAB. Alternatively, we are happy to provide a personal walkthrough at any time.

Why this matters

At the end of the day, addressing these kinds of real-world, long-tail delivery issues is what separates good streaming services from truly great ones.

If you want to compete in the Champions League of video streaming, this level of observability is not optional. Netflix and YouTube already operate at this level, supported by large engineering teams and significant resources.

Whether we like it or not, these are the benchmarks, and they are also your competitors.

 

Final note

A huge thank you to the whole team, and especially to Andreas Sapountzis and László Schmidt, for building this continuously evolving system.

By moving away from dashboards and toward automated, actionable insights, we enable engineers to focus on what really matters, which is improving the quality and reliability of their streaming services.

Curious to learn more? Let’s talk.