
The Problem with OpenTelemetry

Edit: tl;dr tracing should exist in every ecosystem, and be broken out of OpenTelemetry.

I regularly complain about OpenTelemetry, so with an aim to be a less useless contributor, today I’m putting pen to paper. If you’re an implementer, I ask you to read this and take away the personal bias you might have towards your work, and instead look objectively at the feedback being given.

First, some context: if you’re clueless how you ended up here, I started Sentry. Sentry, for obvious reasons, has a stake in this “instrument your application” race. That said, with everything I’ve seen, I cannot say OpenTelemetry is the horse I want to put my money on. That’s where this conversation starts.

In 2015 Armin and I built a spec for Distributed Tracing. It’s not a hard problem, it just requires an immense amount of coordination and effort. At its core it’s structured events that carry two GUIDs along with them: a trace ID and a parent event ID. It is just building a tree. The harder part is stabilizing those annotations across an infinitely growing ecosystem of applications, libraries, and various services. Around the same time we were working on our spec, the folks at Lightstep were kicking things into gear. They built an open specification called OpenTracing. The goal, as far as I understand it, was to create a standard approach to instrumentation allowing vendors and customers to avoid rewriting instrumentation.
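To make the "it is just building a tree" point concrete, here is a minimal sketch in TypeScript. The field names (`traceId`, `eventId`, `parentId`) and the `buildTraces` helper are made up for illustration; they are not taken from the spec mentioned above or from any SDK.

```typescript
// Hypothetical event shape: every event carries a trace ID and a parent event ID.
interface TraceEvent {
  traceId: string;
  eventId: string;
  parentId: string | null; // null marks the root of a trace
  name: string;
}

interface TraceNode {
  event: TraceEvent;
  children: TraceNode[];
}

// Link each event to its parent and return the root nodes, grouped by trace ID.
function buildTraces(events: TraceEvent[]): Map<string, TraceNode[]> {
  const nodes = new Map<string, TraceNode>();
  for (const event of events) {
    nodes.set(event.eventId, { event, children: [] });
  }

  const roots = new Map<string, TraceNode[]>();
  nodes.forEach((node) => {
    const { traceId, parentId } = node.event;
    const parent = parentId !== null ? nodes.get(parentId) : undefined;
    if (parent && parent.event.traceId === traceId) {
      parent.children.push(node);
    } else {
      // Either a true root, or an event whose parent we never received.
      const list = roots.get(traceId) ?? [];
      list.push(node);
      roots.set(traceId, list);
    }
  });
  return roots;
}
```

The data structure really is that small. The hard part, as noted above, is getting every library and service to emit those two IDs consistently.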

The goal makes sense to both versions of myself: a leader at Sentry and a developer of software.

Along the way OpenTracing became OpenTelemetry, merging with OpenCensus. I don’t have a lot of interest in how that decision was made, or what the hypothetical goals were, but I care about the outcomes of the original goal of an open tracing standard. You see, sometime along the way, OpenTelemetry became a much wider-reaching specification, well beyond the tracing goals. It tries to build specifications and SDKs for not only span annotations, but universal transports for logs and metrics, and who knows what else in the future. It is my personal opinion - not that of Sentry - that it has lost its way, and is a classic example of the failure of design by committee. It’s an example of a lack of vision, and a lack of leadership.

Now I say that not knowing the people involved in the project, so don’t consider this an attack on the quality of their work, their desire to help the ecosystem, or anything for that matter. The way I look at it is it’s a bunch of individuals and corporations trying to service their own goals, with some degree of overlap. Many of them are vendors doing whatever they can to achieve PMF, begging the OpenTelemetry gods to feed them more data and customers. A lot of others are the outcasts of the Datadogs, looking to bring their instrumentation to another partner who doesn’t want to middle finger them every quarter when the earnings calls come in.

The problem I see is that OpenTelemetry doesn’t have an end state that we’d all agree upon, and I personally do not believe strategic alignment nor vision are possible without being able to articulate an end state. When OpenTracing started I could envision what that might have been. I could see the issue that was trying to be solved. Now all I see is a standards committee for something that’s not remotely a standard, and that I struggle to ever see becoming one. I stated I wanted this to be more than complaining, so I’m going to articulate a problem, a solution, and why OpenTelemetry is failing to live up to the goal I’ve stated above (which may not be their goal, but certainly reflects what our customers want).

Back in the early 2000s if you wanted to reasonably debug performance issues in production systems you were left with two big tasks:

  1. Identifying a vendor or building a centralized APM-like solution
  2. Annotating your code all over the place with span-esque data structures

The problem fundamentally came down to the annotations. A lot of them were actually specific to vendor SDKs. None of that instrumentation was portable, well at least not at face value. Realistically a bunch of codemods would easily let you swap out a vendor specific set of annotations with another vendor’s, so it wasn’t as bad as people make it out to be. Fast forward however, and technology has gotten far more complicated. Everyone and their mother is running a shoddy microservice-coupled stack, creating a lot more of a challenge in debugging these production systems. Even outside of performance concerns, something as simple as a stacktrace for an error is often not enough anymore. So we needed something better. We needed the tried and true tracing techniques to become readily available to you and me.

Now from a vendor point of view, nothing really changed. Vendors are still responsible for ensuring that your application is well instrumented. That’s not your problem, at least not entirely. So vendors got together and said, you know what’d be great? We build this set of standards, rope in third parties to implement them, and solve the instrumentation angle once and for all. That’s a great goal, but practically speaking no one has really committed to that. Authors of libraries rarely instrument their code, and the upstream instrumentation (often via monkey patching) looks more or less the same as it has for 20 years.

On top of that, one of the goals is to make it so you, as a developer, have portable instrumentation that goes between vendors. I’m all for this goal, but if you’ve looked at the APIs and specs made available via OpenTelemetry it’s fairly obvious why adoption is struggling. There are so many concepts you have to master, and even as an experienced developer you’re going to rightly, and quickly, question why some of them are relevant to you. You as a customer almost certainly should only ever need to care about span instrumentation, and in some cases, forwarding baggage (particularly forwarding the trace ID). Everything else is a vendor’s problem, but the spec is plagued with these concepts that must be considered by anyone implementing span annotations, and that person is not the vendor.
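For a sense of how little "forwarding the trace ID" needs to involve, here is a hedged sketch. The post does not name a header format, so this assumes the W3C Trace Context `traceparent` header (`00-<trace id>-<parent id>-<flags>`). The helper names (`parseTraceparent`, `formatTraceparent`, `childOf`) are hypothetical and not part of any SDK.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Version 00 of the traceparent header: version-traceid-parentid-flags.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Only the sampled flag is preserved in this sketch.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Not cryptographically strong; good enough for a sketch.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Same trace, new span ID: this is what gets sent on the outgoing request.
function childOf(parent: TraceContext): TraceContext {
  return { traceId: parent.traceId, spanId: randomHex(16), sampled: parent.sampled };
}
```

A service would parse the incoming header, call `childOf` for its own span, and send `formatTraceparent(child)` downstream. That is roughly the whole surface a customer should have to think about.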

To take it even further, OpenTelemetry is so far beyond tracing, even though it has yet to achieve traction within that original scope. It’s trying to create standards for logging and metrics, neither of which exists in the context of many systems. Logs are just events - which is exactly what a span is, btw - and metrics are just abstractions out of those event properties. That is, you want to know the response time of an API endpoint? You don’t rewind 20 years and increment a counter, you instead aggregate the duration of the relevant span segment. Somehow though, Logs and Metrics are still front and center. Personally, it feels like BigMonitoring trying to stay relevant, but I’m not going to digress on this topic much. What really matters here is one spec has amounted to many specs, so even saying “OpenTelemetry” has no real meaning.
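As a rough illustration of "aggregate the duration of the relevant span segment", here is a minimal sketch that derives response-time numbers from finished spans. The `FinishedSpan` shape and the `durationStats` helper are assumptions made up for this example, and the p95 uses a simple nearest-rank method rather than whatever a real backend would do.

```typescript
// Hypothetical finished-span shape: a name plus start/end timestamps in milliseconds.
interface FinishedSpan {
  name: string;
  startMs: number;
  endMs: number;
}

interface DurationStats {
  count: number;
  avgMs: number;
  p95Ms: number;
}

// Derive a "metric" from span data instead of incrementing a separate counter.
function durationStats(spans: FinishedSpan[], name: string): DurationStats | null {
  const durations = spans
    .filter((span) => span.name === name)
    .map((span) => span.endMs - span.startMs)
    .sort((a, b) => a - b);
  if (durations.length === 0) return null;

  const total = durations.reduce((sum, d) => sum + d, 0);
  // Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
  const rank = Math.ceil(0.95 * durations.length);
  return {
    count: durations.length,
    avgMs: total / durations.length,
    p95Ms: durations[rank - 1],
  };
}
```

The count, average, and percentile all fall out of data the trace already carries, with no second instrumentation pass for metrics.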

Let’s talk about practicalities. Sentry wants you to have portable instrumentation. Sentry also wants you to have the best data quality possible. The former is pretty easy frankly, so I’m not going to focus on it, but third party instrumentation is where many systems break down. OpenTelemetry, specifically its tracing abstractions, has actually done a really good job in some ecosystems of providing baked-in instrumentation for libraries. Node.js is the one we’re going to focus on today, but it’s worth noting that while some ecosystems have good implementations of OpenTelemetry, elsewhere it’s largely non-existent or sees little adoption.

So going back to our goal of great data quality, Sentry made the decision that we should be compatible with OpenTelemetry. That is, we want to support the effort of great third-party instrumentation, but that’s not enough for us. We think there’s inherent value in making everything trace-connected. Yes, spans are great, but you know what else is great? Crash reports, session replays, and a bunch of other kinds of telemetry that we’ve all yet to realize matters. Those are not spans, and they certainly aren’t logs or metrics. Our goal is to make that possible, but to do that, we need two things:

  1. Trace propagation
  2. Span instrumentation (which is key to propagation)

For a couple of years now we’ve been experimenting with how to achieve this, how to support OpenTelemetry users, but without becoming yet another generic vendor who just accepts a log drain of junk data. That’s simply not what we want to be, and it’s not what we hear our 100,000+ customers needing. So we ended up with our current generation JavaScript SDK, which piggybacks on top of OpenTelemetry, ideally giving you the best of both worlds.

What I mean by that is, as a customer, you should be able to do this:

import { startSpan } from "@opentelemetry/sdk";

function shittyChatBot() {
    startSpan("some.operation", (span) => {
        doTheThing();
    });
}

That is, you should be able to use a shared specification to instrument your code for portability, and third party vendors should be able to do the same. We’re in favor of that future. Unfortunately it has not been going super well, and that is why I’m constantly complaining. I hear the complaints of our customers, I see an outcome they want, and I’m struggling to help them get to it.

We as a vendor have what I would simply describe as courage: we have no fear of solving whatever problem gets in our way. We don’t expect others to do our work for us, we control the ball. That means we’re willing to re-invent all instrumentation if that’s what it takes, but we’d prefer not to, and we still want customers to have portability in instrumentation.

In particular, we have an extremely well adopted and powerful SDK, and we want our customers to be able to get both the advantages of our SDK - remember we want everything trace connected, not just whatever is in this design committee - but also not be prevented from adopting OpenTelemetry. That means what we actually want is a way to say “hey OpenTelemetry SDK, give us all the current spans in the buffer”. What we don’t want is to be forced to implement some hopeful-standardized transport protocol that may or may not solve our concerns, let alone one that requires us to adopt legacy infrastructure telemetry like logs and metrics.

So to do that we piggyback on the SDK, trying to use their context resolvers, baggage handlers, and most importantly, the span instrumentation code.

Except it doesn’t work for customers. It doesn’t work for the same reasons a lot of other things don’t work: version conflicts, specification incompatibilities, and generally speaking, code that tries to do too much. This headache also comes in regular conversation. “We’re supporting OpenTelemetry” is immediately confusing as we don’t do logs, or traceless-metrics.

Alright, enough complaining, here’s what I’d like to see.

First, we have a tracing-focused SDK that is as lightweight as possible. That means it entirely focuses on giving customers and library authors the ability to instrument their code with span annotations. Not metrics. Not logs. Let that be a different spec’s problem. That API probably continues to look similar to today:

import { startSpan } from "tracing";

function shittyChatBot() {
    startSpan("some.operation", (span) => {
        doTheThing();
    });
}

Second, that code, as much as possible, does not add to your runtime performance or bundle size unless you’ve opted in to a method of collection. That collection should not be a wire protocol, but an application interface. For example, you should still be able to wire up an OpenTelemetry collector as you do today:

// Hypothetical API: collection is something you opt in to explicitly.
import { newDomainSpecificCollector } from "@opentelemetry/api";

newDomainSpecificCollector({
    endpoint: "http://not.sentry/otel-ingest",
});

Lastly, all those other non-tracing concerns? Put them wherever you want, but just like the transport protocols, take them out of the core.

If this were to happen, Sentry could then completely ignore the OpenTelemetry wire protocols, collectors, and everything else that frankly doesn’t matter to our customers or our product. That means something would have to be exposed to handle it:

import { onSpanFinish } from "tracing";
import { collectSpan } from "@sentry/node";

onSpanFinish.addEventListener(collectSpan);
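To show how small such a core could be, here is a rough sketch of a `tracing` module with exactly those two exports. Everything in it is hypothetical: the `Span` shape, the shared `NOOP_SPAN`, and the in-memory stack used to track the active span are assumptions for illustration. It only handles synchronous callbacks, where a real implementation would need proper async context propagation.

```typescript
// Hypothetical sketch of the proposed "tracing" package; none of this is a real API.
interface Span {
  name: string;
  traceId: string;
  spanId: string;
  parentSpanId: string | null;
  startMs: number;
  endMs: number | null;
}

type SpanListener = (span: Span) => void;

const listeners: SpanListener[] = [];
const activeSpans: Span[] = []; // stack of spans currently running (sync code only)

// Shared placeholder handed to callbacks while nobody is collecting.
const NOOP_SPAN: Span = {
  name: "", traceId: "", spanId: "", parentSpanId: null, startMs: 0, endMs: 0,
};

const onSpanFinish = {
  addEventListener(listener: SpanListener): void {
    listeners.push(listener);
  },
};

// Not cryptographically strong; good enough for a sketch.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

function startSpan<T>(name: string, fn: (span: Span) => T): T {
  // No collector has opted in: run the callback and allocate nothing.
  if (listeners.length === 0) return fn(NOOP_SPAN);

  const parent = activeSpans.length > 0 ? activeSpans[activeSpans.length - 1] : null;
  const span: Span = {
    name,
    traceId: parent ? parent.traceId : randomHex(32),
    spanId: randomHex(16),
    parentSpanId: parent ? parent.spanId : null,
    startMs: Date.now(),
    endMs: null,
  };
  activeSpans.push(span);
  try {
    return fn(span);
  } finally {
    activeSpans.pop();
    span.endMs = Date.now();
    // Collection is an application interface: hand the finished span to listeners.
    for (const listener of listeners) listener(span);
  }
}
```

With something like this, a vendor's `collectSpan` is just a listener, and instrumented code costs close to nothing until one is registered.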

That’s it. That’s my proposal. I’m sure I’m glossing over a bunch of things, but I’m not trying for academic correctness. Consider this feedback to the “OpenTelemetry Head of Product”, from a customer who would love to use and recommend your product, but cannot.

The world I’m describing is a world Sentry would get behind both as an implementer, and as a financier. I’d love to go to our board and tell them we need to allocate $10m in funding to support library authors implementing and maintaining a true standard, one adopted across the board, but today we cannot make that bet. So instead we are continuing to hedge our bets between maintaining our own instrumentation and the OpenTelemetry committee somehow becoming successful. Maybe that’s fine, but it mostly seems like a big distraction that doesn’t help our customers.

I want to close with one last point: it’s ok that people have problems they want to solve, and it’s ok that they work as a group to solve them, but you’re not going to see adoption of a product if it’s solving problems that someone doesn’t have. Sentry does not have the problem of “extracting metrics from AWS CloudWatch”, and I would encourage implementers to stop conflating concerns that are not tightly coupled to the problem being addressed.

There are a lot of things OpenTelemetry tries to do, and some it does well, but in general it is plagued by too many opinions and too many goals. Maybe bring back OpenTracing?
