cra.mr

The Scale of Monitoring

March 17, 2022

I wrote this email to the Sentry team after a few internal conversations. Originally I had intended to write a think piece for the Sentry blog, but at some point I got bored (I’m all about honesty!) and got distracted with other projects. You may find it a valuable read, and it provides some direction on how we’re thinking about the future of Sentry’s data model.

When I started the Sentry project I had a very simple goal: make sure I know when I’ve fucked something up for our customers. The best products are not the ones with the best technology, the best design, or the best price. They’re the ones where the customer experience - at every touch point - is a great one. We all make mistakes, and software is routinely broken, but how we react and respond to those situations is what sets us apart.

Since then Sentry has continued to grow in ambition. We’ve spent more than a decade iterating on our error monitoring product, extending it to other runtimes, adapting to new ways of building technology, and most importantly adapting to the scale of the internet and its continued growth. That scale is becoming an ever more pressing subject, and today I want to talk about how it is driving Sentry’s technology and product strategy, and what that means for our current and future products.

A History Lesson

When Sentry was first created the architecture was fully optimized to minimize cost in all areas. Most importantly, Sentry was founded on a sampling model, a value prop in which duplicate copies of errors were not worth the cost of processing and storing them. The technology was designed to store 1 in N copies of the same issue (our term for a unique error). For example, we would store the first 100 instances of an error, then 1 in 2, then 1 in 5, and eventually 1 in 1000. Suffice it to say, it allowed us to bootstrap the business to great success.
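As a rough illustration of that tiered "1 in N" idea, here is a minimal sketch. Only the "first 100" boundary and the 1 in 2, 1 in 5, and 1 in 1000 rates come from the description above; the tier boundaries at 1,000 and 10,000 and the name `should_store` are assumptions made up for the example, not Sentry's actual implementation.

```python
def should_store(seen_count: int) -> bool:
    """Decide whether to store the Nth occurrence (1-based) of an issue."""
    # (upper bound of tier, keep 1 in N). The boundary at 100 is from the
    # text; the 1_000 and 10_000 boundaries are assumed for illustration.
    tiers = [(100, 1), (1_000, 2), (10_000, 5)]
    for upper_bound, one_in_n in tiers:
        if seen_count <= upper_bound:
            return seen_count % one_in_n == 0
    # Past the last tier, fall back to the sparsest rate: 1 in 1000.
    return seen_count % 1000 == 0


# How many of the first 1,000 occurrences of one issue would be kept?
stored = sum(should_store(n) for n in range(1, 1001))
print(stored)
```

The storage cost per issue flattens out as an error repeats, which is the point of the model: the first occurrences carry most of the information.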

Over time we had more and more customers asking more complicated questions of Sentry, like “for this error, which combinations of tags were most common?”. That actually mattered to some of our most important customers, so we set out to solve it. Our solution was to create an efficient storage model that allowed us to archive and index every single event, allowing you to generate real time answers even if we didn’t curate them for you. While that solution has been great, it still only scales so far. Sooner or later sampling becomes a requirement of any large data system, and the growth of the internet continues to drive a lot of discourse around this in businesses like Sentry.

So ultimately we’re posed with the question: is it better to sample or to store everything? The answer is yes, to both.

The Art of Sampling

If you engage in discourse with a variety of other folks in the monitoring space you’ll find they hold very strong opinions, argued academically, on whether you need to sample or not. The problem is that it’s not an academic question, but one that’s tightly coupled to a number of qualities about you, your product, and your business. Unfortunately the industry has historically decided it knows best - or simply makes the choice that is most economical for itself - and that choice doesn’t always work out in your favor, even if you’re lucky enough to be told what it is. The same can be said of Sentry’s approach. Sampling was great for some customers when they (and we) cared about cost efficiency, but bad for others who had different needs, or a different budget.

If you’re a company like Walmart, maybe you only need a really small sample of data to understand where your problems are. If you’re Apple, maybe you rarely need any data because of an aggressive QA process. If you’re Twitter, maybe it’s simply not high enough value to capture that data for hundreds of millions of free users. The needs of every company are unique, and we should not settle for a generic solution when it comes to such an important piece of our business: the experience of our customers. Any product or tool that is deeply connected to the process of shipping your software - your business - should do exactly what you need it to do.

This is what we believe at Sentry, and while we recognize some other solutions in the space may work for people, we’ve always prided ourselves on building a product our customers love, in all shapes and sizes. More than 100,000 organizations trust Sentry’s error monitoring, and bringing our approach to other software health concerns, including performance monitoring, requires us to resolve the scale issue: the dilemma of sampling.

Being Dynamic

When we think about what our customers want we’re burdened with choice. I used to half jokingly state to investors and new hires that if you picked a name out of the virtual phone book that is tech companies, that company would be one of our customers. That has only become more true over the years, and it means our solution has to work for every style of business and application. One of Sentry’s core values is “For Every Developer” - we are here to build a product that is suitable, useful, and loved by every single developer. So when we applied that thinking to sampling we were left with only one option: to build a customizable, dynamic approach to sampling.

Our approach fundamentally focuses on flexibility. We think it’s important to give customers the choice of where they make their tradeoffs between fidelity and cost, and to let them revisit those choices dynamically as their needs change. What this means in practice is that the approach to sampling needs to allow simple choices to be made:

  • What’s the minimum amount of data I need? Is a fixed 1% rate of transactions enough?
  • When will I need more data? When we release a new version of the product do we need more samples?
  • Are some customers or scenarios more important? Should we bring more fidelity to our higher paying accounts, our enterprise customers?

These are the kinds of situations we have heard come up time and again with customers, and they’re all scenarios that we will enable with on-the-fly server-based sampling decisions. There’s a lot of complexity in how we pull this off, and we’ll be sharing more in the future on what is possible, where the limitations arise, and how we’re going to keep investing to overcome those.
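To make the three questions above concrete, here is a minimal sketch of what a server-side rule evaluation could look like: a fixed base rate, with rules that raise the rate for a new release or for enterprise accounts. Everything in it is hypothetical and chosen for illustration (the `Rule` type, the `plan`, `release`, and `trace_id` event fields, the release string, and the rates); it is not a description of how Sentry implements this.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]
    rate: float  # fraction of matching events to keep, 0.0 to 1.0


BASE_RATE = 0.01  # "Is a fixed 1% rate of transactions enough?"

# Hypothetical rules; a customer could change these at any time.
RULES = [
    Rule("enterprise-accounts", lambda e: e.get("plan") == "enterprise", 1.0),
    Rule("new-release", lambda e: e.get("release") == "2.0.0", 0.5),
]


def sample_rate(event: dict) -> float:
    """Highest rate among matching rules, else the base rate."""
    return max((r.rate for r in RULES if r.matches(event)), default=BASE_RATE)


def keep(event: dict) -> bool:
    """Deterministic decision: hash the trace id into [0, 1) and compare."""
    digest = hashlib.sha256(event["trace_id"].encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate(event)


print(sample_rate({"trace_id": "abc", "plan": "enterprise"}))
print(sample_rate({"trace_id": "abc", "release": "2.0.0"}))
print(sample_rate({"trace_id": "abc"}))
```

Hashing the trace id rather than drawing a random number is a design choice in this sketch: every service that sees the same trace reaches the same decision, and because the rules live on the server they can change without redeploying the application.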

While there may be a future when AI drives real decisions, technology (as a whole) is not there yet, so we will be ensuring that sampling decisions happen transparently, with customers in full control.

Packaging It

One of our goals with Sentry has always been transparency. It’s core to the open source tenets that empower Sentry’s product and business. Fortunately this has also become somewhat standard in our industry with the dominance of AWS’ utility-style pricing. When we launch sampling later this year we will keep with tradition - charge a fair price based on consumption, and keep it dead simple. We’ll do that by charging a small fee for every event processed on our server, and a slightly larger fee (similar to today) for any event that is stored and indexed.
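The arithmetic of that two-tier model can be sketched as follows. The fee values are invented purely for illustration (no actual prices are stated here), and the sketch assumes that a stored event also counts as a processed event, which the description above leaves open.

```python
from decimal import Decimal

# Hypothetical per-event fees in dollars; NOT real Sentry pricing.
FEE_PROCESSED = Decimal("0.00001")
FEE_STORED = Decimal("0.00005")


def monthly_cost(processed: int, stored: int) -> Decimal:
    """Assumes stored events are a subset of processed events and pay both fees."""
    if stored > processed:
        raise ValueError("cannot store more events than were processed")
    return processed * FEE_PROCESSED + stored * FEE_STORED


# 10 million transactions processed, 1% of them stored and indexed.
sampled = monthly_cost(processed=10_000_000, stored=100_000)
# The same traffic with every event stored.
everything = monthly_cost(processed=10_000_000, stored=10_000_000)
print(sampled, everything)
```

Under these made-up numbers the processing fee sets a floor on the bill, and the sample rate controls how much is added on top of it.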

To maximize the value out of discarded events, we’re going to be introducing Measures on top of the performance product. These will ensure you still retain high level visibility into your core health metrics. For example, we’ll give you full latency details (p95 et al.) whether you do or don’t store the events. Details are TBD, but we’re also going to bring custom metrics to transactions, ensuring you can track your other key transaction-coupled measurements.
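One way to picture metrics surviving the sampling decision: record each event's duration before deciding whether to store it. This is a minimal sketch under that assumption; `LatencyMeasure`, `process`, and the `duration_ms` field are hypothetical names, and keeping raw durations in a list stands in for whatever compact aggregate a real system would use.

```python
import math


class LatencyMeasure:
    """Collects durations from every processed event, stored or not."""

    def __init__(self) -> None:
        self.durations_ms: list[float] = []

    def record(self, duration_ms: float) -> None:
        self.durations_ms.append(duration_ms)

    def percentile(self, p: int) -> float:
        """Nearest-rank percentile, for an integer p in 1..100."""
        ordered = sorted(self.durations_ms)
        rank = math.ceil(p * len(ordered) / 100)
        return ordered[max(rank, 1) - 1]


def process(events: list[dict], measure: LatencyMeasure, store_every: int = 10) -> list[dict]:
    stored = []
    for index, event in enumerate(events):
        measure.record(event["duration_ms"])  # metric sees every event
        if index % store_every == 0:          # only a sample is stored
            stored.append(event)
    return stored


measure = LatencyMeasure()
events = [{"duration_ms": float(ms)} for ms in range(1, 101)]
stored = process(events, measure)
print(len(stored), measure.percentile(95))
```

Here the p95 is computed over all 100 events even though only 10 are kept, so the latency figure does not inherit the sampling error of the stored set.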

Now the immediate question you are likely asking is how will this work for errors? How will this work for our upcoming Profiling functionality (via Specto)? The answer is we don’t yet know. Our intent is to offer this technology universally with Sentry data streams, but we want to spend more time exploring the value it offers in each data stream.

Beyond Sampling

Data scale is at the core of the limitations that apply to products, so it’s where we are focusing the conversation this year. For example, when Sentry sampled data we were not able to provide the Discover-like experience we have today. That experience is extraordinarily powerful, but it’s also not without cost. We believe that by giving you real-time control over data sampling we’ll strike the right balance: you choose how valuable these data streams are to you, and we can still build an incredibly curated and versatile product experience.

That approach is how we built error monitoring - an experience that is now imitated by nearly every major APM vendor in the industry, both big and small. That experience has always focused on being curated. It’s always focused on us saying “we know what an error is, and we’re going to show you what matters”. Things like Discover are a departure from that, but an intentional one. Just as the sampling decision matters, so does the ability to leverage your data when you want it. That said, we recognize there are still other challenges with the “choose your own adventure” model, and performance monitoring à la distributed tracing is filled with these. We also recognize there is a large demand for fully utilizing the OTel format.

That said, I’ve already taken up enough of your time today, so that will be a conversation for next time.


© 2022 David Cramer