Note: This was an internal post I made at Sentry, but given this challenge is quite common at early stage startups, I’ve decided to publish it externally. This is an opinion and a decision on how we operate at Sentry. I won’t claim that we’re experts on this, but all companies are somewhat unique in their culture, their goals, and the methods they choose to adopt. This is just one of our choices.
If you’ve interacted in the product strategy sessions, you’ll find I’ve historically been opposed to A/B testing - to running behavioral experiments. Today we’re going to talk about why that is, and why we are moving away from A/B tests in product. Before we get to that, I want to talk about why making this decision matters, and how we’ve been down this road before.
Early in Sentry’s post-funding days there was a constant battle: should we sell on-premise Sentry or only offer SaaS? I, and much of the founding team, were opposed to doing on-prem. There were a variety of reasons, but what matters is our indecisiveness. At some point - far too late into these conversations - I said enough was enough. I made the decision that we would not commercialize on-premise Sentry, and that the decision was final. On-premise has never come up again, and we are better for it. That distraction, that paralysis to move forward, it risked stagnation. It created an inability to focus on what we needed to be doing.
This leads me to the conversation about testing. I see this same indecisiveness in many conversations around testing, along with the mistaken belief that testing is a treatment for the problem.
If you’re not familiar with A/B testing, it’s primarily used in performance marketing as a way to optimize for human behavior. For example, if someone’s searching for a “dog collar”, what language or visual is more likely to draw the individual’s attention? That is often hard to know and it varies greatly. Importantly, those variations at large scales can also have a meaningful impact. You sell both a happy, fun, bright pink dog collar and a moody, covered-in-spikes, black dog collar, and one of those is more likely to get the average individual’s attention. You may also find that there is variance in which one matters most based on the context. The reason this all matters is ultimately because you sell one million dog collars, and a lift of 1% could be tens of thousands in additional (or saved) revenue. Multiply those decisions, and you have the reason testing is so valued in performance marketing.
Now you might ask, why would I be opposed to such a thing within our product? There are many, many reasons I am opposed, but the one we should care about is how it fosters a culture of decision paralysis. It fosters a culture of decision making without having an opinion, without having to put a stake in the ground. It fosters a culture where making a quick buck trumps a great product experience. That goes wildly against our core values, how we built Sentry, and what we want Sentry to be. Pixels Matter, one of our core values, is centered around caring about the small details, and that by its very nature is subjective. What details? Which ones matter? Those decisions all center on taste, and around someone making a subjective decision.
While that single story is what is driving this push, I do want to touch on several other arcs where I have seen testing fail, create friction, or otherwise cause a general negative sentiment amongst folks.
Small targets drive small outcomes. An example might be targeting a small test segment of 100 customers to push Replay. If one customer adopting the product drove ACV of $300, that means we’re targeting a whopping $30,000 in added annual revenue. Contrast that with your salary and multiply that by 400. That’s a small target, a small outcome, and it’s not even guaranteed! On the counter side, you’ll note we’ve set our target for Replay adoption to 10% of our audience this year. That’s a big target, and it might not be easy to hit, but hitting it (or even getting close) will have a meaningful impact on our ARR.
Bringing this back to testing, often tests are focused on small targets, on the tiniest of incremental change. This is by design! It’s to ensure you can identify what is actually driving success. Unfortunately now we’re back to small outcomes. “Which text for the button will perform the best?” - well if only a few hundred people click that button in a month, the answer is it doesn’t fucking matter. Additionally a few hundred clicks is not going to be enough data to achieve a statistically significant result. While I realize this anecdote is extreme, it’s representative of the general problem.
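To put a rough number on "not enough data", here is a minimal sketch of the standard sample-size approximation for comparing two conversion rates with a two-sided z-test. The baseline rate, the hoped-for lift, and the function name are all hypothetical, chosen for illustration rather than taken from Sentry's data.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_sample_per_variant(baseline: float, relative_lift: float,
                                alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed in EACH variant to detect the given lift
    with a two-sided two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical scenario: 10% of users who see the button convert today,
# and we hope new button text improves that by a relative 10% (to 11%).
needed = required_sample_per_variant(baseline=0.10, relative_lift=0.10)
print(f"users needed per variant: {needed}")
```

Under those assumed numbers the answer comes out well above ten thousand users per variant, which is a long way from a few hundred clicks a month. Smaller lifts push the requirement even higher.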
Opportunity Cost. This is an economics theory that you’re likely familiar with. If we spend time doing one thing, it means we are not doing another. This is a key principle we apply in all decision making, but is compounded when it comes to testing. If you’re testing something you’ve intentionally decided to add cost, complexity, and time to the initiative. To make matters worse, testing requires you to have a control in place, and practically speaking that means you avoid making multiple changes within the same sphere of influence (for the duration of the test) to maximize correctness. That means you’ve doubled down on the time spent doing one thing vs another.
Alternatively, we simply make decisions based on the information at hand and measure the results. This means we unblock future development as quickly as we can, but it doesn’t mean we won’t act on the results. If the results perform negatively, we may need to try something else. When they do, you of course want to understand why, but fortunately there are still many ways to do post-analysis. Cohort analysis is particularly useful in this fashion. This is not to say cohorts are a superior technique, as they are generally going to be less accurate, but it’s usually enough for you to draw conclusions from, and if needed, you can bisect from there. Again, and most importantly, you’re not blocking additional development on the result of the test, which can often take a considerable amount of time, nor are you spending cycles waiting for information before moving on to the next project.
A good example of this in practice is Sentry Bundles. We launched the bundles to only new customers (this limits exposure and risk), and we’re able to measure the cohort of bundles vs prior cohorts of non-bundles. Are there other changes that might impact the results here? Sure, but that impact - from experience and raw data points - is minimal, especially because our period of observation is only 4 weeks long. You could do this with A/B testing as well, but what would you gain? At best an accuracy improvement, but more likely the data would be insignificant (we only get a few thousand paid customers per month in total). We could also A/B test different price points, but we’d rather have an opinion on what the right package and price point is, and prove if that’s going to work or not. If it doesn’t, we’ll take those learnings and iterate on a v2. Our focus is on narrative and customer experience, not min/max on the bottom line.
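As a rough illustration of that kind of cohort comparison, the sketch below measures a new cohort against the average of prior cohorts and uses the spread between prior cohorts as a crude noise floor. Every number, the metric, and the helper name are made up for the example. It is not how we actually measured bundles, and it is intentionally less rigorous than a controlled test.

```python
from statistics import mean, stdev

# Hypothetical data: (new paid customers, customers who hit the target metric
# within the observation window) for each prior monthly cohort.
prior_cohorts = [(2000, 240), (2100, 231), (1900, 247)]
new_cohort = (2000, 320)  # hypothetical cohort that saw the change


def compare_to_prior(prior, new, noise_multiplier=2.0):
    """Compare the new cohort's rate to the mean of prior cohort rates."""
    prior_rates = [converted / total for total, converted in prior]
    baseline = mean(prior_rates)
    noise = stdev(prior_rates)  # month-to-month variation without the change
    new_rate = new[1] / new[0]
    delta = new_rate - baseline
    return {
        "baseline": baseline,
        "new_rate": new_rate,
        "delta": delta,
        # Crude signal check: is the shift larger than normal variation?
        "outside_noise": abs(delta) > noise_multiplier * noise,
    }


result = compare_to_prior(prior_cohorts, new_cohort)
print(result)
```

The trade-off is the one described above: other changes during the window can leak into the delta, but you get a directional answer without holding back development for a control group.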
Experiments as a substitute for data. This issue really hits home as it has caused measurable financial consequences at Sentry. I’ve often seen this happen when people wrongly conclude that data only comes from measuring the outcome of a project, rather than also being learned from or informed by past experiences. This is where we begin to really articulate what taste ultimately is. This is one of the biggest reasons I am against A/B testing and the reason I push for people to have an opinion, and fortunately we have a recent example at Sentry that we can learn from.
Some months back we changed the New Project flow to create multiple projects instead of one. The thesis here was that customers have multiple projects, so we should prompt them to set them all up right away to improve expansion. If this triggers your spidey sense, it should! One of the pieces of feedback we had when this idea was pitched is that users generally are not working on multiple applications at the same time, so prompting them to set up multiple applications would increase friction. This feature was implemented, it was run through a controlled A/B test, and that test suggested the multi-project creation was better. Was the feedback wrong then?
Months later we noticed an activation issue, and lo and behold we found that a lot of accounts had many empty projects. Reasoning would suggest that yeah, obviously this would create issues, because instead of an account having one project, which is fairly easy to navigate and comprehend, accounts now had multiple projects, many of which weren’t set up. You combine that with the fact that most customers are on our Team plan, which does not allow cross-project search, and you can easily understand why the user experience is bad. Data is not a substitute for critical thinking. The test said it was successful, but the outcomes, which are what you base future decisions on, showed otherwise. The lesson here is not that A/B testing was at fault, but that running experiments is never a substitute for a vetted hypothesis.
A/B tests are useful to the extent that human behavior is aligned with a definition of correctness. The last issue I’ll touch on goes back to the purpose of A/B testing, and our misuse of it at Sentry. There are two common places I’ve seen us go wrong with this at Sentry. The first I mentioned several times above, where we’re commonly lacking enough data points to reach an accurate conclusion (very often true in Enterprise software, and will be true in most testing methodologies for us). The second is far more nuanced, but is key to why this is valuable in performance marketing and less so in engineering. A/B testing is testing human behavior, it is not testing correctness, yet we’ve tried to lean on this concept historically to validate if some code we’ve written is more correct.
A contrived example of wrongly using A/B tests would be to improve our issue fingerprinting behavior. Our goal with issues is to ensure the same error is always grouped together, but we sometimes get it wrong. Changing the way we fingerprint, however, is fairly scary, and it’s hard to measure whether it’s better or worse than the previous iteration. You might think A/B tests can help here, but a more correct approach is the one many in the ML space use: verification data. A really simplistic way to think about this is that you’d build a test data set - think about a huge spreadsheet of issues - and you’d have that spreadsheet tell you exactly which issues should and shouldn’t be grouped together. When you change the heuristics, the algorithm to fingerprint issues, you’d validate your results by measuring the accuracy of the grouping it produces for these sample issues against your verification data set (the target grouping of data).
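A toy version of that spreadsheet idea might look like the sketch below. The event fields, the labels, and both fingerprint functions are hypothetical and have nothing to do with Sentry's real grouping code. The score used is pairwise agreement (the Rand index): for every pair of events, does the fingerprint agree with the labels on whether the two belong together?

```python
from itertools import combinations
from typing import Callable, Dict


def pairwise_grouping_accuracy(events: Dict[str, dict],
                               expected_group: Dict[str, str],
                               fingerprint: Callable[[dict], str]) -> float:
    """Fraction of event pairs where the fingerprint and the labeled
    verification set agree on 'same issue' vs 'different issue'."""
    predicted = {event_id: fingerprint(ev) for event_id, ev in events.items()}
    pairs = list(combinations(events, 2))
    if not pairs:
        raise ValueError("need at least two events to compare groupings")
    agreements = sum(
        (predicted[a] == predicted[b]) == (expected_group[a] == expected_group[b])
        for a, b in pairs
    )
    return agreements / len(pairs)


# Hypothetical verification data: events plus the grouping a human decided on.
events = {
    "e1": {"type": "ValueError", "message": "invalid id 17", "frame": "parse_id"},
    "e2": {"type": "ValueError", "message": "invalid id 42", "frame": "parse_id"},
    "e3": {"type": "ValueError", "message": "invalid id 42", "frame": "load_user"},
    "e4": {"type": "KeyError", "message": "'name'", "frame": "load_user"},
}
expected_group = {"e1": "A", "e2": "A", "e3": "B", "e4": "C"}


def by_message(ev: dict) -> str:  # hypothetical "old" heuristic
    return f'{ev["type"]}:{ev["message"]}'


def by_frame(ev: dict) -> str:  # hypothetical "new" heuristic
    return f'{ev["type"]}:{ev["frame"]}'


old_score = pairwise_grouping_accuracy(events, expected_group, by_message)
new_score = pairwise_grouping_accuracy(events, expected_group, by_frame)
print(f"old: {old_score:.2f}  new: {new_score:.2f}")
```

Because the score depends only on the labeled data and the algorithm, you can rerun it on every change and compare iterations directly, with no user behavior in the loop.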
In real life it’s much more complex than that, but that is actual verification of a problem. If you A/B tested this, you’d introduce variability through subjective human behavior. It’s possible you’d randomly get lucky here, and it’s more likely you’d get lucky with a huge amount of data points, but it doesn’t mean the methodology is correct. It’s also importantly not measurable in a way that lends itself to repeatability - for example if you wanted to continue to iterate on algorithm improvements (and factually know that it’s more correct). You may grok that this method of verification is more complicated, more expensive, and that’s just a natural side effect (and requirement) of the scale of Sentry. tl;dr OpenAI is not A/B testing which LLM models work better.
“No A/B testing is certainly better than bad A/B testing”
Our objective at Sentry isn’t, and hasn’t ever been, to be great at running this kind of experimentation, which may explain why many of these attempts have been more failure-prone. That is why we are choosing to continue to prioritize taste at Sentry. That taste is curated through hiring, and comes from the team’s domain expertise, their diversity of background, and their learnings both building and using our product. It comes from talking with customers on Twitter, from engaging them in support tickets and on GitHub, it comes from direct transparent conversation.
This is the kind of data we use to inform our decisions here at Sentry. Sometimes those decisions, those subjective-but-informed decisions, will lead to a failure. Those failures help us make better decisions in the future. It means we are optimizing for iteration speed - another one of our core values. That willingness to iterate, to fail and move forward quickly, that is what drives great outcomes, and that is the culture we want at Sentry.
I’d like to leave you with one lesson I’ve kept with me over the years. The strength of making a decision is making it. You can always make a new one later. Choose the obvious path forward, and if you don’t see one, find someone who does.