
CTA: Structuring Unstructured Data

I have a challenge. It’s one that’s somewhat specific to my project, but it’s also a very general challenge in the industry. In fact, the same challenge exists at Sentry: fingerprinting loosely structured data to deduplicate entities. Let me explain.

Sentry takes errors, highly structured ones, with lots of human curation (read: denoising and normalization), and then fingerprints them together. It does this through a series of rule-based heuristics that we have built and maintained over the years.

That same challenge exists all over the industry. For example, on my side project (Peated), I’m trying to ingest a bunch of data from third-party websites that contain information about bottles of whisky and different distilleries. That information is often entered by a human, and varies slightly from one vendor to another.

In both scenarios I want to take some fuzzy information, de-noise it, and output highly structured information. I’m trying to understand the state of the art here - could we leverage something like embeddings to take the human load off of building these rule-based matchers?

Peated is my sandbox, so here’s the exact problem.

I have a bottle label. The components within this label vary, and sometimes contain pure noise. My objective is to normalize that label into a set of structured attributes (e.g. turn it into a SQL row), and in particular to match reliably every time when the corresponding object already exists in the database.
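To make the "matching object already exists" half concrete, here is a minimal sketch of fingerprinting once the attributes are already structured. The `canonical`, `fingerprint`, and `find_existing` names, the in-memory `existing` dict, and the bottle id are all hypothetical stand-ins for a real database lookup, not how Peated actually does it.

```python
import hashlib
from typing import Optional


def canonical(value: str) -> str:
    # Fold case, curly apostrophes, and repeated whitespace so that
    # trivially different spellings collapse to the same string.
    return " ".join(value.replace("\u2019", "'").lower().split())


def fingerprint(brand: str, name: str) -> str:
    # The fingerprint is a hash of the canonicalized attributes.
    key = "|".join(canonical(part) for part in (brand, name))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Hypothetical stand-in for a unique index on a bottles table.
existing = {fingerprint("Macallan", "12-year-old Sherry Cask"): 1}


def find_existing(brand: str, name: str) -> Optional[int]:
    # Returns the (assumed) bottle id, or None if nothing matches.
    return existing.get(fingerprint(brand, name))
```

The lookup is only as good as the structuring step that feeds it: two vendors that describe the same bottle differently will still produce different fingerprints unless the label is normalized first, which is the hard part described below.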

Here’s an example, with three things I can easily identify as a human:

Macallan 12-year-old Sherry Cask

  1. The brand is “Macallan”.
  2. The age of the spirit is “12” years old.
  3. The label, or effectively the name of the bottle, is “Macallan 12-year-old Sherry Cask”. For the sake of my case, I segment the brand from the bottle name.

Now these segments are not always present, and they can contain even more information. Here’s another, more tedious variation.

Aberfeldy 18-year-old Single Malt Scotch Whisky

  1. The brand is “Aberfeldy”
  2. The age of the spirit is “18” years old.
  3. The spirit is classified as “Single Malt”
  4. It’s Scotch Whisky, which is noise.
  5. Because of the given components, I’d label the bottle as “Aberfeldy 18-year-old”.

There are some things to understand about this, since you can’t simply run an untrained model and expect it to produce the above.

First off, we know every brand. That means we’re already able to separate the label name from the brand name, which is a huge advantage. There might be new brands that we’ve never seen before, but we can safely say that in those cases we want a human to review.

Second, we know the majority of classifications. We know “Single Malt”, for example. Additionally, things like the age are highly predictable and only written in a few ways.

That means it’s actually fairly easy to parse the above, if you exclude the “Scotch Whisky” label noise. Those noise phrases are fairly limited in number and you could parse them out too, except that leads us to our next, more challenging example…
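As a rough illustration of the "easy" part, here is a minimal rule-based sketch that uses a known brand list, a known classification list, and a small set of age patterns. The lists, the `AGE_RE` pattern, and the `parse_label` and `ParsedLabel` names are assumptions made up for this example, not Peated’s actual code.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup tables; a real system would load these from the database.
KNOWN_BRANDS = ["Macallan", "Aberfeldy", "Old Trestle"]
KNOWN_CATEGORIES = ["Single Malt", "Bourbon"]
NOISE_PHRASES = ["Scotch Whisky"]

# Covers a few common spellings: "12-year-old", "12 Year Old", "12 years", "12yo".
AGE_RE = re.compile(
    r"\b(\d{1,2})[- ]?(?:year[- ]old|years?(?: old)?|yo)\b", re.IGNORECASE
)


@dataclass
class ParsedLabel:
    brand: Optional[str]
    age: Optional[int]
    category: Optional[str]
    remainder: str  # whatever is left once the known components are removed


def parse_label(label: str) -> ParsedLabel:
    text = label.strip()

    # Try the longest brand first so a longer brand wins over a shorter prefix.
    brand = None
    for candidate in sorted(KNOWN_BRANDS, key=len, reverse=True):
        rest = text[len(candidate):]
        if text.lower().startswith(candidate.lower()) and (not rest or rest[0].isspace()):
            brand, text = candidate, rest
            break

    age = None
    match = AGE_RE.search(text)
    if match:
        age = int(match.group(1))
        text = text[:match.start()] + " " + text[match.end():]

    category = None
    for candidate in KNOWN_CATEGORIES:
        found = re.search(re.escape(candidate), text, re.IGNORECASE)
        if found:
            category = candidate
            text = text[:found.start()] + " " + text[found.end():]
            break

    for phrase in NOISE_PHRASES:
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)

    return ParsedLabel(brand, age, category, " ".join(text.split()))
```

With these lists, `parse_label("Aberfeldy 18-year-old Single Malt Scotch Whisky")` yields the brand “Aberfeldy”, age 18, category “Single Malt”, and an empty remainder. The trouble is everything that ends up in `remainder` on messier labels, since a fixed noise list cannot say which of the leftover words matter.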

Aberfeldy 15-year-old Limited Edition Single Malt Scotch Whisky Finished in Napa Valley Cabernet Sauvignon Casks

Most of this is noise, but which parts? It very quickly gets to be a mess, and what you really want is the lowest common denominator of how humans would input that data, but with a few guaranteed decisions applied on top.
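One way to sidestep the "which parts are noise" question is to compare the whole label against names we already have and take the closest one. The sketch below uses character trigram counts with cosine similarity as a crude, standard-library stand-in for that idea. It is not a learned embedding, and the `trigrams`, `cosine`, and `best_match` names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def trigrams(text: str) -> Counter:
    # Pad so that the start and end of the string also form trigrams.
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(label: str, known_names: Iterable[str]) -> Tuple[float, str]:
    # Returns (similarity, name) for the closest known bottle name.
    query = trigrams(label)
    return max((cosine(query, trigrams(name)), name) for name in known_names)
```

A score like this only ranks candidates. It gives no guarantee, so it would need a threshold and a human review queue behind it rather than being trusted on its own.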

Beyond that there are a few other problems even with the basic heuristics, such as this one:

Old Trestle Double Barreled Bourbon

  1. The brand is “Old Trestle”.
  2. The spirit is classified as “Bourbon”.
  3. I’d say the bottle name is “Old Trestle Double Barreled Bourbon”. We don’t actually want to remove the spirit classifier in this case.

There are a LOT of labels that look like this, and it gets into even more problematic examples.

Ichiro’s Malt 2013

  1. The brand is “Ichiro’s Malt”.
  2. The Vintage Year is “2013”.
  3. The bottle name is “Ichiro’s Malt 2013”.

This is problematic because we’re (somewhat) subjectively deciding which components belong in the bottle name, and it’s an actual branching decision tree:

  • Is there a name if we remove all non-label components (vintage year, classifier, etc)? (the above would not be labelable without the vintage year)
  • Does the name make sense if we remove the classifier? (“Double Barreled” isn’t a bottle name on its own, as there’s also “Double Barreled Rye”.)
  • We always want to remove as many of these duplicate attributes as possible, otherwise bottle names become indistinguishable from each other.
  • Some inventories add huge amounts of junk to the label, so a pre-defined list of things to filter isn’t effective enough.
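The first two branches of that tree can be sketched as a small function over already-extracted components. The `DEPENDENT_MODIFIERS` set and the `bottle_name` signature are assumptions for illustration, and the brand is left out because it is stored separately.

```python
from typing import Optional

# Hypothetical list of modifiers that do not read as a name on their own.
DEPENDENT_MODIFIERS = {"double barreled"}


def bottle_name(
    remainder: str,
    age: Optional[int] = None,
    vintage: Optional[int] = None,
    category: Optional[str] = None,
) -> str:
    parts = []
    if age is not None:
        parts.append(f"{age}-year-old")
    if remainder:
        parts.append(remainder)
        # "Double Barreled" alone is ambiguous (Bourbon or Rye?), so keep
        # the classifier in the name for modifiers like this.
        if remainder.lower() in DEPENDENT_MODIFIERS and category:
            parts.append(category)
    # If nothing is left to name the bottle, fall back to the vintage year.
    if not parts and vintage is not None:
        parts.append(str(vintage))
    # An empty result means the label needs a human to look at it.
    return " ".join(parts)
```

For the examples above this gives “12-year-old Sherry Cask”, “18-year-old”, “Double Barreled Bourbon”, and “2013”. It does nothing for the last bullet: junk in `remainder` passes straight through, which is where hand-written rules stop scaling.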

So that’s where we are. I have a dataset from several different vendors, and you can see the highly variable inputs that come out of it. A human can easily look at those with the above decision tree (plus probably some branches I didn’t note) and find the discrepancies.

This is far enough outside of my wheelhouse that I don’t quite grok the art of the possible, but I believe the approach itself can apply to so many problem domains. If this problem interests you, I’d love to have your help. I don’t have a big budget, but I’m happy to try to compensate you for your time. Shoot me a DM on Twitter (presently known as X) if you’re interested in helping!
