The Cocktail Embedding
I got into cocktail making during the pandemic lockdowns. I have a little library of cocktail books, pretty heavy on the Death and Company books, plus assorted Tiki nonsense. Once you make a bunch of cocktails, you start to get an idea of what sort of structure they have, and so where you can make substitutions. For instance, if there’s a sour/acid component like lemon juice, there’s likely to be an approximately equal volume of sweetener like orgeat or simple syrup. Another thing is that there are common backbones with embellishments on them to make different drinks. A Margarita is tequila, triple sec, and lime juice. If you don’t have tequila, use cognac and you have a Sidecar. A Daiquiri is pretty similar, but with rum as the spirit and simple syrup as the sweetener. Swap whiskey for the rum and you have a Whiskey Sour (although probably with lemon, not lime). Swap gin for the rum, and it’s a Gimlet. Put some mint and Angostura bitters in there, and you’re drinking a Southside (and it’s way better).
At any rate, there are certain regularities in the structure of drinks. Part of this is due to what’s palatable, part of it is probably fashion, and there is also history and economics in there. Saint Germain is lovely stuff, but it’s also relatively modern, so you won’t find it in old cocktail books. There also aren’t a ton of scotch cocktails because scotch has historically been viewed as A) not for mixing (aside from a few drops of Scottish spring water, maybe), and B) expensive. Not putting expensive ingredients in cocktails is for weirdos; the stuff was made to be drunk, so maybe drink it, ya goober.
However, none of this historical or cultural context can be rolled into questionable machine learning plans and schemes, so I’ve recently been pillaging cocktail recipe websites with web spiders. I have thousands of recipes, and I’m planning to use them as a data set to train ML models to do a couple of different things.
One basic thing is clustering. Cocktails can be imagined as points in a high-dimensional space. The number of dimensions is quite large, because each ingredient is a dimension, and the amount of each ingredient is the distance from the origin along that dimension. One thing I’m probably going to have to do in the interest of comparison is normalize the units, since some recipes are in ounces, some are in milliliters, and some ingredients are in amounts like “barspoons” and “dashes”. That transformation will put them all in the same coordinate space. Normalizing each drink so that its ingredients are expressed as proportions of the total volume then converts them to points on the unit simplex (the proportions are nonnegative and sum to one). Because Euclidean distance is pretty ill-behaved in high-dimensional spaces, cosine similarity will probably be a better measure for clustering, but it will be interesting to see what comes out of that process.
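As a rough sketch of what that representation and similarity measure look like (the ingredient list and amounts below are invented for illustration, not real specs):

```python
import numpy as np

# Each row is a cocktail, each column an ingredient dimension, and each value
# an amount in ounces. The numbers are placeholders, not real specs.
ingredients = ["rum", "gin", "lime juice", "simple syrup"]  # column order
amounts = np.array([
    [2.0, 0.0, 1.0, 0.75],   # something Daiquiri-shaped
    [0.0, 2.0, 0.75, 0.75],  # something Gimlet-shaped
    [1.5, 0.0, 0.75, 0.5],   # another rum sour
])

# Normalize each drink so its ingredients are proportions of the total volume.
proportions = amounts / amounts.sum(axis=1, keepdims=True)

# Cosine similarity between drinks; this ignores overall drink size.
unit_vectors = proportions / np.linalg.norm(proportions, axis=1, keepdims=True)
similarity = unit_vectors @ unit_vectors.T
print(np.round(similarity, 2))  # 1 - similarity can feed a clustering algorithm
```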
Before I can do any of that, though, I have to clean up the data, and that’s turning out to be its own interesting problem. One aspect of it is that liquors are themselves kind of complicated. Gin, for example, is pretty variable in taste. Hendrick’s Gin is fine stuff, but if you don’t want your drink to taste like cucumber, don’t use it. Most gin has piney/citrus notes, but the other herbs and spices in it cause a lot of variation around that theme. Some cocktail recipes (specs, in the industry) call for a specific spirit. Some websites complicate this by letting you enter your bar’s contents and then substituting what you already have for analogous spirits in the recipe. This works for some spirits, sort of, but it has limits.
As a result, one of the more complicated parts of normalizing the data set is going to be figuring out where the boundaries are between types of alcohol that can be substituted for each other, and types that can’t. This is a form of dimensionality reduction, because collapsing all of, for example, the different brands of gin into a single “gin dimension” will remove dimensions from the dataset. More importantly, it removes dimensions that are effectively the same dimension, but would be extremely sparse, with each brand probably only represented a handful of times.
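A minimal sketch of that collapse, assuming the data is already laid out as a recipe-by-ingredient table (the brand columns and amounts here are invented):

```python
import pandas as pd

# One column per brand, one row per recipe; most brand columns are zero for
# most recipes, which is the sparsity problem.
df = pd.DataFrame({
    "Hendrick's Gin": [1.5, 0.0, 0.0],
    "Beefeater Gin":  [0.0, 2.0, 0.0],
    "Plymouth Gin":   [0.0, 0.0, 2.0],
    "lime juice":     [0.75, 0.75, 1.0],
})

# Assumed mapping from brand columns to a single collapsed category dimension.
brand_to_category = {
    "Hendrick's Gin": "gin",
    "Beefeater Gin": "gin",
    "Plymouth Gin": "gin",
    "lime juice": "lime juice",
}

# Rename columns to their category and sum, so three sparse gin columns
# become one dense "gin" column.
collapsed = df.rename(columns=brand_to_category).T.groupby(level=0).sum().T
print(collapsed)
```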
The problem is, not all spirits in a class are analogous. Kentucky bourbon and Islay scotch are exactly the same thing, in the sense that they are fermented from a grain mash, distilled to somewhere around 80% alcohol, and then barrel aged. Say that in Scotland or Kentucky, though, and you best start running. Bourbon requires 51% or more corn in the mash bill (the list of grains in the mash), and is aged in new oak containers charred on the inside. Once a barrel is used, it can’t be reused for bourbon, so there are a lot of them going cheap. This is why you can’t throw a bottlecap without hitting a local microbrewery producing bourbon-barrel-aged something-or-other. Legally, whiskey imported from outside the USA can’t be sold as “Bourbon”. Scotch mash starts with malted barley, but then breaks down further into single malts, blended malts, and a bunch of other categories, depending on how the mash was set up and whether different whiskies were blended after aging. You don’t have to char the barrels, and you can reuse them. In fact, some scotches are aged in barrels that previously held other things, like sherry or port. As a result of all this messing around, the resulting spirits taste wildly different. Maker’s Mark has a sweet, honey/floral and vanilla thing going on, while Laphroaig tastes like you’re watching an evening storm roll in on a cold, seaweed-strewn ocean beach while someone sets tires on fire to keep warm. Even within scotches there’s a ton of variation, so if a spec calls for Cragganmore rather than something like Lagavulin, it’s aiming for a particular flavor profile.
My instinct at this point is to split or lump scotches by region, other whiskeys by a bourbon/rye split, and tequilas by age (blanco/reposado/añejo). Gins are even weirder, so I’m not sure what to do there, aside from keeping flavored ones (usually orange or lime, although Hendrick’s cucumber also counts) distinct from “plain” ones. Rums are probably the weirdest of all, despite all being distilled sugarcane spirit: there are different styles of still, different aging processes, and a large number of fairly weird flavored things out there, like Malibu (which I might just delete on sight for being vile). Vodka is boring, since it’s largely required to be flavorless, although there are texture and smoothness aspects to it. Where it’s going to be a problem is, again, the flavored nonsense.
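As a sketch of how those split/lump rules might be encoded: the keyword patterns and category labels below are my guesses, not a settled taxonomy.

```python
import re

# Ordered rules: the first pattern that matches wins.
CATEGORY_RULES = [
    (r"islay|laphroaig|lagavulin",                "scotch (islay)"),
    (r"speyside|cragganmore|macallan",            "scotch (speyside)"),
    (r"scotch|single malt",                       "scotch (other)"),
    (r"bourbon|maker'?s mark",                    "bourbon"),
    (r"\brye\b",                                  "rye whiskey"),
    (r"a[nñ]ejo",                                 "tequila (añejo)"),
    (r"reposado",                                 "tequila (reposado)"),
    (r"tequila|blanco",                           "tequila (blanco)"),
    (r"hendrick'?s|cucumber|orange gin|lime gin", "gin (flavored)"),
    (r"\bgin\b",                                  "gin"),
]

def categorize(ingredient: str) -> str:
    """Map a raw ingredient string to a canonical spirit category."""
    name = ingredient.lower()
    for pattern, category in CATEGORY_RULES:
        if re.search(pattern, name):
            return category
    return name  # non-spirit ingredients pass through unchanged

print(categorize("Laphroaig 10 Year"))    # scotch (islay)
print(categorize("Casamigos Reposado"))   # tequila (reposado)
print(categorize("Hendrick's Gin"))       # gin (flavored)
```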
In terms of tackling the project, the first thing I’m going to do is get the various files into a standard format for representing ingredients and their amounts. At that point, I can also standardize the amount units, probably to ounces, since that’s what my jiggers are calibrated in.
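The unit conversion itself is just a lookup table; the sizes of a “dash” and a “barspoon” vary by source, so those two factors below are assumptions rather than settled facts.

```python
ML_PER_OZ = 29.5735  # millilitres per US fluid ounce

# Conversion factors to ounces. "dash" and "barspoon" are approximations.
TO_OUNCES = {
    "oz": 1.0,
    "ml": 1.0 / ML_PER_OZ,
    "cl": 10.0 / ML_PER_OZ,
    "barspoon": 5.0 / ML_PER_OZ,   # roughly 5 ml
    "dash": 1.0 / ML_PER_OZ,       # roughly 1 ml
    "tsp": 5.0 / ML_PER_OZ,
    "tbsp": 15.0 / ML_PER_OZ,
}

def to_ounces(amount: float, unit: str) -> float:
    """Convert a single ingredient amount to ounces."""
    return amount * TO_OUNCES[unit.lower()]

print(to_ounces(22.5, "ml"))   # about 0.76 oz
print(to_ounces(2, "dash"))    # about 0.07 oz
```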
Once everything is standardized in representation, I can unify the datasets and create a derived dataset with the amounts converted to proportions, but this raises another problem: cocktail names. That is actually two problems: variations of a cocktail, and people reusing names for different drinks. There is a cocktail in the Savoy Cocktail Book called an Aviation that has most of the same ingredients as the Death and Company Aviation, but the amounts are different, and the D&C Aviation is a far superior drink. I fully expect, however, that there are entirely different drinks with the same or very similar names in the dataset. At the moment, I’m thinking that the solution is not to treat names as special, and especially not to use them as keys in any kind of data structure.
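One way to encode that decision is to give every recipe a generated ID and treat the name as just another field, so two different Aviations can coexist. A sketch of the record format (the amounts below are placeholders, not the actual Savoy or Death & Co specs):

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class Recipe:
    name: str                      # display only, never used as an identifier
    source: str                    # which site or book it came from
    ingredients: dict[str, float]  # canonical ingredient -> amount in ounces
    recipe_id: str = field(default_factory=lambda: uuid4().hex)

    def proportions(self) -> dict[str, float]:
        """Derived view with amounts as fractions of the total volume."""
        total = sum(self.ingredients.values())
        return {ing: amt / total for ing, amt in self.ingredients.items()}

# Placeholder amounts, not the real specs from either book.
savoy = Recipe("Aviation", "Savoy", {"gin": 1.5, "lemon juice": 0.75, "maraschino": 0.5})
dandc = Recipe("Aviation", "Death & Co", {"gin": 2.0, "lemon juice": 0.75, "maraschino": 0.5})

recipes = {r.recipe_id: r for r in (savoy, dandc)}  # keyed by ID, so both Aviations survive
```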
What I Want to Do with the Data
One thing that a statistical model of cocktail ingredients is good for is reverse-engineering the ingredient amounts from cocktails at a bar. Bars will typically list the ingredients, but not the amounts, and so if you want to replicate the drink at home, you’re out of luck. Given a list of ingredients, it should be possible to predict the likely proportions of them in a cocktail. As mentioned previously, sour/acid components frequently have a balancing sweet ingredient and so on. I suspect that the solution to this is some kind of Bayesian constraint solver, where it has to pick out the amounts that are most likely, conditioned on the other amounts and ingredients, and with the constraint that the total volume doesn’t go over what you can fit in a cocktail glass. If you want to drop the constraint, just work with proportions and then put in 1oz of base spirit and calculate the others from there.
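A much cruder baseline than the Bayesian constraint solver described above, but useful as a sanity check: average the proportions of dataset recipes built from exactly the same ingredient list, then renormalize. A sketch, assuming recipes are already stored as proportion dicts (the function name and signature are mine):

```python
import statistics

def estimate_proportions(target: list[str],
                         dataset: list[dict[str, float]]) -> dict[str, float]:
    """Average the proportions of recipes built from exactly these ingredients."""
    matches = [r for r in dataset if set(r) == set(target)]
    if not matches:
        raise ValueError("no recipes with exactly these ingredients")
    avg = {ing: statistics.mean(r[ing] for r in matches) for ing in target}
    total = sum(avg.values())
    return {ing: p / total for ing, p in avg.items()}  # renormalize to sum to 1

# To turn proportions back into amounts, fix the base spirit at, say, 2 oz and
# scale the other ingredients accordingly, as described above.
```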
Another possible application is finding the drinks that don’t exist, but should, or the drinks that really really shouldn’t exist. A drink that doesn’t exist but should is one whose hypervector representation puts it in a void in a cluster. I think the representation of the problem there is that it should maximize dissimilarity with everything in the cluster, while not being so unlike them that it ends up in another cluster. Drinks that shouldn’t exist are ones that maximize distance from any cluster, subject to the constraint that the total volume is still reasonable. The volume constraint is because one gallon of lime juice with a pint of bitters is one way to end up very far from all other drinks along the “lime juice” dimension. Another constraint is that a drink probably shouldn’t have more than 5-8 ingredients in it, although the exact number is something I will probably derive from the data.
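One brute-force way to go looking for the drinks that really shouldn’t exist, short of a real optimizer: sample random candidates on the proportion simplex with a capped ingredient count, and keep the one whose nearest existing drink is farthest away. A sketch under those assumptions (function and parameter names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_unlike_anything(X: np.ndarray, max_ingredients: int = 6,
                         n_samples: int = 10_000) -> np.ndarray:
    """Random search for a drink far from every existing recipe.

    X is the existing recipe matrix in proportion form, one row per drink.
    """
    n_dims = X.shape[1]
    best, best_score = None, np.inf
    for _ in range(n_samples):
        candidate = np.zeros(n_dims)
        chosen = rng.choice(n_dims, size=max_ingredients, replace=False)
        candidate[chosen] = rng.dirichlet(np.ones(max_ingredients))  # sums to 1
        # Similarity to the closest existing drink; minimizing this pushes the
        # candidate away from every cluster at once.
        nearest = max(cosine_sim(candidate, row) for row in X)
        if nearest < best_score:
            best, best_score = candidate, nearest
    return best
```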
An even sillier proposition is to train a neural network to produce names from recipes, or vice versa. This is silly because it creates a latent space that continuously maps sequences of English letters to plausible combinations of cocktail ingredients, and so can also create a cocktail for “England” or “Beatrix Potter”. It would be interesting to see whether a “Southcar” or a “Sideside” were more similar to each other or to the Southside and Sidecar. Going the other way is also amusing, because it can name cocktails based on ingredients. I’m slightly curious how the name of a Margarita changes as you add increasing amounts of blackberry liqueur to it.