‘food-item’ search using recipe embeddings: a simple embedding-based search engine using Gensim, fastText and ElasticSearch

Arnab Borah
Towards Data Science
8 min read · Jan 4, 2021


This is an introductory lesson in building a search ML product: we’ll use a tool (Gensim) to train a language model (fastText), index the data onto a scalable search infrastructure (ElasticSearch), and write custom search functionality to try out embedding-based search.


Purpose

In this tutorial we’ll learn how to create a simple search application for a particular use case. Search is ubiquitous: every app has multiple search bars and algorithms, each serving a different purpose. Imagine you’re creating a food delivery app, or more specifically a cooking app, where you want to let users post recipes and search your inventory in real time; you therefore want a search bar on the home page, the first one a user encounters when they come to your platform.

We’re doing this as a tutorial to learn. The code is available in my GitHub.

Notebook(s): https://github.com/arnab64/food-search-recipe-embeddings/tree/main/src

Goal

Display a list of food-items from the inventory, given a query, in decreasing order of similarity.

  • Engineering: learn how to train a custom language model (Gensim) and quickly deploy it on scalable search infrastructure (ElasticSearch).
  • Data: learn the intricacies of handling text data in the food/restaurant domain.
  • Product / Science: learn how to evaluate the goodness of the solution, and what kind of architecture or additional, improved embeddings we could generate to perform better on the metrics. [upcoming tutorial]

Data

Data source: The dataset is a public-domain dataset sourced from Kaggle: 6000 Indian Food Recipes Dataset.

Data exploration and preprocessing: All the necessary preprocessing has been done on the text fields (a rough sketch follows below), both to train the embeddings and to use them; more details are available in the notebook. I mainly used the two columns Ingredients and Recipe to train word vectors on them.

Stack used: Gensim, ElasticSearch, Pandas
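As a rough illustration of that preprocessing, here is a hypothetical simplification (the exact steps are in the notebook):

```python
import re

def preprocess(text):
    """Minimal cleanup for a recipe/ingredients string:
    lowercase, strip punctuation, collapse whitespace, tokenize."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()

print(preprocess("Heat 2 tbsp oil; add onions, garlic & tomatoes."))
# ['heat', '2', 'tbsp', 'oil', 'add', 'onions', 'garlic', 'tomatoes']
```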

Cuisine histogram

recipeEmbeddings: a fastText language model on food recipes, built using Gensim
Since this dataset is a recipe dataset, we can train a linguistic model on the recipes. Dishes are the result of executing a sequence of steps using certain ingredients. In fact, recipes have an inherently sequential structure, which makes them well suited for sequential tasks on food data.

Here, we are trying to build a food/dish-suggestion application, and we want embeddings suited to that. We are trying to suggest dishes, and we have the recipe of each. The input field is therefore already sequential in nature, and the output we want is a list of dishes in decreasing order of similarity. We can train embeddings on the recipes and then represent each food-item/dish using the embeddings of its constituent ingredients or recipe. We refer to these as food-item-embeddings.

Because each food item is uploaded to the website by the vendor only once, and its context doesn’t change afterwards, these embeddings can be precomputed once for every dish (and recomputed whenever we have a new language model) and indexed for faster retrieval.

For our purpose, we train a fastText model on the recipe column. For more details about training language models and to learn more about word2vec and fastText embeddings, check out this article.
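A minimal training sketch using Gensim’s 4.x API; the recipe_tokens variable and the hyperparameters are illustrative assumptions, not necessarily the exact values used in the notebook:

```python
from gensim.models import FastText

# recipe_tokens: list of tokenized recipes (assumed prepared earlier),
# e.g. [["heat", "oil", "add", "onions", ...], ...]
model = FastText(
    sentences=recipe_tokens,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=3,       # ignore very rare tokens
    epochs=10,
)

# Inspect the neighbourhood of a term, as in the "paneer" example below
print(model.wv.most_similar("paneer", topn=10))
```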

For details on my implementation of all the stuff mentioned here check out my repository: arnab64/food-search-recipe-embeddings

Terms most similar to “paneer” (cottage cheese) in the trained LM

Now, what is our task?

  • suggesting a dish? : no explicit query, we can use user-embeddings based on their past orders
  • searching for a dish? : explicit query provided, create query-embeddings and suggest based on the distance

In this report I have performed the second task mentioned, i.e. given an explicit query at runtime, I want to use the embeddings to suggest food-items.

Some results for similar food-items (or dishes, as we call them here) can be seen below. The training of the fastText model was very simple and done in Gensim.

Similar dishes based on the trained embeddings: 1) Homemade Easy Gulab Jamun, 2) Kashmiri Style Chicken Pulao

How do we get to food-item-embeddings from word vectors?

A recipe is a collection of instructions, and an instruction is a collection of words (ingredients, actions, etc.). We have trained our language model to learn n-dimensional representations of each word.

recipe-vectors: a simple average over the word vectors of the words in the recipe; I did not spend much time on more elaborate pooling schemes.
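A minimal sketch of that averaging, assuming the fastText model from the training step above (fastText’s subword information means even out-of-vocabulary tokens still get a vector):

```python
import numpy as np

def recipe_vector(tokens, model):
    """Average the fastText word vectors of all tokens in a recipe."""
    vectors = [model.wv[t] for t in tokens]
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)
```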

You could do the same thing on just the title of the food-item, but a title does not provide as much information about the food as a recipe does, so it might not be as useful. I tried out other methods as well, but I’m not covering them here to keep it short.

Added food-item-vectors based on recipeEmbeddings to the existing data

The newly computed vector for every food-item, based on its recipe and ingredients, has been added as an additional column in pandas. For non-user textual features like food-items, it is best to precompute and store the embeddings, and update them when the data changes.
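Sketched with hypothetical column names (df and recipe_tokens are assumptions, not the repo’s exact schema), adding the vectors is a one-liner in pandas:

```python
# Store each dish's embedding as a plain Python list so it serializes cleanly
df["recipe_vector"] = df["recipe_tokens"].apply(
    lambda toks: recipe_vector(toks, model).tolist()
)
```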

After we have the vectors in the dataset, we can use them together with the other existing features for a classification task; for example, we could try to predict the cuisine given a dish and thereby evaluate how good the embeddings are. But predicting cuisine doesn’t seem like an interesting problem to dedicate time to right now.

With fastText we are using only contextual information, not sequential information, i.e. these embeddings most likely cannot differentiate whether you added the onions after the tomatoes or vice versa. We could train a custom language model using attention-based methods to capture the sequential structure of recipes, but then we would have to train it in PyTorch or TensorFlow and use that model to produce the embeddings. fastText lets us train word vectors without wasting time; alternatively, we could use pre-trained word embeddings trained on billions of tokens and figure out a way to use them directly to power apps, saving engineering effort while achieving similar results.


Indexed the food dataset with the new dense vectors onto ElasticSearch

Instead, we are going to index the data into ElasticSearch and explore its features there. Most importantly, we wanted to build a search engine, and ElasticSearch makes that super easy and scalable; with dense-vector based operations, many smart and very fast applications can be built. ElasticSearch can handle almost all kinds of data, is massively scalable and fast, and makes it easy to deploy your search model to production and experiment with it.

Here, I indexed the data from pandas into ElasticSearch using the ElasticSearch client API in Python. I defined the schema in Python and wrote a custom ingestion function converting rows from pandas to JSON for all the columns needed in the index.
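A sketch of that ingestion with the elasticsearch-py client (7.x-style API); the index name, field names and vector dimension are assumptions consistent with the fastText sketch above, not the repo’s exact schema:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Mapping with a dense_vector field for the precomputed recipe embedding
mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "cuisine": {"type": "keyword"},
            "ingredients": {"type": "text"},
            "recipe_vector": {"type": "dense_vector", "dims": 100},
        }
    }
}
es.indices.create(index="recipes", body=mapping)

# Bulk-ingest rows from the pandas DataFrame as JSON documents
actions = (
    {
        "_index": "recipes",
        "_source": {
            "name": row["name"],
            "cuisine": row["cuisine"],
            "ingredients": row["ingredients"],
            "recipe_vector": row["recipe_vector"],
        },
    }
    for _, row in df.iterrows()
)
helpers.bulk(es, actions)
```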

Dense-vector based search on ElasticSearch

Once indexed, ElasticSearch makes the data searchable over an API. The embedding for the search query is derived in the same way as for an item’s recipe: we load the pre-trained recipe word model, get embeddings for the words in the processed search query, and take their average.
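A minimal sketch of that query flow using ElasticSearch’s script_score with cosineSimilarity (available in 7.3+); preprocess and recipe_vector are the helpers sketched earlier, and the index/field names match the assumed schema above:

```python
def search_dishes(query, model, es, topn=10):
    """Embed the query like a recipe (average of word vectors), then
    rank indexed dishes by cosine similarity to the query vector."""
    q_vec = recipe_vector(preprocess(query), model).tolist()
    body = {
        "size": topn,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.q, 'recipe_vector') + 1.0",
                    "params": {"q": q_vec},
                },
            }
        },
    }
    res = es.search(index="recipes", body=body)
    return [hit["_source"]["name"] for hit in res["hits"]["hits"]]

print(search_dishes("chicken tandoori", model, es))
```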

Observations

The following results were interesting to see and demonstrate a basic utility of the recipe data; it could be combined with other datasets to make the embeddings and models better. Recipes follow a fairly universal structure, both globally and in India. This is also a multilingual problem, as the same recipes may exist in different languages on the web, and embedding-based methods like these make it easier to build an application that works for any language, which is certainly helpful for apps in a multicultural market like India.

Search execution on Elasticsearch — example-1

My insights: The embeddings here are trained on a really small dataset (about 6,000 instances); real data would be much larger and would likely yield better, smarter embeddings. Even so, with this dataset we can build a workable search application that doesn’t look too bad.

Since we are directly using recipe-based vectors for every food item, I believe our application would work best for search queries that are essentially lists of ingredients, e.g. “flour oil bake tomato cheese olive oregano” mapping to pizza, or to breadsticks with tomato salsa.

For food-name search, we might have to train another model mapping queries to recipes, e.g. “pizza” → “four cheese and mushroom pizza”, “chicken pesto pizza”, “Cheesy muffin”.

Sample suggestions for query.

Here (example-3) we see that we have retrieved a lot of soup dishes across multiple cuisines.

Users don’t list ingredients as queries; they are more likely to search for the name of a dish. This method works for that too, but our embeddings aren’t optimized for it.

For example, the results for “chicken tandoori” look like this.

['Paprika Chicken Skewers',
 'Baked Paneer Corn Kebab',
 'Chicken Tikka Taco Topped With Cheesy Garlic Mayo',
 'Baked Fish Crisps (Fish Fry In Oven)',
 'Beetroot Chicken Cutlets',
 'Chicken Malai Kabab',
 'Potato Roulade',
 'Rosemary And Thyme Chicken',
 'Spicy Kiwi Salsa with Feta Cheese',
 'Crispy Vegetable Tempura']

The results aren’t bad: they are quite diverse, yet still broadly related to the query.

Hence, to be able to use food name based search, we’ll just have to add another model that maps from search queries to unique “food-recipes”, and maybe even use a sequential kind of training.

Evaluation and next steps

As of now I do not have a way to evaluate the goodness of these vectors, and I left the work at this point. But this was a good introductory learning experience for understanding the data science problems in apps dealing with food. I’m pretty sure the actual problems are a lot more complicated and at a much bigger scale.

As for next steps, I will try to combine other datasets, figure out better ways of embedding the information, and solve another task. Feel free to contribute or connect for exciting stuff.

Thank you!


