
Build a recipe recommendation API using scikit-learn, NLTK, Docker, Flask, and Heroku

Pan Chuang AI Share

Author | Jackmleitch

Compiled by | VK

Source | Towards Data Science

My idea was this: given a list of ingredients, what different recipes can I make? In other words, what recipes can I make with the food I have in my apartment?

First, if you want to see my API (or use it!), follow the links below:

  • https://whats-cooking-recommendation.herokuapp.com/ - if you're in America
  • https://whatscooking-deployment.herokuapp.com/ - if you're in Europe
  • If you're somewhere else, either one works; it will just be a little slower

I apologize for the lack of polish; at some point, when I have time, I'll build a better app.


In my first blog post on this project, I went over how I collected the data: the recipes and their ingredients. Since then, I've added more recipes, bringing the total to 4,647. Please feel free to use this dataset; you can find it on my GitHub: https://github.com/jackmleitch/Whatscooking-

This article focuses on preprocessing the data, building the recommendation system, and finally deploying the model using Flask and Heroku.

The process of building the recommendation system is as follows:

First, the dataset is cleaned and analyzed. Then numerical features are extracted from the data, and on that basis a similarity function is used to measure the similarity between the ingredients of known recipes and the ingredients supplied by the end user. Finally, the best-matching recipes are recommended according to their similarity scores.

Unlike the first article in this series, this one is not a tutorial on the tools I used; rather, it describes how I built the system and why I made the decisions I did. That said, the code comments should explain most things well on their own. As with most projects, my goal was to create the simplest model that would do the job to the standard I wanted.


Building the recipe recommendation API

Preprocessing and analyzing the ingredients

To understand the task at hand, let's look at an example. The recipe for "Gennaro's classic spaghetti carbonara" on Jamie Oliver's website calls for the following ingredients:

  • 3 large egg yolks
  • 40 g Parmesan cheese
  • 1 x 150 g piece of higher-welfare pancetta
  • 200 g dried spaghetti
  • 1 clove of garlic
  • extra virgin olive oil

There's a lot of redundant information here; for example, the weights and measures add no meaning to the vector encoding of a recipe. If anything, they make it harder to distinguish between recipes, so we need to get rid of them. After a quick Google search, I found a Wikipedia page with a list of standard cooking measurements, such as clove, gram (g), teaspoon, and so on. Removing all of these words in my ingredient parser worked very well.

We also want to remove stop words from the ingredients. In NLP, "stop words" are the most common words in a language. For example, the sentence "learning about what stop words are" becomes "learning stop words". NLTK gives us an easy way to remove (most of) these words.
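As a quick sketch of what that looks like (NLTK's stop word list needs a one-off nltk.download('stopwords') first):

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

sentence = 'learning about what stop words are'
filtered = [word for word in sentence.split() if word not in stop_words]
print(' '.join(filtered))  # -> 'learning stop words'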

There are also some words that appear in so many recipes that they're useless to us. For example, oil is used in most recipes and does almost nothing to differentiate one from another. Besides, most homes have oil anyway, so having to type it into the API every time would be both tedious and pointless.

Simply removing the most common of these words seemed to work very well, so that's what I did. Occam's razor... To find the most common words, we can run:

import nltk
vocabulary = nltk.FreqDist()

# the ingredients have already been preprocessed
for ingredients in recipe_df['ingredients']:
    ingredients = ingredients.split()
    vocabulary.update(ingredients)
    
for word, frequency in vocabulary.most_common(200):
    print(f'{word};{frequency}')

However, we have one last hurdle to overcome. When we try to remove these "junk" words from the ingredient lists, what happens if the same word appears in different variants?

What if we want to remove every occurrence of the word "pound", but a recipe's ingredient list says "pounds"? Fortunately, there's a fairly simple solution: stemming and lemmatization. Both reduce an inflected word to a root form; the difference is that the result of stemming may not be a real word, whereas the result of lemmatization always is.

Although lemmatization is usually slower, I chose to use it because I knew having actual words would be very useful for debugging and visualization. When a user provides ingredients to the API, we lemmatize those words as well.
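To illustrate the difference (a minimal sketch; the lemmatizer needs a one-off nltk.download('wordnet') first):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('pounds'))          # 'pound'
print(stemmer.stem('ponies'))          # 'poni' -- not a real word
print(lemmatizer.lemmatize('pounds'))  # 'pound'
print(lemmatizer.lemmatize('ponies'))  # 'pony' -- an actual word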

We can put all of this into one function, ingredient_parser, along with some other standard preprocessing: removing punctuation, making everything lowercase, and normalizing the encoding.

import ast
import re
import string

import unidecode
from nltk import corpus
from nltk.stem import WordNetLemmatizer


def ingredient_parser(ingredients):

    # measures and common words to strip out (already lemmatized); lists truncated here
    measures = ['teaspoon', 't', 'tsp.', 'tablespoon', 'T', ...]
    words_to_remove = ['fresh', 'oil', 'a', 'red', 'bunch', ...]

    # the ingredient lists are stored as strings, so convert them to lists
    if isinstance(ingredients, list):
        ingredients = ingredients
    else:
        ingredients = ast.literal_eval(ingredients)

    # translation table that strips all punctuation
    translator = str.maketrans('', '', string.punctuation)

    # initialize nltk's lemmatizer
    lemmatizer = WordNetLemmatizer()

    ingred_list = []
    for i in ingredients:
        i = i.translate(translator)

        # split on hyphens as well as spaces
        items = re.split(' |-', i)

        # keep only alphabetic tokens (drops quantities such as "150g")
        items = [word for word in items if word.isalpha()]

        # lowercase everything
        items = [word.lower() for word in items]

        # normalize the encoding (e.g. accented characters)
        items = [unidecode.unidecode(word) for word in items]

        # lemmatize so we can compare word variants
        items = [lemmatizer.lemmatize(word) for word in items]

        # remove stop words
        stop_words = set(corpus.stopwords.words('english'))
        items = [word for word in items if word not in stop_words]

        # remove measure words/phrases, e.g. "heaped teaspoon"
        items = [word for word in items if word not in measures]

        # remove common, uninformative words
        items = [word for word in items if word not in words_to_remove]

        if items:
            ingred_list.append(' '.join(items))
    return ' '.join(ingred_list)

When we parse the ingredients of "Gennaro's classic spaghetti carbonara", we get: egg yolk parmesan cheese pancetta spaghetti garlic. Great, exactly what we want!
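As a sanity check, here's that parse as a function call (a sketch; exactly which tokens get dropped depends on the full measures and words_to_remove lists):

carbonara = [
    '3 large egg yolks',
    '40 g Parmesan cheese',
    '1 x 150 g piece of higher-welfare pancetta',
    '200 g dried spaghetti',
    '1 clove of garlic',
    'extra virgin olive oil',
]
print(ingredient_parser(carbonara))
# -> 'egg yolk parmesan cheese pancetta spaghetti garlic'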

Using a lambda function, it's easy to parse all of the ingredients:

import pandas as pd
import config

recipe_df = pd.read_csv(config.RECIPES_PATH)
recipe_df['ingredients_parsed'] = recipe_df['ingredients'].apply(lambda x: ingredient_parser(x))
df = recipe_df.dropna()
df.to_csv(config.PARSED_PATH, index=False)

Extracting features

We now need to encode each document (each recipe's parsed ingredients). As before, a simple model proved very effective.

One of the most basic models in NLP is the bag of words. It involves building a huge sparse matrix that stores the counts of every word in our corpus (all the documents, i.e., the ingredients of every recipe). scikit-learn has a good implementation: CountVectorizer.
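As a minimal sketch of the idea (using a made-up two-recipe corpus):

from sklearn.feature_extraction.text import CountVectorizer

# hypothetical mini-corpus of parsed ingredient strings
docs = ['pasta tomato onion', 'pasta garlic chilli']

bow = CountVectorizer()
X = bow.fit_transform(docs)  # sparse document-term count matrix

print(bow.get_feature_names_out())
# ['chilli' 'garlic' 'onion' 'pasta' 'tomato']
print(X.toarray())
# [[0 0 1 1 1]
#  [1 1 0 1 0]]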

The bag of words worked well, but TF-IDF (term frequency-inverse document frequency) performed just a little better, so that's what we went with. I won't elaborate on how TF-IDF works, as it's beyond the scope of this post. As usual, scikit-learn has a good implementation: TfidfVectorizer. I then saved the model and the encodings with pickle, because retraining the model every time the API is used would make it very slow.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
import config

# load in the parsed recipe dataset
df_recipes = pd.read_csv(config.PARSED_PATH)

# TfidfVectorizer needs unicode or string types
df_recipes['ingredients_parsed'] = df_recipes.ingredients_parsed.values.astype('U')

# TF-IDF feature extractor
tfidf = TfidfVectorizer()
tfidf.fit(df_recipes['ingredients_parsed'])
tfidf_recipe = tfidf.transform(df_recipes['ingredients_parsed'])

# save the tfidf model and encodings
with open(config.TFIDF_MODEL_PATH, "wb") as f:
    pickle.dump(tfidf, f)

with open(config.TFIDF_ENCODING_PATH, "wb") as f:
    pickle.dump(tfidf_recipe, f)

Recommendation system

The application deals only with text data, and no ratings of any kind are available, so matrix factorization methods, SVD-based methods, and correlation-coefficient-based methods are off the table.

We use content-based filtering, which lets us recommend recipes based on the attributes (ingredients) supplied by the user. To measure the similarity between documents, I used cosine similarity. I also tried spaCy and KNN, but cosine similarity won out in terms of performance (and ease of use).

Mathematically, cosine similarity measures the cosine of the angle between two vectors. I chose this similarity measure because even when two similar documents are far apart in Euclidean distance (because of document size), they can still point in nearly the same direction.

For example, if a user enters a long list of ingredients and only the first half matches a recipe, we should in theory still get a good recipe match. With cosine similarity, the smaller the angle, the higher the similarity, so we try to maximize that score.
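To make that concrete, here's a tiny sketch (with made-up count vectors) showing that cosine similarity cares about direction rather than magnitude:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

short_doc = np.array([[1, 1, 0]])   # a short ingredient list
long_doc = np.array([[10, 10, 0]])  # same proportions, ten times 'longer'
other_doc = np.array([[0, 1, 1]])   # partial overlap

print(cosine_similarity(short_doc, long_doc))   # [[1.]]  -- identical direction
print(cosine_similarity(short_doc, other_doc))  # [[0.5]] -- less overlap, lower score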

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
import config
from ingredient_parser import ingredient_parser

# load the tfidf model and encodings
with open(config.TFIDF_ENCODING_PATH, 'rb') as f:
    tfidf_encodings = pickle.load(f)
with open(config.TFIDF_MODEL_PATH, "rb") as f:
    tfidf = pickle.load(f)

# parse the input ingredients using ingredient_parser
try:
    ingredients_parsed = ingredient_parser(ingredients)
except Exception:
    ingredients_parsed = ingredient_parser([ingredients])

# use our pre-trained tfidf model to encode the input ingredients
ingredients_tfidf = tfidf.transform([ingredients_parsed])

# calculate the cosine similarity between the input and all of the recipes
cos_sim = map(lambda x: cosine_similarity(ingredients_tfidf, x), tfidf_encodings)
scores = list(cos_sim)

I then wrote a function, get_recommendations, that ranks these scores and outputs a pandas DataFrame with all the details of the top-N recipes.

def get_recommendations(N, scores):
    # load in the recipe dataset
    df_recipes = pd.read_csv(config.PARSED_PATH)

    # order the scores and take the N highest
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:N]

    # create a dataframe to hold the recommendations
    recommendation = pd.DataFrame(columns=['recipe', 'ingredients', 'score', 'url'])

    count = 0
    for i in top:
        # title_parser and ingredient_parser_final are small formatting
        # helpers defined elsewhere in the project
        recommendation.at[count, 'recipe'] = title_parser(df_recipes['recipe_name'][i])
        recommendation.at[count, 'ingredients'] = ingredient_parser_final(df_recipes['ingredients'][i])
        recommendation.at[count, 'url'] = df_recipes['recipe_urls'][i]
        recommendation.at[count, 'score'] = "{:.3f}".format(float(scores[i]))
        count += 1
    return recommendation
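The Flask app in the next section calls this whole pipeline through a single function, rec_sys.RecSys. A minimal sketch of how the pieces above might be tied together (the exact signature and internals here are my assumption) looks like this:

def RecSys(ingredients, N=5):
    # load the pre-trained tfidf model and recipe encodings
    with open(config.TFIDF_ENCODING_PATH, 'rb') as f:
        tfidf_encodings = pickle.load(f)
    with open(config.TFIDF_MODEL_PATH, 'rb') as f:
        tfidf = pickle.load(f)

    # parse and encode the user's ingredients
    try:
        ingredients_parsed = ingredient_parser(ingredients)
    except Exception:
        ingredients_parsed = ingredient_parser([ingredients])
    ingredients_tfidf = tfidf.transform([ingredients_parsed])

    # score every recipe and return the top-N recommendations
    scores = list(map(lambda x: cosine_similarity(ingredients_tfidf, x),
                      tfidf_encodings))
    return get_recommendations(N, scores)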

It's worth noting that there's no concrete way to evaluate the performance of a model like this, so I had to evaluate the recommendations manually. Honestly though, it was really fun... and I discovered a lot of new recipes!

Right now, some of the things in my fridge/cupboard are: ground beef, pasta, tomato sauce, bacon, onion, zucchini, and cheese. The recommendation system suggests:

{ "ingredients" : "1 (15 ounce) can tomato sauce, 1 (8 ounce) package uncooked pasta shells, 1 large zucchini - peeled and cubed, 1 teaspoon dried basil, 1 teaspoon dried oregano, 1/2 cup white sugar, 1/2 medium onion, finely chopped, 1/4 cup grated Romano cheese, 1/4 cup olive oil, 1/8 teaspoon crushed red pepper flakes, 2 cups water, 3 cloves garlic, minced",

  "recipe" : "Zucchini and Shells",  
  
  "score: "0.760",
 
  "url":"https://www.allrecipes.com/recipe/88377/zucchini-and-shells/"
}

Sounds good to me - time to get cooking!


Creating an API to deploy the model

Using Flask

So, how do I get the model I've built to end users? I created an API that takes ingredients as input and outputs the top 5 recipe suggestions based on those ingredients. To build it, I used Flask, a micro web framework.

# app.py
from flask import Flask, jsonify, request
import json, requests, pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from ingredient_parser import ingredient_parser
import config, rec_sys

app = Flask(__name__)

@app.route('/', methods=["GET"])
def hello():
    # this is our API's home page,
    # accessible at http://127.0.0.1:5000/
    return HELLO_HTML

HELLO_HTML = """
     <html><body>
         <h1>Welcome to my api: Whatscooking!</h1>
         <p>Please add some ingredients to the url to receive recipe recommendations.
            You can do this by appending "/recipe?ingredients= Pasta Tomato ..." to the current url.
         <br>Click <a href="/recipe?ingredients= pasta tomato onion">here</a> for an example when using the ingredients: pasta, tomato and onion.
     </body></html>
     """

@app.route('/recipe', methods=["GET"])
def recommend_recipe():
    # accessible at http://127.0.0.1:5000/recipe
    ingredients = request.args.get('ingredients')
    recipe = rec_sys.RecSys(ingredients)

    # we need to convert the output to JSON
    response = {}
    count = 0
    for index, row in recipe.iterrows():
        response[count] = {
            'recipe': str(row['recipe']),
            'score': str(row['score']),
            'ingredients': str(row['ingredients']),
            'url': str(row['url'])
        }
        count += 1

    return jsonify(response)

if __name__ == "__main__":
    app.run(host="0.0.0.0", debug=True)

We can start the API by running the command python app.py; it will start on port 5000 of the localhost. We can then visit http://192.168.1.51:5000/recipe?ingredients=%20pasta%20tomato%20onion to get recipe recommendations for pasta, tomato, and onion.
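For example, querying the endpoint from Python might look like this (a sketch using the requests library; swap in whatever host and port the app is actually running on):

import requests

# hypothetical local address -- replace with your own host/port
response = requests.get(
    'http://127.0.0.1:5000/recipe',
    params={'ingredients': 'pasta tomato onion'},
)
print(response.json())  # the top-5 recommendations as JSON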

Deploying the Flask API to Heroku

If you use GitHub, deploying the Flask API to Heroku is very easy! First, I created a file named Procfile (it has no extension) in my project folder. All you need to type in the file is:

web: gunicorn app:app

This tells Heroku to serve the app object inside app.py with the Gunicorn web server. The next step is to create a requirements.txt file containing all the Python libraries used in the project.

If you work in a virtual environment (I use conda), you can use pip freeze > requirements.txt. Make sure you run it in the correct working directory, otherwise the file will be saved elsewhere.

Now all I had to do was commit the changes to my GitHub repository and then follow the deployment steps at https://dashboard.heroku.com/apps. If you want to try out or use my API, visit:

  • https://whats-cooking-recommendation.herokuapp.com/ - if you're in America
  • https://whatscooking-deployment.herokuapp.com/ - if you're in Europe
  • If you're somewhere else, either one works; it will just be a little slower

Docker

We've now reached the stage where I'm happy with the model I've built, so I'd like to be able to distribute it to other people so they can use it too.

I've uploaded my entire project to GitHub, but that's not enough. Just because the code works on my computer doesn't mean it will work on someone else's.

It would be great if, when distributing the code, I could replicate my own computer along with it, so I'd know it will work. One of the most popular ways to do this is with Docker containers. The first thing I did was create a docker file named Dockerfile (it has no extension). In short, the docker file describes how to build the environment and contains all the commands a user could call on the command line to assemble the image.

# the base image (operating system) to build from
FROM ubuntu:18.04

MAINTAINER Jack Leitch 'jackmleitch@gmail.com'

# -y automatically answers yes to the install prompts
RUN apt-get update && apt-get install -y \
    git \
    curl \
    ca-certificates \
    python3 \
    python3-pip \
    sudo \
    && rm -rf /var/lib/apt/lists/*

# set the working directory
WORKDIR /app

# copy everything in the current directory into the app directory
ADD . /app

# install all the requirements
RUN pip3 install -r requirements.txt

# download wordnet, as it is used for lemmatization
RUN python3 -c "import nltk; nltk.download('wordnet')"

# CMD is executed once the container is started
CMD ["python3", "app.py"]

Once I had created the docker file, I needed to build my container - which is very simple.

Side note: if you do this, make sure none of your file paths (I keep mine in a config.py file) are specific to your computer, because Docker, much like a virtual machine, contains its own file system. For example, you can use relative paths like ./input/df_recipes.csv.

docker build -f Dockerfile -t whatscooking:api .

Now, to start the API on any machine(!), all we need to do (assuming Docker is installed on it) is:

docker run -p 5000:5000 -d whatscooking:api

If you want to inspect the container yourself, here's the link to my Docker Hub: https://hub.docker.com/repository/docker/jackmleitch/whatscooking. You can pull the image with:

docker pull jackmleitch/whatscooking:api

The next plan is to use Streamlit to build a better interface for the API.

