Lights, Camera, Action: Building a Multilingual Movie Recommender!

Have you ever found yourself spending way too much time browsing through a streaming platform to find the perfect movie? We've all been there, scrolling through countless titles and descriptions, trying to pinpoint something that truly matches our interests.
In this article, we'll walk you through building a movie recommendation app created by Cohere's Machine Learning Engineer, Amr Kayid. Once it's complete, you'll be able to describe the type of movie you'd like to watch, and the app will suggest movies that match. The app also supports multilingual search, so you can describe your ideal movie in various languages. This works because Cohere's multilingual models embed movie descriptions into language-invariant representations.
Follow the steps below to build your movie recommendation app:
Step 1: Gather movie data
Step 2: Build a user interface to get movie descriptions
Step 3: Filter movies by language
Step 4: Embed the movie description
Step 5: Calculate movie similarity
Step 6: Show similar movies
Step 7: Set up and run
Building a Movie Recommendation App
You'll build an app that takes movie descriptions in any language and recommends similar movies via Cohere's multilingual models. You'll also be able to filter movies by language and choose how many to display in an interactive and user-friendly interface.
Prerequisites
To complete this project, register for an account with Cohere and generate your API key to access the Cohere API's resources.
The project code on GitHub contains the following files:
- The movies.py file contains the code of the movie recommender.
- The utils.py file provides some utilities.
- The requirements.txt file lists the project dependencies.
- The data/the_movies_with_embeddings.json file contains a movie dataset and precomputed embeddings of each movie's description.
- The Makefile builds a virtual environment for the project.
You can use the Makefile to create a virtual environment with the required libraries and dependencies by running the make command, or just install the dependencies by running pip install -r requirements.txt, either in a notebook or in your local environment.
You should also set up a .env file at the root of your project directory to hold your Cohere API key. To get the API key, log in to the Cohere dashboard and navigate to the API Keys section. Keep this information private!
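As a small illustration of how the key can be read once it is in the environment, here is a minimal sketch. The helper name read_api_key and its error message are assumptions made for this example, not part of the project code; the variable name COHERE_API_KEY matches the one used in Step 4. Failing early with a clear message is one way to avoid a confusing authentication error later.

```python
import os


def read_api_key(var_name: str = "COHERE_API_KEY") -> str:
    """Return the API key stored in the environment, or fail with a clear message.

    Hypothetical helper: the project itself reads the key directly in Step 4.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Add it to your .env file or export it in your shell."
        )
    return key
```

You would call read_api_key() after the .env file has been loaded into the environment.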
Step 1: Gather Movie Data
To begin, you'll build a movie database containing fields for the title, description, original language, and more.
You can get this data from a variety of sources, such as the Internet Movie Database (IMDb), but this tutorial will use a premade JSON file (the_movies_with_embeddings.json), which already contains information from nearly 45,000 movies. This file also includes pre-calculated embeddings of each movie's description. You'll see how to embed a movie description later on.
The code below builds the movie database. The setup_movie_database function loads the JSON file, drops rows without an IMDb ID, fills the remaining empty fields with empty strings, adds a movieId field, keeps only the listed fields, and returns the movie database together with the embeddings of the movie descriptions. The movies_df and candidates variables store the movie database and the embeddings, respectively.
import numpy as np
import pandas as pd
import streamlit as st

@st.cache()
def setup_movie_database():
    MOVIE_FIELDS = ["movieId", "id", "imdb_id", "original_title", "title", "overview",
                    "genres", "release_date", "language_code", "lang2idx", "language_name",
                    "embeddings"]
    movies_df = pd.read_json("./data/the_movies_with_embeddings.json", orient="index")
    # Drop movies without an IMDb ID, then blank out the remaining missing values.
    movies_df = movies_df.dropna(subset=['imdb_id'])
    movies_df = movies_df.fillna("")
    movies_df['movieId'] = movies_df.index
    movies_df = movies_df[MOVIE_FIELDS]
    # One row per movie: the precomputed embedding of its description.
    candidates = np.array(movies_df.embeddings.values.tolist(), dtype=np.float32)
    return movies_df, candidates

movies_df, candidates = setup_movie_database()
movies_available_languages = sorted(movies_df.language_name.unique().tolist())
print(f"Movie database ready! We have {len(movies_df)} movies in our database in "
      f"{len(movies_available_languages)} different languages.")
Running this code should give the following output: "Movie database ready! We have 44497 movies in our database in 112 different languages."
Step 2: Build a User Interface to Get Movie Descriptions
In this step, you'll use the Streamlit Python library to build an interactive and user-friendly interface in your browser.
To improve the interface design, use the streamlit_header_and_footer_setup function implemented in the utils.py file. This will set up a header and footer for your app, apply custom CSS to style both, and incorporate the Cohere brand logo into the header.
Next, title the app "Movies Search and Recommendation." The st.text_input function will create a text input box for the user to enter a movie description.
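The code for this step is not shown, so here is a minimal sketch of what the title and input box might look like. The function name render_query_box, the input label, and the choice to pass the Streamlit module in as a parameter are all assumptions made for illustration; only st.title and st.text_input come from the description above. Passing the module in keeps the sketch runnable without a browser.

```python
def render_query_box(ui):
    """Draw the app title and the description box; return what the user typed.

    `ui` is expected to be the Streamlit module (import streamlit as st).
    The label text below is a hypothetical placeholder.
    """
    ui.title("Movies Search and Recommendation")
    query_text = ui.text_input("Describe the movie you'd like to watch")
    # Streamlit returns an empty string until the user types something.
    return query_text.strip()
```

In the app, this could be called as query_text = render_query_box(st) after importing streamlit as st.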
Step 3: Filter Movies by Language
Next, you'll create an expandable section called "Search Fields, Expand me!" in your interface. This section will include two fields:
- A slider for the user to select the number of movies to show in the app
- A drop-down menu that displays the available movie languages in the database, allowing the user to choose multiple desired languages for movies
The following code also defines the movie data fields to display in the interface.
from typing import List

search_expander = st.expander(label='Search Fields, Expand me!')
with search_expander:
    limit = st.slider("Number of movies to show", min_value=1, max_value=100,
                      value=5, step=1)
    selected_languages = st.multiselect(
        label=f"Desired languages | Number of Unique languages: {len(movies_available_languages)}",
        options=movies_available_languages)

# Movie fields to display for each recommendation.
output_fields: List[str] = ["movieId", "id", "imdb_id", "original_title",
                            "title", "overview", "genres", "release_date",
                            "language_code", "lang2idx", "language_name"]
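Step 6 reads rows from a table named sub_movies_df, but the article does not show how the language filter is applied. The sketch below illustrates one way to do it in plain Python, using lists in place of the DataFrame and the embedding array. The function name filter_by_language, the toy data, and the choice to treat an empty selection as "no filter" are assumptions of this example. The point it makes is that movies and embeddings should be filtered together, so that a position in the filtered embeddings still refers to the same movie.

```python
from typing import Dict, List, Sequence, Tuple


def filter_by_language(
    movies: Sequence[Dict[str, str]],
    embeddings: Sequence[Sequence[float]],
    selected_languages: Sequence[str],
) -> Tuple[List[Dict[str, str]], List[Sequence[float]]]:
    """Keep only movies in the selected languages, plus their embeddings.

    An empty selection is treated as "no filter" (an assumption of this sketch).
    """
    if not selected_languages:
        return list(movies), list(embeddings)
    wanted = set(selected_languages)
    kept = [(m, e) for m, e in zip(movies, embeddings) if m["language_name"] in wanted]
    sub_movies = [m for m, _ in kept]
    sub_embeddings = [e for _, e in kept]
    return sub_movies, sub_embeddings


# Made-up rows, only to show the shape of the inputs.
movies = [
    {"title": "A", "language_name": "English"},
    {"title": "B", "language_name": "Korean"},
    {"title": "C", "language_name": "Spanish"},
]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
sub_movies, sub_embeddings = filter_by_language(movies, embeddings, ["English", "Spanish"])
print([m["title"] for m in sub_movies])  # ['A', 'C']
```

With pandas, the same idea would be a boolean mask on the language_name column applied to both the DataFrame and the embedding array.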
Step 4: Embed the Movie Description
In this step, you'll use the Cohere embed endpoint to embed a movie description provided by the user. Start by setting up the Cohere API with your API key.
Use the multilingual-22-12 model to embed the movie description. If the text is longer than 4096 tokens, the truncate parameter of the co.embed function will truncate it for you. The co.embed function converts query_text into an embedding, a 768-dimensional vector of floats that represents the movie description.
Text that means the same thing in different languages will be close together in the embedding space. To achieve this, Cohere's embed endpoint uses a multilingual model to represent text in a language-agnostic way.
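import os
import cohere
import numpy as np
from dotenv import load_dotenv

load_dotenv(".env")
COHERE_API_KEY = os.environ.get("COHERE_API_KEY")
co = cohere.Client(COHERE_API_KEY)
model_name: str = 'multilingual-22-12'
# query_text is the movie description entered in the Step 2 text box.
vectors_to_search = np.array(co.embed(model=model_name, texts=[query_text],
                                      truncate="RIGHT").embeddings, dtype=np.float32)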
Step 5: Calculate Movie Similarity
To find the movies that are most similar to a movie description, you'll take the dot product of the query embedding with every candidate embedding in the get_similarity function. This function calculates the similarity between a target movie description and a set of candidate movie descriptions in the embedding space. This step is crucial because it determines the quality of the recommendations.
You'll get the highest similarity scores using the torch.topk function. The code calls them cos_scores; a dot product equals cosine similarity only when the vectors have unit length. Notice that the PyTorch library functions expect tensors as input. Use the torchfy lambda function to convert your arrays to tensors.
from typing import List

import torch

torchfy = lambda x: torch.as_tensor(x, dtype=torch.float32)

def get_similarity(target: List[float], candidates: List[float], top_k: int):
    # Transpose candidates to (dim, n_movies) so that (1, dim) @ (dim, n_movies) gives (1, n_movies).
    candidates = torchfy(candidates).transpose(0, 1)
    target = torchfy(target)
    cos_scores = torch.mm(target, candidates)
    # Keep the top_k highest scores and the positions of the matching candidates.
    scores, indices = torch.topk(cos_scores, k=top_k)
    similarity_hits = [{'id': idx, 'score': score} for idx, score in
                       zip(indices[0].tolist(), scores[0].tolist())]
    return similarity_hits

result = get_similarity(vectors_to_search, candidates=candidates, top_k=limit)
print(result)
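To make the relationship between the dot product and cosine similarity concrete, here is a small plain-Python sketch that normalizes each vector before scoring. It is not the project's implementation: the function names normalize and top_k_cosine and the toy vectors are assumptions made for this example, and it uses heapq in place of torch.topk. Whether the precomputed embeddings are already unit length is not stated in this article, so normalizing explicitly is the cautious choice if you want true cosine scores.

```python
import heapq
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_cosine(target: Sequence[float],
                 candidates: Sequence[Sequence[float]],
                 top_k: int) -> List[dict]:
    """Return the top_k candidates by cosine similarity, best first."""
    unit_target = normalize(target)
    scored = []
    for idx, cand in enumerate(candidates):
        unit_cand = normalize(cand)
        # Dot product of unit vectors is the cosine of the angle between them.
        score = sum(t * c for t, c in zip(unit_target, unit_cand))
        scored.append({"id": idx, "score": score})
    return heapq.nlargest(top_k, scored, key=lambda hit: hit["score"])


# Toy vectors: candidate 0 is long but points elsewhere, candidate 1 points the same way as the target.
hits = top_k_cosine([1.0, 1.0], [[3.0, 0.0], [1.0, 1.0], [0.0, -1.0]], top_k=2)
print([hit["id"] for hit in hits])  # [1, 0]
```

With raw dot products, candidate 0 would score 3 and outrank candidate 1 (which scores 2), even though candidate 1 points in exactly the same direction as the target.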
Step 6: Show Similar Movies
Now, it's time to display the movies most similar to a given movie description. You'll process the outputs of the get_similarity function to create a dictionary called similar_results. The dictionary maps each hit's rank to a nested dictionary containing that movie's information.
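similar_results = {}
for index, hit in enumerate(result):
    # sub_movies_df is not defined in this article; it is assumed to be the
    # movie table after the Step 3 language filter.
    similar_example = sub_movies_df.iloc[hit['id']]
    similar_results[index] = {movie_field: similar_example[movie_field]
                              for movie_field in output_fields}
print(similar_results)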
Use the Streamlit library to display the movies as a grid with five columns per row. Each entry in the grid shows the movie ID, URL, cover image, title, overview, genres, release date, language, and distance. If any of the values are missing or cannot be retrieved, the code moves on to the next movie.
for index in range(0, len(similar_results), 5):
    cols = st.columns(5)
    for i in range(5):
        try:
            movie = similar_results[index + i]
            # genres is stored as a string; eval turns it into a list of dicts (trusted data only).
            genres = [genre['name'] for genre in eval(movie['genres'])]
            cols[i].markdown(f"**movieId**: {movie['movieId']}")
            cols[i].markdown(f"**URL**: https://www.imdb.com/title/{movie['imdb_id']}/")
            try:
                # ia (an IMDb client) and images_cache (a dict) are assumed to be
                # defined elsewhere in the project code.
                imdb_id = movie['imdb_id'].replace("tt", "")
                image = images_cache[imdb_id] = images_cache.get(
                    imdb_id, ia.get_movie(imdb_id).data['cover url'])
                cols[i].markdown(
                    f'<img src="{image}" style="width:100%;height:75%;border-radius: 5%;">',
                    unsafe_allow_html=True)
            except Exception:
                # No cover image available; show the text fields anyway.
                pass
            cols[i].markdown(f"**Original Title**: {movie['original_title']}")
            cols[i].markdown(f"**English Title**: {movie['title']}")
            cols[i].markdown(f"**Overview**: {movie['overview']}")
            cols[i].markdown(f"**Genres**: {genres}")
            cols[i].markdown(f"**Release Date**: {movie['release_date']}")
            cols[i].markdown(f"**Language**: {movie['language_name']}")
            # similar_results has no 'distance' key, so show the similarity score from Step 5.
            cols[i].markdown(f"**Distance**: {result[index + i]['score']}")
        except Exception:
            # Missing entry or field: move on to the next movie.
            continue
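The display code calls eval on the genres field, which will execute whatever Python expression the string contains. If you adapt the app to data you do not fully control, a safer option is ast.literal_eval, which only accepts Python literals. The sketch below assumes, based on the code above, that genres is a stringified list of dictionaries with a name key; the helper name parse_genre_names and the sample string are made up for illustration.

```python
import ast
from typing import List


def parse_genre_names(raw_genres: str) -> List[str]:
    """Turn a stringified list of genre dicts into a list of genre names.

    Returns an empty list when the field is blank or malformed.
    """
    if not raw_genres:
        return []
    try:
        parsed = ast.literal_eval(raw_genres)
    except (ValueError, SyntaxError):
        # Not a plain Python literal (for example, a function call).
        return []
    if not isinstance(parsed, list):
        return []
    return [g["name"] for g in parsed if isinstance(g, dict) and "name" in g]


print(parse_genre_names("[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name': 'Crime'}]"))
# ['Drama', 'Crime']
```

Because the helper returns an empty list instead of raising, a movie with a missing genres value would still be displayed with its other fields.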
Step 7: Set Up and Run
To run the movie recommendation app, open a new terminal and execute streamlit run movies.py. This gives you a link in your terminal to view your app in your browser.
Input a movie description such as: "A hard-nosed cop reluctantly teams up with a wise-cracking criminal, temporarily paroled to him, to track down a killer."
Select English, Romanian, and Spanish in the expandable section to see recommendations in those languages.
Now, use DeepL to translate the movie description to Korean or a different language. You should get the movie 48 Hrs. in your outputs since you're using Cohere's multilingual model.
You Just Built a Movie Recommendation App!
By following the steps in this article, you've created a movie recommendation app that recommends movies based on a description written in any language. To accomplish this, you used Cohere's multilingual model, which embeds movie descriptions into a language-invariant embedding space, capturing similarities between movies regardless of language-specific features.
Ready to jazz up your language game? Sign up for a FREE Cohere account and unlock the awesomeness of multilingual language models! Your world's about to get way more colorful and linguistically lit! Let's go, language explorers!