In a previous article, we talked about different use cases for Large Language Models. While generating text is the most popular, there's a lot of fun to be had with use cases like text embeddings to perform topic modeling.

Topic modeling uses unsupervised learning to extract topics or themes from a collection of documents. Data scientists also apply clustering methods to processes like automatic document organization and rapid information retrieval or filtering.

Perhaps the most impressive facet of clustering is that, despite its potential, it’s easy to integrate into your applications. In this demo, we’ll use the Cohere Embed endpoint to plot out and cluster a list of AI research papers and identify trends.

We’ll write a simple application that scrapes the Journal of Artificial Intelligence Research, performing semantic searching and clustering of paper titles to discover trends in AI. Our application will output the results as a list of recently published AI-themed papers.

Then, we’ll use Cohere’s Embed endpoint to generate word embeddings using our list of AI papers, which we will visualize and use to build our semantic search and topic modeling application.

Prerequisites

To follow along with this tutorial, you’ll need:

Familiarity with Python.
Python version 3.6 or later installed on your development machine. Alternatively, you can use Google Colab to build the project in the cloud.
A Cohere account. Register for a new Cohere account to receive $75 worth of credits. Once you’ve used your credits, you’ll have access to a pay-as-you-go option.

You can find the full project code on GitHub.

Getting started

First, install the Python dependencies required to run the project using the command below:

pip install requests beautifulsoup4 cohere altair clean-text numpy pandas scikit-learn wordcloud matplotlib

Next, you’ll need to create an API key to use the Cohere Platform. To do this, log into your Cohere account and click Dashboard.

Next, click the Create API Key button in the API Keys window and assign a name to your new key. Ensure that you copy this API key to integrate with Cohere.

Now, create a new folder in your development machine. Inside the folder, create a new Python file named cohere_nlp.py. Write all of your code in this file.

Then, import the dependencies and initialize Cohere’s client:

import cohere
# Paste your API key here. Remember to not share it publicly
api_key = '<API-KEY>'
co = cohere.Client(api_key)
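Pasting the key into the file is fine for a quick experiment. If you plan to share or commit the code, one option is to read the key from an environment variable instead. The sketch below assumes a variable named COHERE_API_KEY and a helper named load_api_key; both names are choices made for this example, not something the Cohere client requires.

```python
import os

def load_api_key(var_name="COHERE_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name fits your setup.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key

# Usage, once the variable is set in your shell:
# co = cohere.Client(load_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the first API call.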

Data collection and cleaning

Since this tutorial focuses on applying topic modeling to look for recent trends in AI, you need to source a list of titles of AI papers. To do this, you’ll need to use web scraping techniques to collect a list of papers, with the Journal of Artificial Intelligence Research serving as your data source. Finally, we will clean this data by removing unwanted characters.

First, import the required libraries to make web requests and process the web content.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from cleantext import clean

Next, make an HTTP request to the source website that has an archive of the AI papers.

URL = "https://www.jair.org/index.php/jair/issue/archive"
page = requests.get(URL)

Use this archive to get the published list of AI papers. While this archive features papers published beginning in 2015, our code filters for papers published more recently (on or after 2020).

soup = BeautifulSoup(page.content, "html.parser")
archive_links = []
for link in soup.select('a.title'):
  vol = link.text
  link = link.get('href')
  # split year from the volume eg Vol. 73 (2020)
  year = int(vol[vol.find("(")+1:vol.find(")")])
  if year >= 2020:
    archive_links.append({ 'year': year, 'link': link })
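The year extraction above relies on every link label containing a year in parentheses. If a label might lack one, a slightly more defensive variant could use a regular expression and skip labels that do not match. This is a minimal sketch: parse_year is a hypothetical helper, and apart from the "Vol. 73 (2020)" format mentioned in the code comment, the sample labels are made up for illustration.

```python
import re

# Matches a four-digit year wrapped in parentheses, e.g. "(2020)".
YEAR_PATTERN = re.compile(r"\((\d{4})\)")

def parse_year(volume_label):
    """Return the year found in parentheses, or None if there is none."""
    match = YEAR_PATTERN.search(volume_label)
    return int(match.group(1)) if match else None

# Made-up labels; only the "Vol. NN (YYYY)" shape comes from the tutorial.
labels = ["Vol. 73 (2020)", "Vol. 64 (2019)", "Special Track"]
recent = [label for label in labels if (parse_year(label) or 0) >= 2020]
print(recent)
```

Returning None rather than raising keeps one malformed label from stopping the whole scrape.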

Finally, visit each archived issue, collect its paper titles, and clean them by lowercasing the text and removing stray whitespace and unwanted characters.

papers = []
for archive in archive_links:
  page = requests.get(archive['link'])
  soup = BeautifulSoup(page.content, "html.parser")
  links = soup.select('h3.media-heading a')
  for link in links:
    # clean the title
    title = clean(text=link.text,
            lower=True,
            no_line_breaks=False,
            no_numbers=False,
            no_punct=False,
            lang="en")
    papers.append({ 'year': archive['year'], 'title': title, 'link': link.get('href') })

The dataset created using this process has 258 AI papers published between 2020 and 2022. Use the pandas library to create a data frame to hold your text data.

df = pd.DataFrame(papers)

Create and visualize text embeddings

Word embedding is a technique for computers to assign and learn representations of words so that words with similar meanings will have similar representations. An embedding is a list of floating-point numbers that capture the semantic meaning of the represented text. You can use these embeddings to:

Cluster large amounts of text
Match a query with other similar sentences
Perform classification tasks, such as sentiment classification
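To make the idea of "similar representations" concrete, here is a small sketch that compares toy vectors with cosine similarity, the measure used later in this tutorial. The three-dimensional vectors and the titles are invented for illustration; real embeddings have many more dimensions and come from the model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for real embeddings.
toy_embeddings = {
    "neural machine translation": [0.9, 0.1, 0.0],
    "language model pretraining": [0.8, 0.3, 0.1],
    "voting rule manipulation": [0.0, 0.2, 0.9],
}

query = toy_embeddings["neural machine translation"]
for title, vector in toy_embeddings.items():
    print(f"{cosine(query, vector):.2f}  {title}")
```

The two language-related titles point in nearly the same direction, so their score is close to 1, while the voting title scores much lower.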

Cohere’s platform provides an embed endpoint that returns text embeddings. The models used to create these embeddings are available in small, medium, and large sizes. Small models are faster, while large models offer higher-quality results.

Now, you’ll need to create text embeddings using Cohere’s API. The list of titles for AI papers will be used as the input. The embeddings will be stored in a new column inside your dataframe.

df['title_embeds'] = co.embed(
  model='medium',
  texts=df['title'].tolist()).embeddings

That’s all you need to create the word embeddings. Feel free to try it out with the 'small' and 'large' models as well.
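The call above sends every title in a single request, which works for a few hundred titles. For a much larger dataset, you may want to split the texts into batches. The sketch below assumes a per-request limit exists and uses a placeholder batch size of 64; check the current API documentation for the real limit. The helper embed_in_batches and the stand-in fake_embed are hypothetical names used only for this example.

```python
def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed texts in chunks and return one vector per text, in order.

    embed_fn is any callable that takes a list of strings and returns a
    list of vectors. batch_size=64 is a placeholder, not a documented limit.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        embeddings.extend(embed_fn(texts[start:start + batch_size]))
    return embeddings

def fake_embed(batch):
    """Stand-in for the real endpoint so the sketch runs offline."""
    return [[float(len(text))] for text in batch]

titles = [f"paper {i}" for i in range(5)]
vectors = embed_in_batches(titles, fake_embed, batch_size=2)
print(len(vectors))

# With the client from this tutorial, embed_fn could look like this
# (model_name is whichever model you chose above):
# embed_fn = lambda batch: co.embed(model=model_name, texts=batch).embeddings
```

Passing the embedding call in as a function keeps the batching logic testable without network access.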

Now, you can visualize the embeddings using a scatter plot. First, you’ll need to reduce the dimensions of the embeddings by using the Principal Component Analysis (PCA) method.

Start by importing the necessary packages and creating a function to return the principal components.

# Reduce dimensionality using PCA
from sklearn.decomposition import PCA
# Function to return the principal components
def get_pc(arr,n):
  pca = PCA(n_components=n)
  embeds_transform = pca.fit_transform(arr)
  return embeds_transform

Next, create a function to generate a scatter plot chart. You’ll use the Altair library to create the charts.

import altair as alt
# Function to generate the 2D plot
def generate_chart(df,xcol,ycol,color='basic',title=''):
  chart = alt.Chart(df).mark_circle(size=500).encode(
    x= alt.X(xcol,
      scale=alt.Scale(zero=False),
      axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    y= alt.Y(ycol,
      scale=alt.Scale(zero=False),
      axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    color= alt.value('#333293') if color == 'basic' else color,
    tooltip=['title']
  )
  result = chart.configure(background="#FDF7F0"
        ).properties(
        width=800,
        height=500,
        title=title
       ).configure_legend(
  orient='bottom', titleFontSize=18,labelFontSize=18)
  return result

Finally, use the reduced two-dimensional embeddings to create a scatter plot.

sample = 200
embeds = np.array(df['title_embeds'].tolist())
embeds_pca = get_pc(embeds,2)
df_pca = pd.concat([df, pd.DataFrame(embeds_pca)], axis=1)
df_pca.columns = df_pca.columns.astype(str)
generate_chart(df_pca.iloc[:sample],'0','1',title='2D Embeddings')

Here’s a chart that demonstrates the text embeddings for AI papers. It’s important to note that the chart represents a sample size of 200 papers.

Semantic search

Traditional search techniques rely on matching keywords to retrieve text-based information. Semantic search goes a step further by using the intent and contextual meaning of the query.

In this section, you’ll use Cohere to create embeddings for the search query and use the embeddings to compare with your dataset’s embeddings. The output is a list of similar AI papers.

First, create a function to get similarities between two embeddings. This will use the cosine similarity algorithm from the scikit-learn library.

from sklearn.metrics.pairwise import cosine_similarity
def get_similarity(target,candidates):
  candidates = np.array(candidates)
  target = np.expand_dims(np.array(target),axis=0)
  sim = cosine_similarity(target,candidates)
  sim = np.squeeze(sim).tolist()
  sort_index = np.argsort(sim)[::-1]
  sort_score = [sim[i] for i in sort_index]
  similarity_scores = zip(sort_index,sort_score)
  return similarity_scores

Next, create embeddings for the search query. Use the same model you used for the paper titles so that the two sets of vectors are comparable.

search_query = "graph network strategies"
search_query_embeds = co.embed( model='medium', texts=[search_query]).embeddings[0]

Now, you can check the similarity between the two embeddings and display the top ten most similar papers using your result.

similarity = get_similarity(search_query_embeds,embeds[:sample])
print('Query:')
print(search_query,'\n')
print('Similar AI papers:')
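for rank,(idx,sim) in enumerate(similarity):
  # flag every paper so the chart can color it
  if sim >= 0.30:
    df_pca.at[idx,'similar'] = 'yes'
  else:
    df_pca.at[idx,'similar'] = 'no'
  # print only the ten closest matches
  if rank < 10:
    print(f'Similarity: {sim:.2f};',df_pca.iloc[idx]['title'])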

Your result should appear similar to what’s shown below.

You can go a step further and visualize the semantic search result in a scatter plot. You’ll use the similar column created in the previous step, which records whether each paper’s similarity score is at least 0.30.

# Plot on a chart
generate_chart(df_pca.iloc[:sample],'0','1',color='similar',title='Semantic Search Visualization for Query: ' + search_query)

The plot below shows that the search query, “graph network strategies,” is located closest to the AI papers about puzzles/games, path-finding, and Bayesian probability.

Below is another plot displaying the semantic search results for “language and translation.” Similar nodes are located near nodes about linguistics, neural networks, and image captions.

Text clustering

Clustering is the process of grouping similar documents. As a result of clustering, you can discover and map emerging patterns. In this section, you will use the KMeans clustering algorithm to identify the top five clusters of similar papers.

First, import the KMeans algorithm from the scikit-learn package. Then, set up two variables: a copy of the dataset and the number of clusters.

from sklearn.cluster import KMeans
df_clust = df_pca.copy()
n_clusters=5

Next, initialize the KMeans model and use it to fit the embeddings to create the clusters.

kmeans_model = KMeans(n_clusters=n_clusters, random_state=0)
classes = kmeans_model.fit_predict(embeds).tolist()
df_clust['cluster'] = (list(map(str,classes)))
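If you are curious what fitting involves, the sketch below walks through the basic assign-and-update loop on a handful of made-up 2D points. It is a simplified illustration, not scikit-learn’s implementation: it starts from the first k points rather than using a smarter initialization, and it runs for a fixed number of iterations. The function name toy_kmeans is hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    return [sum(values) / len(points) for values in zip(*points)]

def toy_kmeans(points, k, iterations=10):
    """Basic k-means loop using the first k points as starting centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids

# Made-up points forming two obvious groups.
points = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2]]
labels, centroids = toy_kmeans(points, k=2)
print(labels)
```

The embeddings in this tutorial have far more than two dimensions, but the same two steps apply: assign each title to its nearest centroid, then recompute the centroids.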

K-means is an unsupervised machine learning model, meaning the clusters created will not have meaningful labels. To solve this problem, you are going to create a word cloud for each cluster. This will show you the keywords in each cluster, enabling you to assign a label to each cluster.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
stopwords = set(STOPWORDS)
 
for n in range(n_clusters):
  df_wordcloud = df_clust.loc[df_clust['cluster'] == str(n)]
  text = " ".join(i for i in df_wordcloud.title)
  wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = stopwords,
                min_font_size = 10).generate(text)
  plt.figure(figsize = (8, 8), facecolor = None)
  plt.imshow(wordcloud)
  plt.axis("off")
  plt.tight_layout(pad = 0)
  
  plt.show()
 
# manually create the labels for the clusters  after looking at top words in each cluster
df_clust['cluster'] = df_clust['cluster'].replace(["0",'1','2','3','4'],['Bayesian Networks','Election Manipulation', 'Multi-Agent Learning', 'Language Models', 'Explainable AI'])

The slideshow below shows the word cloud charts for the five clusters created earlier.
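If you prefer to read the keywords as text rather than as images, counting the most frequent words in each cluster gives a similar signal. This is a minimal sketch using only the standard library. The stop word list is a short hand-picked set (the wordcloud package ships a much longer one), top_words is a hypothetical helper, and the titles and cluster ids are made up for illustration.

```python
from collections import Counter

# A short hand-picked stop word list for this sketch.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def top_words(titles, n=3):
    """Return the n most frequent non-stop words across the titles."""
    counts = Counter(
        word
        for title in titles
        for word in title.lower().split()
        if word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(n)]

# Made-up titles grouped by made-up cluster ids.
clusters = {
    "0": ["learning bayesian networks", "inference in bayesian networks"],
    "1": ["manipulation of voting rules", "voting with partial information"],
}
for cluster_id, titles in clusters.items():
    print(cluster_id, top_words(titles, n=2))
```

With the data frame from this tutorial, you could pass the titles of each cluster to top_words in place of the toy lists.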

Finally, create a scatter plot to visualize the five clusters in your sample size.

df_clust.columns = df_clust.columns.astype(str)
generate_chart(df_clust.iloc[:sample],'0','1',color='cluster',title='Clustering with 5 Clusters')

Your results should appear similar to the example below.

Conclusion

This tutorial used Cohere’s NLP platform to create word embeddings, perform a semantic search, and cluster text. In this demo, we used a list of AI research papers, but the same approach works for any other large collection of text you want to explore.

Try it out yourself. Register with Cohere to get your API key and $75 worth of credits!