You can find Part 1 here.

It can be a bit overwhelming for someone new to Large Language Models (LLMs) to understand when and where to use them in natural language processing (NLP) use cases. In this blog series, we simplify LLM application by mapping out the seven broad categories of use cases that you can address with Cohere’s LLM.

In Part 1 of our series, we covered the first four use case categories: Generate, Summarize, Rewrite, and Extract. In this post, we will cover the other three: Search, Cluster, and Classify. Finally, we’ll look at how we can combine the different types, making their applications much more interesting and useful.

5. Search/Similarity

Any mention of LLMs will most likely spark discussion around their text generation capabilities, as we’ve seen in the previous four use cases. A less-discussed but equally powerful capability is text representation.

While text generation is about creating new text, text representation is about making sense of existing text. Think about the volume of unstructured text data being generated today, a trend that the increasingly ubiquitous internet has only accelerated. It would not be possible for humans to process this much information without NLP-powered automation.

One such use case category for text representation is similarity search. Given a text query, the goal is to find documents that are most similar to the query.

The most obvious example use case for this is search engines. As users, we expect the search results to return links and documents that are highly relevant to our query. Modern search engines work well because they match the query to the appropriate results not just by keyword matching, but also by semantic similarity.

In simple words, they are able to perform matching based on meaning, context, themes, ideas — abstract concepts that may use different words altogether, but very much relate to each other.

Let’s say a user enters the search string “ground transportation at the airport.” The search engine must be able to know that the user is looking for taxis, car rentals, trains, or other similar services, even if the user doesn’t explicitly mention them.

When we input a piece of text into a representation model, instead of generating more text, the model generates a set of numbers that represent the meaning or context of the input text. These numbers are called “text embeddings”. In LLMs, an embedding tends to be a very long sequence of numbers, typically in the thousands, and a longer embedding can generally store more information about the text.
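To make the idea concrete, here is a minimal sketch of similarity search over embeddings. It is not Cohere's implementation: the three-number embeddings below are hand-written, hypothetical values (real embeddings are far longer and would come from a representation model), and cosine similarity is assumed as the measure of closeness, which is a common choice but not the only one.

```python
import math

# Hypothetical 3-number embeddings, hand-written for illustration only.
# In practice each vector would be produced by a representation model.
DOCUMENTS = {
    "Taxi stands are outside the arrivals hall": [0.9, 0.1, 0.0],
    "Car rental desks are on level 2": [0.8, 0.2, 0.1],
    "The food court opens at 6 am": [0.1, 0.9, 0.2],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_embedding, documents, top_k=2):
    """Rank documents by similarity to the query and return the top_k texts."""
    scored = [
        (cosine_similarity(query_embedding, embedding), text)
        for text, embedding in documents.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:top_k]]


# Hypothetical embedding for the query "ground transportation at the airport".
query = [0.85, 0.15, 0.05]
results = search(query, DOCUMENTS)
print(results)
```

Notice that the query shares no keywords with the taxi or car rental documents. They rank highest only because their (made-up) embeddings point in nearly the same direction as the query's embedding.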

With Cohere, you can access this type of model via the Embed endpoint. This Python notebook provides an example of a semantic search application where, given a question, the search engine returns other frequently asked questions (FAQ) whose text embeddings are the most similar to the question.

It goes on to show all the questions on a two-dimensional plot, shown in the image below, where the closer two points are on the plot, the more semantically similar they are.

Two examples of similar questions about sharks and Boxing Day

This concept can be applied to a much broader range of use cases, for example:

Retrieval of related and useful documents within an organization
Similar product recommendations
eCommerce product search
Next article recommendations based on reading history
Selecting chatbot responses from an available list

6. Cluster

Clustering is another use case category that leverages text embeddings. The idea is to take a group of documents and make sense of how they are organized and how they are related to each other.

In the previous use case, we visualized a set of documents on a plot to get a sense of how similar to, or different from, each other they are. Clustering uses the same principles, but adds a further step: organizing the documents into groups. This can be done with a clustering algorithm such as k-means, where we specify the number of clusters and the algorithm assigns each document to one of them.
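As a rough illustration of what k-means does with embeddings, here is a bare-bones sketch in plain Python. The two-number "embeddings" are hypothetical toy values, and the function is a simplified teaching version (a fixed number of iterations, random starting centroids), not a replacement for a library implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(dim) / len(points) for dim in zip(*points)]


def k_means(points, k, iterations=20, seed=0):
    """Return (assignments, centroids) after a fixed number of k-means rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return assignments, centroids


# Hypothetical 2-number embeddings: the first three and last three
# documents are deliberately placed in two separate regions.
POINTS = [
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],
]
assignments, centroids = k_means(POINTS, k=2)
print(assignments)
```

The cluster numbers themselves carry no meaning; what matters is which documents end up sharing a number. Naming each cluster, for example with representative keywords, is a separate step.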

This Python notebook, also leveraging the Embed endpoint, goes into detail about how to make sense of three thousand “Ask HN” (Hacker News) posts. First, a text embedding is generated for each post. The posts are then clustered into smaller groups by theme or topic, and each group is labeled with the keywords that represent its topic.

Finally, these posts are visualized on a plot, shown in the image below, where each color represents a topic cluster. Below you can see a few topics emerging, such as life, career, coding, startups, and computer science.

Eight clusters from the top 3,000 Ask HN posts, with each set of keywords representing a topic

This technique can be applied to a number of different tasks, such as:

Organizing customer feedback and requests into topics
Segmenting products into categories based on product descriptions
Turning ESG reports and news into themes
Organizing a huge corpus of company documents
Discovering emerging themes in survey responses

7. Classify

Last but not least is text classification, which is probably the most widely applicable use of NLP today. You can think of it as similar to clustering, with a slight twist.

Clustering is an “unsupervised learning” technique. That’s because we don’t know what the clusters are beforehand: we choose a number of clusters (it can be any number), and the algorithm groups the documents we give it into that many clusters.

Classification, on the other hand, is a “supervised learning” technique, because this time we already know beforehand what those clusters, or more precisely classes, are.

For example, say we have a list of eCommerce customer inquiries, and for routing purposes, we would like to categorize each of them into one of three classes: Shipping, Returns, and Tracking. To make the classifier work, we first need to train it by showing it enough examples of a piece of text, such as “Do you offer same day shipping?”, and its actual class, which in this case is Shipping.
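The train-then-classify flow described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: plain word counts stand in for the text embeddings a real model would produce, every training example other than “Do you offer same day shipping?” is invented for this sketch, and the nearest-class rule used here is one simple classifier among many, not how the Classify endpoint works internally.

```python
import math
from collections import Counter

# Labeled examples. Only the first one comes from the text above;
# the rest are hypothetical and added for illustration.
TRAINING_EXAMPLES = [
    ("Do you offer same day shipping?", "Shipping"),
    ("How much does international shipping cost?", "Shipping"),
    ("Can I return an item I bought last week?", "Returns"),
    ("What is your return policy?", "Returns"),
    ("Where is my order right now?", "Tracking"),
    ("How do I track my package?", "Tracking"),
]


def featurize(text):
    """Turn text into word counts (a crude stand-in for an embedding)."""
    return Counter(text.lower().replace("?", "").split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(examples):
    """Build one summed word-count profile per class from labeled examples."""
    profiles = {}
    for text, label in examples:
        profiles.setdefault(label, Counter()).update(featurize(text))
    return profiles


def classify(text, profiles):
    """Return the class whose profile is most similar to the text."""
    features = featurize(text)
    return max(profiles, key=lambda label: cosine_similarity(features, profiles[label]))


profiles = train(TRAINING_EXAMPLES)
predictions = {
    query: classify(query, profiles)
    for query in [
        "Is overnight shipping available?",
        "What is the return policy for shoes?",
        "Can you track my order?",
    ]
}
print(predictions)
```

Unlike the clustering case, the labels here are fixed in advance by the training examples, so every prediction is one of Shipping, Returns, or Tracking.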

This notebook provides an example of using the Classify endpoint to perform sentiment analysis, a classification task that assigns a piece of text to one of three classes: Positive, Neutral, or Negative.

Some example areas where text classification can be useful include:

Content moderation for toxic comments on online platforms
Intent classification in chatbots
Sentiment analysis on social media activity
eCommerce product categorization
Assigning customer support tickets to the right teams

Conclusion

With these examples, we are only just scratching the surface. The possibilities of using LLMs are limited only by our imagination. This is an exciting time when any developer and team, not just the big players anymore, can tackle some of the toughest NLP challenges by leveraging cutting-edge AI technologies that are made available via simple API calls.