Hello, World! Meet Language AI: Part 2

This article is the Hello, World! of language AI: a hands-on tour of what’s possible, written for developers with little or no background in Artificial Intelligence (AI). We’ll do this by exploring the Hello, World! phrase itself. This is part two of a two-part series.

A hands-on tour of Language AI for developers with little or no background in AI
You can find Part 1 here.

If you are a developer looking to add language processing capabilities to your project, this guide is for you. We’ll give you a quick tour of what’s possible with language AI using Cohere’s Large Language Model (LLM) API. Our guide is essentially your “Hello, World!” of language AI, and since this is all about language processing, we’ll start by exploring the phrase Hello, World! itself!

In Part 1 of our series, we mentioned the three groups of tasks that you will typically work on when dealing with language data:

  1. Generating text
  2. Classifying text
  3. Analyzing text

We covered Generating text in Part 1. In this post, we’ll continue with the other two groups and finally, summarize all three endpoints in a table for easy reference.

This guide assumes little or no background in machine learning or NLP. The examples are shown using Python, although Cohere also provides SDKs for Node.js, Go, and Ruby. We include only key snippets of the code used, but you can run the full version if you wish. It is hosted on Google Colaboratory, so you won’t need to go through additional setup or installation.

2 - Classifying text

While the previous section is about language generation, the next two are about language understanding. LLMs have been pre-trained with a vast amount of training data, allowing them to capture how words are used and how their meaning changes depending on context.

A very common application of this is text classification. Cohere’s Classify endpoint makes it easy to take a list of texts and predict their categories, or classes.

If you have read a little about text classification, you may have come across sentiment analysis, which is the task of classifying the sentiment of a text into classes such as positive, negative, or neutral. This is useful for applications like analyzing social media content or categorizing product feedback. So, why don’t we try that with the phrase Hello, World!

A human can easily tell you that “Hello, World! What a beautiful day” conveys a positive sentiment, but let’s see if our models can do that too. And while we’re at it, let’s try classifying other phrases that you might find on social media.

Prepare input

A typical machine learning model requires many training examples to perform text classification, but with the Classify endpoint, you can get started with as few as five examples per class. The input you need to prepare is as follows:

Examples

  • These are the training examples we give the model to show the output we want it to generate.
  • Each example contains the text itself and the corresponding label, or class.
  • The minimum number of examples required is five per class.
  • You can have as many classes as you need. If you are classifying text into two classes, you need a minimum of 10 examples; with three classes, you need 15, and so on.

Inputs

  • This is the list of text pieces you’d like to classify. We have twelve in total.
The examples and inputs to a classifier

Our sentiment analysis classifier has three classes with five examples each: “Positive” for a positive sentiment, “Negative” for a negative sentiment, and “Neutral” for a neutral sentiment. The code looks as follows.

The examples:

from cohere.classify import Example

examples = [Example("I'm so proud of you", "positive"), 
            Example("What a great time to be alive", "positive"), 
            Example("That's awesome work", "positive"), 
            Example("The service was amazing", "positive"), 
            Example("I love my family", "positive"), 
            Example("They don't care about me", "negative"), 
            Example("I hate this place", "negative"), 
            Example("The most ridiculous thing I've ever heard", "negative"), 
            Example("I am really frustrated", "negative"), 
            Example("This is so unfair", "negative"),
            Example("This made me think", "neutral"), 
            Example("The good old days", "neutral"), 
            Example("What's the difference", "neutral"), 
            Example("You can't ignore this", "neutral"), 
            Example("That's how I see it", "neutral")            
            ]

The inputs:

inputs=["Hello, world! What a beautiful day",
        "It was a great time with great people",
        "Great place to work",
        "That was a wonderful evening",
        "Maybe this is why",
        "Let's start again",
        "That's how I see it",
        "These are all facts",
        "This is the worst thing",
        "I cannot stand this any longer",
        "This is really annoying",
        "I am just plain fed up"
        ]

Get output

With the Classify endpoint, setting up the model is quite straightforward. The main thing to do is to define the model type. For our example, we’ll use the default, which is `medium`.

Putting everything together with the Classify endpoint looks like the following:

def classify_text(inputs,examples):
  response = co.classify(
    model='medium',
    inputs=inputs,
    examples=examples)
  
  return response.classifications

Together with the predicted class, the endpoint also returns a confidence value for the prediction (between 0 and 1). These confidence values are split among the classes, in this case three, and add up to a total of 1. The classifier selects the class with the highest confidence value as the predicted class. A high confidence value for the predicted class therefore indicates that the model is very confident in its prediction, and vice versa.
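To display the results in a readable format, you can loop over the inputs and their classifications with a small helper like this sketch (it assumes each classification returned by the endpoint exposes `prediction` and `confidence` attributes, as in the SDK version used here):

```python
# A minimal sketch of printing each prediction alongside its input text,
# assuming each classification has `prediction` and `confidence` attributes
def print_classifications(inputs, classifications):
    for text, result in zip(inputs, classifications):
        print(f"Input: {text}")
        print(f"Prediction: {result.prediction}")
        print(f"Confidence: {result.confidence:.2f}")
        print("-" * 10)
```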

Here’s a sample output returned:

Input: Hello, world! What a beautiful day
Prediction: positive
Confidence: 0.76
----------
Input: It was a great time with great people
Prediction: positive
Confidence: 0.85
----------
Input: Great place to work
Prediction: positive
Confidence: 0.75
----------
Input: That was a wonderful evening
Prediction: positive
Confidence: 0.82
----------
Input: Maybe this is why
Prediction: neutral
Confidence: 0.62
----------
Input: Let's start again
Prediction: neutral
Confidence: 0.42
----------
Input: That's how I see it
Prediction: neutral
Confidence: 0.69
----------
Input: These are all facts
Prediction: neutral
Confidence: 0.77
----------
Input: This is the worst thing
Prediction: negative
Confidence: 0.80
----------
Input: I cannot stand this any longer
Prediction: negative
Confidence: 0.70
----------
Input: This is really annoying
Prediction: negative
Confidence: 0.64
----------
Input: I am just plain fed up
Prediction: negative
Confidence: 0.79
----------

The model returned a positive sentiment for “Hello, world! What a beautiful day,” which is what we would expect! And the predictions for all the rest look spot-on too.

That was one example, but you can classify any kind of text into any number of possible classes according to your needs. We dive deeper into text classification and its use cases in a recent blog post.

As your task gets more complex, you will likely need to bring in additional training data and finetune a model. This will ensure that the model can capture the nuances specific to your task and realize performance gains. You can read more about finetuning representation models in the docs.

If you’d like to learn more about text classification, here are some additional resources:

3 - Analyzing text

The next area in language understanding is a broad one, which is analyzing text. Cohere’s Embed endpoint takes a piece of text and turns it into a vector embedding. Embeddings represent text in the form of numbers that capture its meaning and context.

This gives you the ability to turn unstructured text data into a structured form, opening up ways to analyze it and extract insights. Let’s take a look at a couple of examples.

The first example is semantic search. There was a time when web search engines relied on keywords to match your search queries to the most relevant sites. But these days, you would be one frustrated user if that were the kind of experience you got, because modern search engines capture a semantic understanding of what you are looking for, beyond just keyword matching.

Let’s build a simple semantic search engine. Here we have a list of 50 top web search terms about Hello, World! taken from a keyword tool. The following are a few examples:

df = pd.read_csv("hello-world-kw.csv", names=["search_term"])
df.head()

   search_term
0  how to print hello world in python
1  what is hello world
2  how do you write hello world in an alert box
3  how to print hello world in java
4  how to write hello world in eclipse

Let’s pretend that these search terms make up an FAQ database. Our job now, given a new query, is to ensure that the search engine returns the most similar FAQs.

The Embed endpoint is quite straightforward to use:

  • Prepare input — The input is the list of text you want to embed.
  • Define model settings — The model setting is just one: the model type. But it does make a difference to your task, because bigger models generate embeddings with more dimensions. We’ll use the default, which is `large`.
  • Generate output — The output is the corresponding embeddings for the input text.

The code looks like this:

def embed_text(texts):
  output = co.embed(
                model="large",
                texts=texts)
  embedding = output.embeddings

  return embedding

Now, given the FAQs, let’s try the search term “what is the history of hello world.” This is a search term whose keyword (i.e., “history”) doesn’t exist at all in the FAQ. Let’s see how the search fares.

First we get the embeddings of all the FAQs:

df["search_term_embeds"] = embed_text(df["search_term"].tolist())
embeds = np.array(df["search_term_embeds"].tolist())

And then get the embeddings of the new query:

new_query = "what is the history of hello world"
new_query_embeds = embed_text([new_query])[0]

Next, we compare the similarity between the embeddings of the new query and each of the embeddings of the FAQs. There are many ways to do this, and one option is cosine similarity. We’ll use scikit-learn to perform this.

The steps are:

  1. Calculate similarity between the new query with each of the FAQs
  2. Sort the FAQs by descending order in similarity (the most similar first)
  3. Show the top FAQs with the highest similarity to the new query

The code is shown below:

from sklearn.metrics.pairwise import cosine_similarity

def get_similarity(target,candidates):
  # Turn list into array
  candidates = np.array(candidates)
  target = np.expand_dims(np.array(target),axis=0)

  # Calculate cosine similarity
  sim = cosine_similarity(target,candidates)
  sim = np.squeeze(sim).tolist()

  # Sort by descending order in similarity
  sim = list(enumerate(sim))
  sim = sorted(sim, key=lambda x:x[1], reverse=True)

  # Return similarity scores
  return sim

similarity = get_similarity(new_query_embeds,embeds)

# Show the top 5 FAQs with the highest similarity to the new query
for idx,score in similarity[:5]:
  print(f"Similarity: {score:.2f};", df.iloc[idx]["search_term"])

And the output we get is:

New query:
what is the history of hello world 

Similar queries:
Similarity: 0.89; how did hello world originate
Similarity: 0.87; where did hello world come from
Similarity: 0.82; what is hello world
Similarity: 0.73; why is hello world so famous
Similarity: 0.70; why hello world

It works! Notice that the top terms are indeed the closest in meaning to the search term (about the history and origin of Hello, World!) even though they use different kinds of words.

Semantic exploration

Moving on to the second example, we take the same idea that we saw in semantic search and look at it more broadly: exploring huge volumes of text and analyzing their semantic relationships.

Let’s keep our example simple and use the same 50 top web search terms about Hello, World! It’s by no means a huge volume, but it’s good enough to illustrate the idea.

We used the `large` model to generate the embeddings. At the time of writing, this model generates embeddings of 4,096 dimensions. This means that for every piece of text passed to the Embed endpoint, a sequence of 4,096 numbers is generated, each representing a piece of information about the meaning contained in that text.

To understand what these numbers represent, there are techniques we can use to compress the embeddings down to just two dimensions while retaining as much information as possible. And once we can get it down to two dimensions, we can plot these embeddings on a 2D plot.

We can make use of the UMAP technique to do this. The code is as follows:

import umap

# Compress the embeddings to 2 dimensions (UMAP’s default reduction is to 2 dimensions)
reducer = umap.UMAP(n_neighbors=49) 
umap_embeds = reducer.fit_transform(embeds)

# Store the compressed embeddings in the dataframe/table
df['x'] = umap_embeds[:,0]
df['y'] = umap_embeds[:,1]

You can then use any plotting library to visualize these compressed embeddings on a 2D plot.
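As a sketch, here is one way to draw that plot with matplotlib (it assumes `df` contains the `x`, `y`, and `search_term` columns computed in the steps above):

```python
import matplotlib.pyplot as plt

# Scatter plot of the compressed embeddings, with each point labeled by
# its search term so that semantically similar terms are easy to spot
def plot_embeddings(df):
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.scatter(df["x"], df["y"], s=20)
    for _, row in df.iterrows():
        ax.annotate(row["search_term"], (row["x"], row["y"]), fontsize=8)
    ax.set_title("2D embeddings of the search terms")
    return fig
```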

Here is the plot showing all 50 data points:

The 2D embeddings plot of all 50 data points

And here are a few zoomed-in plots, clearly showing text of similar meaning positioned close together.

Example #1: Hello, World! In Python

A zoomed-in 2D embeddings plot around the topic of Hello, World! in Python

Example #2: Origins of Hello, World!

A zoomed-in 2D embeddings plot around the topic of Origins of Hello, World!

These kinds of insights enable various downstream analyses and applications, such as topic modeling, by clustering documents into groups. In other words, text embeddings allow us to take a huge corpus of unstructured text and turn it into a structured form, making it possible to objectively compare, dissect, and derive insights from all that text.
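For instance, a clustering step could look like this rough sketch using scikit-learn’s KMeans (the `embeds` array is the one computed earlier; the cluster count is an assumption you would tune for your own data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Group embeddings into clusters; texts about the same topic tend to
# end up with the same cluster label
def cluster_texts(embeds, n_clusters=4):
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
    return kmeans.fit_predict(np.array(embeds))
```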

If you’d like to learn more about text analysis, here are some additional resources:

Additionally, as you start working with larger input sizes, it’s worth reading the API reference of the three endpoints we have covered. For example, endpoints that accept a batch of inputs (Embed and Classify) have a maximum number of inputs per call. There are also rate limits to consider, which depend on whether you have limited or full access.
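When your input list exceeds the per-call maximum, you can split it into chunks and call the endpoint once per chunk. Here is a minimal sketch of such a helper (the limit used here is a hypothetical placeholder; check the API reference for the actual number):

```python
# Split a list of inputs into chunks no larger than max_batch,
# where max_batch is a placeholder for the endpoint's real limit
def batch(items, max_batch):
    for i in range(0, len(items), max_batch):
        yield items[i:i + max_batch]

# Usage sketch: call the endpoint once per chunk and concatenate results
# all_embeds = []
# for chunk in batch(texts, max_batch=96):
#     all_embeds.extend(embed_text(chunk))
```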

Conclusion

Well, that wasn’t really as quick as promised! But hopefully you are as excited as I am to dive further into language AI and explore ways to unlock new kinds of applications. The whole category is relatively new, and the boundary of what’s possible is continuously being pushed. I’m excited to see what you will build with Cohere!

The following table summarizes the three endpoints that we have covered.

|                              | Generate                                               | Classify                                | Embed                               |
|------------------------------|--------------------------------------------------------|-----------------------------------------|-------------------------------------|
| Input                        | Prompt                                                 | Examples and text to classify           | Text to embed                       |
| Commonly used model settings | Model type, temperature, maximum tokens, stop sequence | Model type                              | Model type (affects embedding size) |
| Output                       | Text generations and token likelihoods                 | Predicted classes and confidence values | Embeddings                          |

To try these endpoints out for yourself, sign up for a Cohere account now.