Generative AI with Cohere: Part 3 - The Generate Endpoint

In Part 3, we move from Playground to code and begin our exploration of the Generate endpoint.


In Parts 1 and 2 of our series on generative AI, we looked at how to use the Playground to experiment with ideas in text generation. Now let’s say we have found an idea we want to build on. What’s next?

In Part 3, we will begin our exploration of the Cohere Generate endpoint, one of the endpoints available from the Cohere API. We’ll move from the Playground to code, in this case via the Python SDK, and learn how to use the endpoint.

This article comes with a Google Colaboratory notebook for reference.

Here’s a quick look at how to generate a piece of text via the endpoint. It’s actually quite straightforward.

We enter a prompt:

response = co.generate(
  prompt='Write me a haiku about software and magic.')

And we get a response:

Magic of software
creating worlds from code
Imagination unleashed

Of course, there are more options available for you to define your call in a more precise way. In this article, we will cover that and more.

But before going further, let’s take a quick step back and reflect on what this all means.

Looking at the code snippet above, it’s easy to miss what’s at play here. There are many options out there for leveraging large language models (LLMs), from building your own models to deploying available pre-trained models. But with a managed LLM platform like Cohere, what you get is a simple interface to language technology via easy-to-use endpoints.

What this means is that you are free to focus solely on building applications instead of having to worry about getting the underlying technology to work. You don’t have to worry about the complexities, resources, and expertise required to build, train, deploy, and maintain these AI models.

Managed LLMs like Cohere's let you focus on building your applications

Now let’s dive deeper into the Generate endpoint.

Overview of the Generate Endpoint

The Cohere platform can be accessed via the Playground, SDK, or the CLI tool. In this article, we’ll learn how to work with the Python SDK.

With the API, you can access a number of endpoints, such as Generate, Embed, and Classify. Different endpoints are used for different purposes and produce different outputs. For example, you would use the Classify endpoint when you want to perform text classification.

Our focus for this series is, of course, text generation, so we’ll work with the Generate endpoint.

Setting Up

First, if you haven’t done so already, you can register for a Cohere account and get a trial API key, which is free to use. There is no credit or time limit associated with a trial key; calls are rate-limited to 100 calls per minute, which is typically enough for an experimental project. Read more about using trial keys in our documentation.

Next, you need to install the Python SDK. You can install the package with this command:

pip install cohere

Making the First API Call

Now, to get a feel of what the Generate endpoint does, let’s try it with the code snippet we saw earlier.

Import the Cohere package, define the Cohere client with the API key, and add the text generation code, like so.

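import cohere
co = cohere.Client('your_api_key')

response = co.generate(
  prompt='Write me a haiku about software and magic.')

# Print the text of the first (and, by default, only) generation
print(response.generations[0].text)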

Run it on a Python interpreter and you get the response:

Magic of software
creating worlds from code
Imagination unleashed

And that’s our first API call!
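If you plan to share your code or notebook, you may prefer not to paste the API key directly into the script. Here is a minimal sketch of one way to do that, assuming the key is stored in an environment variable. The variable name COHERE_API_KEY and the helper load_api_key are our own choices for illustration, not something the SDK requires.

```python
import os


def load_api_key(env_var="COHERE_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (requires the cohere package and a real key):
#   import cohere
#   co = cohere.Client(load_api_key())
```

The rest of the article works the same way whichever approach you use to create the client.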

Understanding the Response

Let’s understand what we get from the API response.

The Generate endpoint accepts a text input, that is the prompt, and outputs a Generation object.

The Generate endpoint's input (prompt) and output (response)

Here’s an example response, with the text generated:

[cohere.Generation {
	id: 8f7541b8-1e85-4784-a848-7aaad74f7bbe
	text: Magic of software
creating worlds from code
Imagination unleashed
	likelihood: None
	token_likelihoods: None
}]

The response contains:

  • id — A unique ID for the generation
  • text — The generated text, given the input
  • likelihood — The average likelihood of all generated tokens (one English word roughly translates to 1-2 tokens)
  • token_likelihoods — The likelihood of each token

If you want to keep it simple, you just need to get the text output and you’re good to go, like what we have done so far: response.generations[0].text.

Here, we specified the index of the generation (index 0, representing the first item) because the endpoint can generate more than one output in one go. We’ll cover how to get multiple generations later in this article, but here, we have not specified any number of outputs, so there is only one output by default.
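To make the indexing concrete, here is a small sketch that collects the text of every generation in a response. The helper generation_texts is hypothetical, and the stand-in response below only mimics the shape described above (a generations list whose items have a text attribute) so that the snippet runs without an API call.

```python
from types import SimpleNamespace


def generation_texts(response):
    """Return the text of every generation in a Generate response."""
    return [generation.text for generation in response.generations]


# Stand-in for a real response, used only so the sketch runs offline.
fake_response = SimpleNamespace(
    generations=[
        SimpleNamespace(text="first output"),
        SimpleNamespace(text="second output"),
    ]
)

texts = generation_texts(fake_response)
print(texts[0])  # same item as fake_response.generations[0].text
```

With a real response from co.generate, you would pass that response to the helper instead of the stand-in.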

In some scenarios, we may want to evaluate the quality of our output. This is where we can use the likelihood and token_likelihoods outputs.

Note: If you are not sure about what likelihood means, you can read about it in Part 1.

Let’s see them in action. Say our text input is, “It’s great to” and the generated text is, “be out here”.

These three output words map directly to three individual tokens, which we can check using the Tokenize endpoint.

co.tokenize("be out here")


cohere.Tokens {
	tokens: [894, 558, 1095]
	token_strings: ['be', ' out', ' here']
}

So, back to the Generate endpoint output, let’s say we get the following likelihood values for each token:

  • be: -2.3
  • out: -0.3
  • here: -1.2
Each generated token has an associated likelihood value

So, the average likelihood of all generated tokens, in this case, is the mean of the three values, (-2.3 - 0.3 - 1.2) / 3, which is approximately -1.3.
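The same arithmetic can be checked in a few lines of Python. This sketch uses only the three illustrative values above, not a real API response.

```python
# Illustrative per-token likelihoods from the example above
token_likelihoods = [("be", -2.3), (" out", -0.3), (" here", -1.2)]

# The average likelihood is the mean of the per-token values
average_likelihood = sum(value for _, value in token_likelihoods) / len(token_likelihoods)

print(round(average_likelihood, 1))  # prints -1.3
```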

Note that you need to set the return_likelihoods parameter to either GENERATION (output only) or ALL (output and prompt); otherwise, you will get None for the likelihood and token_likelihoods outputs.

response = co.generate(
  prompt='Write me a haiku about software and magic.',
  return_likelihoods='GENERATION')


[cohere.Generation {
	id: fa9d9499-1418-44ab-8349-6aa57e5ce70d
	text: Software is magic,
It can make you feel,
Like you're in a different world
	likelihood: -0.80018723
	token_likelihoods: [TokenLikelihood(token='\n', likelihood=-0.09764614), TokenLikelihood(token='Software', likelihood=-0.40727282), TokenLikelihood(token=' is', likelihood=-0.17436308), TokenLikelihood(token=' magic', likelihood=-0.06731783), TokenLikelihood(token=',', likelihood=-2.0117483), TokenLikelihood(token='\n', likelihood=-0.19473554), TokenLikelihood(token='It', ...]
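One way to use the per-token values is to look for the token the model was least sure about. The sketch below copies the six token likelihoods visible in the truncated output above into plain tuples (an assumption made so it runs without an API call) and picks the lowest one.

```python
# The six (token, likelihood) pairs visible in the truncated output above
token_likelihoods = [
    ("\n", -0.09764614),
    ("Software", -0.40727282),
    (" is", -0.17436308),
    (" magic", -0.06731783),
    (",", -2.0117483),
    ("\n", -0.19473554),
]

# The token with the lowest likelihood is the one the model found least expected
least_likely = min(token_likelihoods, key=lambda pair: pair[1])

print(least_likely)
```

In this sample, the comma at the end of the first line stands out as the least likely of the tokens shown.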

Turning Playground Prompts into Code

Recall that in Part 2, we went through a number of prompt ideas, where each comes with its own preset link. You might have come up with some ideas and saved them as presets. The question is, how do you turn those into code?

It’s quite easy actually. If you go back to the Playground and open up a preset, you will see a View Code button. Click on that button and you will get the code version of that preset in the language of your choice, in this case, Python.

Let’s try that with the email writing preset.

Translating Playground prompts into code

You can simply copy and paste this code into the Python interpreter to get the response.

Selecting the Model

We have only used one model so far: command-xlarge-nightly. But as we discussed in Part 1, you can prompt Cohere’s text generation model in two ways: by instruction and by example. So, the first thing that you want to define when calling the endpoint is the model type, depending on how you are constructing your prompt. Here are the available models at the time of writing:

  • Prompting by instruction: command-xlarge-nightly, command-medium-nightly
  • Prompting by example: xlarge, medium

The sizes implied in the model names represent the parameter size of the models. So, which one do you choose? It depends on your use case, but as a rule of thumb, smaller models are faster, while larger models are generally more fluent and coherent.
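If you switch between prompting styles often, a small lookup can keep the model choice in one place. This is only a sketch: the helper pick_model and the style and size labels are hypothetical, and the model names are the ones listed above at the time of writing, so they may have changed since.

```python
# Model names as listed in this article at the time of writing
MODELS = {
    "instruction": {
        "xlarge": "command-xlarge-nightly",
        "medium": "command-medium-nightly",
    },
    "example": {
        "xlarge": "xlarge",
        "medium": "medium",
    },
}


def pick_model(style, size="xlarge"):
    """Return a model name for a prompting style and size (hypothetical helper)."""
    try:
        return MODELS[style][size]
    except KeyError:
        raise ValueError(f"Unknown style/size combination: {style}/{size}") from None


print(pick_model("instruction", "medium"))  # prints command-medium-nightly
```

The returned name can then be passed as the model argument when calling the endpoint.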

Understanding the Other Parameters

While defining just the model is enough to get started, in reality, you will likely need to specify other parameters as well, so the model’s output will match your intended output as closely as possible.

For example, if you look at the second example (email writing), the code export contains many more parameters we didn’t define in the first example. We covered some of these parameters in Part 1 in the context of the Playground, so we’ll not cover them again here.

Having said that, we didn’t cover all of the available parameters in Part 1. So, now is probably a good time to visit the Generate endpoint docs and learn more about all the available parameters, for example, their default values, their value ranges, and more.

Experimenting with a Prompt

If you have a prompt idea that you really want to take to the next step, you may want to experiment extensively with it, for example, by trying out different parameter combinations and finding that ideal combination that fits your needs.

You can do that with the Playground, but since you’ll have to manually adjust the settings after each generation, it’s probably not going to be very efficient. That’s where the SDK comes in.

In the following example, we’ll take a prompt (the haiku prompt from earlier) and create a small function to automatically iterate over different text generations. This way we can evaluate the generations much faster.

In particular, we’re going to use the following parameters:

  • temperature: we’ll iterate over a range of values to arrive at a value that fits our use case (we covered what temperature means in Part 1)
  • num_generations: we can use this parameter to get several generations in one go instead of one (three in this example)
  • return_likelihoods: we’ll set this to GENERATION and use this to evaluate the randomness of our text output
Trying out different temperature combinations

And here’s what the code looks like:

# Function to call the Generate endpoint
def generate_text(prompt,temperature,num_gens):
  response = co.generate(
    prompt=prompt,
    temperature=temperature,
    num_generations = num_gens,
    return_likelihoods='GENERATION')
  return response

# Define the prompt
prompt='Write me a haiku about software and magic.'

# Define the range of temperature values and num_generations
temperatures = [x / 10.0 for x in range(0, 60, 10)]
num_gens = 3

# Iterate over the range of temperature values
print(f"Temperature range: {temperatures}")
for temperature in temperatures:
  print(f'Temperature: {temperature}\n')
  response = generate_text(prompt,temperature,num_gens)
  # Loop over however many generations came back
  for i, generation in enumerate(response.generations):
    print(f'Generation #{i+1}')
    print(f'Text: {generation.text}\n')
    print(f'Likelihood: {generation.likelihood}\n')

You can run the code in the notebook to get the full generation. Here, we are showing only a few example outputs, as the full generation is quite long.

Temperature range: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
Temperature: 0.0

Generation #1
Software is magic
That makes the world go round
Without it, we're all lost


Temperature: 5.0

Generation #1
Magic is coding,
Programming is sorcery,
Every day is a new spell.
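Reading through every generation by eye gets tedious as the range grows, so it can help to collect the results and summarize them. Below is a minimal sketch of that idea. The helpers sweep_temperatures and average_likelihood_by_temperature are hypothetical, and the stand-in generate function returns made-up values purely so the snippet runs without an API key; in practice you would pass in a function like generate_text instead.

```python
from types import SimpleNamespace


def sweep_temperatures(generate_fn, prompt, temperatures, num_gens):
    """Call generate_fn once per temperature and flatten the results into rows."""
    rows = []
    for temperature in temperatures:
        response = generate_fn(prompt, temperature, num_gens)
        for generation in response.generations:
            rows.append({
                "temperature": temperature,
                "text": generation.text,
                "likelihood": generation.likelihood,
            })
    return rows


def average_likelihood_by_temperature(rows):
    """Return the mean generation likelihood for each temperature."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["temperature"], []).append(row["likelihood"])
    return {t: sum(values) / len(values) for t, values in grouped.items()}


# Stand-in for a real API call; the texts and likelihoods are placeholders.
def fake_generate(prompt, temperature, num_gens):
    generations = [
        SimpleNamespace(text=f"sample {i}", likelihood=-temperature - i)
        for i in range(num_gens)
    ]
    return SimpleNamespace(generations=generations)


rows = sweep_temperatures(fake_generate, "a prompt", [0.0, 1.0], 2)
summary = average_likelihood_by_temperature(rows)
print(summary)
```

With real responses, a lower average likelihood at a given temperature suggests more random output, which is one signal you can weigh when picking a value.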

Final Thoughts

In this blog post, we made our first foray into the Generate endpoint using the Python SDK. We became familiar with how to get text generations via the API, and we created a simple function to help us experiment with a prompt idea.

We have only covered the basics, though. In upcoming articles, we’ll look at how we can integrate the endpoint into proper applications, such as adding user interfaces, working with other endpoints in tandem, and more.

In the meantime, get your free API key to start building with generative AI.