Generative AI with Cohere: Part 4 - Creating Custom Models
In Part 4, we look at how you can create your own models to excel at a specific task without requiring machine learning expertise.

In Part 3 of our blog series on generative AI, we explored the Generate endpoint and used it to experiment with prompts. We tried a range of parameter values to understand the combinations that best serve our goals. The next question is then: can we make the model perform even better at a given task?
The answer is yes, and we can do that via custom models. In Part 4, we’ll look at how we can take the baseline generative model and train our own model on top of it, all without needing machine learning skills.
This option provides a nice balance — while the Cohere API makes it easy for you to interface with large language models (LLMs) minus all the complexities, you still have the ability to customize a model to your specific task.

In this article, we will cover the following:
- What custom models are and when to use them
- How to train a custom model: preparing your dataset and initiating training
- How to evaluate a custom model: checking metrics, making sample calls, and measuring likelihood
But first, let’s see why, and when, you might want to create your own custom models.
A generative model is already trained on a huge volume of data, making it great at capturing patterns of information on a broad scale. But sometimes, your task contains nuances that are highly specific to that scenario. Here are some examples:
- Specific styles: Generating text with a certain style or voice, e.g., when generating product descriptions that represent your company’s brand
- Specific formats: Parsing information from a unique format or structure, e.g., when extracting information from specific types of invoices, resumes, or contracts
- Specific domains: Dealing with text in highly specialized domains such as medical, scientific, or legal, e.g., when summarizing text dense with technical information
- Specific knowledge: Generating text that closely follows a certain theme, e.g., when generating playing cards that are playable, like what we did with Magic the Gathering
In these cases, with enough examples in the prompt, you might still be able to make the generation work. But there is an element of unpredictability — something you want to eliminate when looking to deploy your application beyond a basic demo.
In these kinds of scenarios, you may want to experiment with custom models and compare how they perform against the baseline model, and then decide on the best option.
What are Custom Models?
To understand what a custom model is and how it works, it’s good to know a couple of terms commonly used in LLMs: pre-training and finetuning.
Pre-training is the process of training a language model on a large amount of text data to learn the general patterns and structures of language. By doing this, the model learns how to generate coherent text.
Finetuning, on the other hand, involves taking a pre-trained language model and training it on a smaller, more specific dataset to adapt it to a particular task. Finetuning allows the model to be customized for a specific use case, which can result in better performance on that task.
If you think of this as the process of building a house, pre-training can be compared to the process of building its foundation and basic building blocks. Just as a strong foundation is necessary for a house to stand, pre-training is necessary to build a solid foundation for a language model. Finetuning, on the other hand, focuses on customizing that house with specific features, which can differ based on the exact needs and preferences of a person.

At Cohere, we refer to our pre-trained models as baseline models (at the time of writing, these are `xlarge`, `medium`, `command-xlarge-nightly`, and `command-medium-nightly`). We refer to finetuned models as custom models.
Training a Custom Model
To train a custom model, there is only one thing we need to prepare, and that is the training dataset. The training dataset contains examples of what we want the model to output, given an input prompt. The format is the same as the "prompting by example" format we saw in Part 1.
The difference here is that we will put all these examples in a `txt` file, and, most importantly, we need to include a lot of them.
How many examples are needed? There’s no one-size-fits-all answer to that question as it depends on the type and complexity of your task. You can get started with as few as 32 examples (the minimum the platform accepts) but for the best performance, try experimenting in the region of hundreds or thousands of examples, if you have access to the data needed.
Note that finetuning with the Cohere Platform is free, so you can create multiple custom models without worry. Calling a custom model is priced differently from calling a baseline model, but with the free developer trial key, this is not something you need to think about either.
This article comes with a Google Colaboratory notebook for reference.
Prepare Your Dataset
Let’s start with the problem we want to solve. Here, our task is to take a request coming from a human and rephrase it into the most accurate utterance that an AI virtual assistant should use.
We’ll use the dataset from the paper Sound Natural: Content Rephrasing in Dialog Systems (Einolghozati et al.). We’ll take the `train.tsv` portion containing 2,243 examples.
The dataset table contains a number of columns, but to keep our example simple, we’ll need just the first (the human’s request) and last (the virtual assistant’s utterance) columns. Here is one example:
First column (the request):
Send message to supervisor that I am sick and will not be in today
Last column (the utterance):
I am sick and will not be in today
The dataset comes in `tsv` format, so we’ll need to do some pre-processing steps to get it to the `txt` format that the platform accepts.
Inside the `txt` file, we will need to format the text the same way we did in Part 1, where each example is followed by a separator, which can be any sequence of characters you choose. In this case, we use `--`. We’ll need to specify this separator when initiating the training later.
The text file (you can get it here) now looks as follows, with 2,243 examples altogether.
Request: Ask my aunt if she can go to the JDRF Walk with me October 6th
Utterance: can you go to the jdrf walk with me october 6th
--
Request: Ask Eliza what should I bring to the wedding tomorrow
Utterance: what should I bring to the wedding tomorrow
--
Request: Send message to supervisor that I am sick and will not be in today
Utterance: I am sick and will not be in today
--
...
Note that here, I have added my own prefixes of `Request` and `Utterance` to each example. The choice of words is a personal preference; there is no right or wrong approach. But as for the format, having a prefix followed by a colon can definitely help the model capture the pattern that we want it to generate.
Initiate Training
We can now begin training our own custom model. We can do this easily via the dashboard, and there is a comprehensive step-by-step guide in our documentation.
Because of this, we’ll not cover those steps in this article. We’ll only look at a couple of screenshots that show the steps with our dataset applied.
Here is the step where we upload the text file and define the separator:

And here is the preview of the training dataset and its count:

The training will take some time, and once it’s done, you will receive an email mentioning that it is deployed and ready. If you’ve reached this point, congratulations!
Evaluating a Custom Model
Using a custom model is as simple as substituting the baseline model name with the custom model’s ID (replace the ID shown below with your own).
import cohere

co = cohere.Client('YOUR_API_KEY')  # paste your trial API key here
response = co.generate(
    model='a853218c-30ef-4b3d-83be-b037d669029b-ft',  # replace with your model ID
    prompt='Request: Send a message to Alison to ask if she can pick me up tonight to go to the concert together\nUtterance:')
print(response.generations[0].text)
We can get the model ID from the dashboard by clicking on the `Model ID` button highlighted below.

But of course, we need to know if this model is performing better than the baseline in the first place. For this, there are three ways we can evaluate our model.
Check Metrics
When you go to the custom model’s page, you can see two metrics shown: `Accuracy` and `Loss`. Here are the definitions of what they mean, taken from our documentation.
- Accuracy — This measures how many predictions the model made correctly out of all the predictions in an evaluation. To evaluate Generate models for accuracy, we ask them to predict certain words in the user-uploaded data.
- Loss — This measures how bad or wrong a prediction is. Accuracy may tell you how many predictions the model got wrong, but it will not describe how incorrect the wrong predictions are. If every prediction is perfect, the loss will be 0.
An indication of a good model is where the `Accuracy` increases and the `Loss` decreases. And in our case, where `Accuracy` is 71.77% and `Loss` is 1.38, the model is performing well!

Make Sample Calls
The previous metrics are a good first indication of the model’s performance, but it’s good to make some qualitative assessments as well. So what we can do now is to make a few calls to both the baseline and custom models, and compare the results. For this, we can reuse the prompt experimentation code we created in Part 3.
Let’s take an example. We’ll run the following request through a range of `temperature` values and see what the utterances look like. This comes from the test dataset, which the model has not seen before.
Send a message to Alison to ask if she can pick me up tonight to go to the concert together
And the ground truth utterance for this, provided in the test dataset, is the following.
can you pick me up tonight to go to the concert together
For the baseline model, we use the `xlarge` model and create the following prompt.
Request: Ask my aunt if she can go to the JDRF Walk with me October 6th
Utterance: can you go to the jdrf walk with me october 6th
--
Request: Ask Eliza what should I bring to the wedding tomorrow
Utterance: what should I bring to the wedding tomorrow
--
Request: Send message to supervisor that I am sick and will not be in today
Utterance: I am sick and will not be in today
--
Request: Send a message to Alison to ask if she can pick me up tonight to go to the concert together
Utterance:
Note: At the time of writing, you cannot yet create a custom model on top of the `command` models, hence we change the model type to `xlarge` for a one-to-one comparison with the custom model.
And as for the custom model, we use the custom model ID and create the following prompt.
Request: Send a message to Alison to ask if she can pick me up tonight to go to the concert together
Utterance:
Note: Since the model is already trained on our custom data, the prompt doesn’t need to contain examples as it did with the baseline model. We can go straight to the immediate task we need the model to perform.
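For reference, here is a minimal sketch of what such a loop could look like for the custom model; the `max_tokens` and `stop_sequences` values are illustrative assumptions, and the exact helper code from Part 3 may differ. Running the baseline model works the same way, just with `model='xlarge'` and the few-shot prompt shown earlier.

```python
import cohere

co = cohere.Client('YOUR_API_KEY')  # paste your trial API key here

# The custom model's prompt: just the immediate task, no examples needed
prompt = ('Request: Send a message to Alison to ask if she can pick me up '
          'tonight to go to the concert together\nUtterance:')

for temperature in [0.0, 0.5, 1.0]:
    print('----------')
    print(f'Temperature: {temperature}')
    print('----------')
    response = co.generate(
        model='a853218c-30ef-4b3d-83be-b037d669029b-ft',  # replace with your model ID
        prompt=prompt,
        max_tokens=50,              # illustrative value
        temperature=temperature,
        num_generations=3,          # three generations per temperature value
        stop_sequences=['--'])      # stop generating at the separator
    for idx, generation in enumerate(response.generations):
        print(f'Generation #{idx + 1}')
        print(f'Text: {generation.text.strip()}')
```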
We make API calls for the two models through three temperature values (0.0, 0.5, and 1.0) and three generations each, and here are the responses.
Baseline model:
----------
Temperature: 0.0
----------
Generation #1
Text: can you pick me up tonight to go to the concert together
Generation #2
Text: can you pick me up tonight to go to the concert together
Generation #3
Text: can you pick me up tonight to go to the concert together
----------
Temperature: 0.5
----------
Generation #1
Text: can you pick me up tonight to go to the concert together
Generation #2
Text: can you pick me up tonight to go to the concert together
Generation #3
Text: Send a message to Alison to ask if she can pick me up tonight to go to the concert together
----------
Temperature: 1.0
----------
Generation #1
Text: Can you pick me up tonight to go to the concert together
Generation #2
Text: can you pick me up tonight to go to the concert together
Generation #3
Text: Alison can you pick me up tonight to go to the concert together
Custom model:
----------
Temperature: 0.0
----------
Generation #1
Text: can you pick me up tonight to go to the concert together
Generation #2
Text: can you pick me up tonight to go to the concert together
Generation #3
Text: can you pick me up tonight to go to the concert together
----------
Temperature: 0.5
----------
Generation #1
Text: can you pick me up tonight to go to the concert together
Generation #2
Text: can you pick me up tonight to go to the concert together
Generation #3
Text: can you pick me up tonight to go to the concert together
----------
Temperature: 1.0
----------
Generation #1
Text: can you pick me up tonight to go to the concert together
Generation #2
Text: can you pick me up tonight to go to the concert together
Generation #3
Text: can you pick me up tonight to go to the concert together
With the baseline model, the output is correct at lower temperature values, but as we increase the temperature, it becomes inconsistent. With the custom model, on the other hand, the response stays correct even at higher temperatures.
This indicates that the custom model offers greater predictability and can produce quality outputs consistently, something that’s much needed when deploying applications in the real world.
Measure Likelihood
The third way to compare models is to use likelihood to measure how surprising a string of text is. This is detailed further in the documentation.
In a nutshell, this time we are not evaluating likelihood on a piece of generated text. Rather, we evaluate it on an existing piece of text and measure how surprising the model finds the sequence of tokens within it. In our case, this is a complete example (or a number of them), that is, the request and the expected utterance, such as this one:
Request: Send a message to Alison to ask if she can pick me up tonight to go to the concert together
Utterance: can you pick me up tonight to go to the concert together
A higher likelihood (that is, a less negative value) is better: it indicates that the model finds this text sequence less surprising, and hence is more likely to generate it.
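As a rough sketch of how such a number might be computed (this is an assumption on my part; the documentation linked above describes the recommended approach), one option is to pass the text as the prompt and average the per-token log-likelihoods returned by the Generate endpoint’s `return_likelihoods='ALL'` option:

```python
import cohere

co = cohere.Client('YOUR_API_KEY')  # paste your trial API key here

def average_likelihood(model: str, text: str) -> float:
    """Average log-likelihood the model assigns to the tokens of `text`."""
    response = co.generate(
        model=model,
        prompt=text,
        max_tokens=1,                 # we only care about the prompt tokens
        return_likelihoods='ALL')     # include likelihoods for the prompt tokens
    token_likelihoods = response.generations[0].token_likelihoods
    # The very first token has no preceding context, so its likelihood may be None
    values = [t.likelihood for t in token_likelihoods if t.likelihood is not None]
    return sum(values) / len(values)

text = ('Request: Send a message to Alison to ask if she can pick me up '
        'tonight to go to the concert together\n'
        'Utterance: can you pick me up tonight to go to the concert together')

print(average_likelihood('xlarge', text))                                   # baseline model
print(average_likelihood('a853218c-30ef-4b3d-83be-b037d669029b-ft', text))  # custom model
```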
And with our models, we get the following likelihood values, which again, point to the custom model performing better at this specific task.
- Baseline model: -2.58
- Custom model: -1.37
Final Thoughts
Custom models are a powerful concept when working with the Cohere Platform. You get a nice balance: the complexity of working with a managed LLM is abstracted away, yet you retain the flexibility to elevate its performance on a specific task. Even if you are only experimenting with basic demos, and especially if you are considering moving to production, this is a useful option.
Start training your custom models from your Cohere dashboard.