I still get a kick out of playing with Large Language Models. Outside of technical circles, many people don’t know about them, so it’s even more fun when I show them my conversations with language AI. To them, it seems incredible, but the reality is that a language model is a prediction engine.
Here’s how it works. The model takes a string as input (the prompt) and predicts which words should follow. Behind the scenes, it assigns probabilities to the many words that could come next. The raw output of the Generate model is a giant list of possible words and their probabilities. It returns only one of those words, chosen according to the parameters you set.
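To make the "list of words and probabilities" idea concrete, here is a minimal sketch in Python. The candidate words and their probabilities are made up for illustration, and `pick_next` is a hypothetical helper, not part of any Cohere library.

```python
import random

# Hypothetical next-word probabilities for the prompt "The sky is".
# These numbers are invented for illustration; a real model scores
# a much larger list of candidates.
next_word_probs = {
    "blue": 0.45,
    "the limit": 0.30,
    "overcast": 0.20,
    "tarnished": 0.05,
}

def pick_next(probs, rng):
    """Return one candidate, chosen in proportion to its probability."""
    candidates = list(probs)
    weights = list(probs.values())
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded so repeated runs behave the same
samples = [pick_next(next_word_probs, rng) for _ in range(1000)]
print({word: samples.count(word) for word in next_word_probs})
```

Run it a thousand times and the likely words dominate, while the unlikely ones show up only occasionally.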
In this post, we’re going to cover what those parameters are and how you can tweak them to get the best outputs.
Model Size
The performance of a pre-trained language model depends largely on its size. As a rule, the bigger the model, the better the quality of the output, but that quality comes at the cost of speed and money.
As a Cohere user, you have access to four models, aptly named small, medium, large, and xlarge. You can read more about them here.
The smaller models are cheaper to use and return an output faster than the larger models. However, they’re not as powerful, so they’re better used for simpler tasks, like classification, while the larger models are useful for creative content generation.
You may also want to consider fine-tuning a smaller model for particular tasks, like sentiment analysis of tweets. This allows you to balance getting a more accurate output with lower costs and faster speeds.
Number of Tokens
I said earlier that the language model builds a list of words and their probabilities as outputs. This is technically incorrect. It builds a list of tokens. A token is roughly 4 characters of text, but not always.
For example, a word like “water” might end up being one token, whereas larger words might be broken up into multiple tokens. At Cohere, we use byte-pair encoding to create tokens.
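Byte-pair encoding builds its vocabulary by repeatedly merging the most frequent pair of adjacent symbols. The sketch below shows that merge loop on a toy corpus. It is a simplified illustration, not Cohere's tokenizer: the corpus, the number of merge rounds, and the helper names are all assumptions.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

# Toy corpus: each word starts out as a list of single characters.
original_words = ["water", "waterfall", "wait", "later"]
corpus = [list(word) for word in original_words]

for _ in range(4):  # four merge rounds, chosen arbitrarily for the demo
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = merge_pair(corpus, pair)

print(corpus)
```

After a few rounds, frequent character sequences collapse into single symbols, which is why a common word can end up as one token while a rarer, longer word is split into several.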
You probably don’t want the language model to keep generating outputs ad infinitum, so the number of tokens parameter allows you to set a limit on how many tokens are generated.
There’s also a natural limit to the number of tokens the model can produce. Smaller models can go up to 1024 tokens while larger models go up to 2048 tokens.
It’s not recommended to hit those limits though. If you’re generating content using a large limit, the model may go off in a direction you’re not expecting. It’s generally recommended to generate in short bursts versus one long burst.
Temperature
Temperature is a close second to prompt engineering when it comes to controlling the output of the Generate model. It determines how creative the model should be.
Consider the phrase “The sky is”. When you read that, you probably think the next word will be “blue” or “the limit”. You’re also unlikely to think the next word will be “water” or “tarnished”. You’ve subconsciously created predictions for words that can follow and determined that “blue” is most likely while “tarnished” is least likely (unless you play a lot of Elden Ring).
And that’s essentially what the Generate model does. It has probabilities for all the different words that could follow and then selects the next word to output. The Temperature setting controls how willing it is to pick the less likely ones.
A Temperature of 0 makes the model deterministic. It limits the model to the word with the highest probability, so you can run it over and over and get the same output. As you increase the Temperature, the limit softens, allowing the model to use words with lower and lower probabilities. At a Temperature of 5, low-probability words are picked far more often than they otherwise would be, and it might generate “tarnished” if you run it enough times.
Running Generate on the prompt “The sky is” at a Temperature of 0 gives us the same output each time - The sky is the limit with this one.
At Temperature 0.5, we see more variety, although still fairly standard -
The sky is the limit
The sky is blue
The sky is overcast.
At Temperature 1, things start getting interesting, but nothing too crazy -
The sky is not the limit
The sky is almost perfectly blue
The sky is grey and dreary today
At Temperature 5, the highest setting, we enter the realm of fantasy -
The sky is clear, the water smooth, and it's an unimaginably long way to go before the dolphins decide to give up their vertical quest.
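One common way to implement Temperature is to divide the model's raw scores (logits) by the Temperature before converting them to probabilities. The sketch below assumes that formulation. The scores are made up, and treating a Temperature of 0 as "always pick the top token" is a convention assumed here rather than something the Generate model is documented to do internally.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature == 0:
        # Treat 0 as greedy decoding: all probability on the top-scoring token.
        best = max(logits, key=logits.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical raw scores for continuations of "The sky is".
logits = {"the limit": 3.0, "blue": 2.5, "overcast": 1.0, "tarnished": -2.0}

for t in (0, 0.5, 1, 5):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

In this formulation a high Temperature flattens the distribution rather than inverting it: "tarnished" never becomes the most likely choice, but its share of the probability grows enough that it starts to appear.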
Top-k and Top-p
Aside from Temperature, Top-k and Top-p are the two other ways to pick the output token.
Top-k tells the model to pick the next token from the top ‘k’ tokens in its list, sorted by probability.
Consider the input phrase - “The name of that country is the”. The next token could be “United”, “Netherlands”, “Czech”, and so on, with varying probabilities. There may be dozens of potential outputs with decreasing probabilities but if you set k as 3, you’re telling the model to only pick from the top 3 options.
So if you run the same prompt a bunch of times, you’ll get United very often and a smattering of Netherlands or Czech, but nothing else.
If you set k to 1, the model will only pick the top token (United, in this case).
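Top-k can be sketched as a filter over the probability list: sort, keep the first k entries, and rescale what is left. The probabilities below are hypothetical and list only the first few candidates. `top_k_filter` is an illustrative helper, not a Cohere function.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Hypothetical probabilities for the token after
# "The name of that country is the". Only the first few candidates are listed.
probs = {
    "United": 0.12,
    "Netherlands": 0.027,
    "Czech": 0.019,
    "Dominican": 0.015,
    "Philippines": 0.010,
}

print(top_k_filter(probs, 3))  # only United, Netherlands, and Czech remain
print(top_k_filter(probs, 1))  # only United remains
```

The model then samples from the filtered list, so anything outside the top k can never be picked.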
Top-p is similar but picks from the top tokens based on the sum of their probabilities. So, for the previous example, if we set p to 0.15, it will only pick from United and Netherlands, as their probabilities add up to 14.7% and adding the next token would push the total past 15%.
Top-p is more dynamic than top-k and is often used to exclude outputs with lower probabilities. So if you set p to 0.75, the model picks only from the most likely tokens that together account for 75% of the probability, and excludes the long tail that makes up the remaining 25%.
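Here is a sketch of top-p that follows the description above: walk down the sorted list and keep adding tokens while the running total stays within p. The probabilities are hypothetical, chosen so the top two add up to 14.7%, and `top_p_filter` is an illustrative helper. Many implementations instead keep the smallest set whose total reaches at least p, which can include one more token at the boundary.

```python
def top_p_filter(probs, p):
    """Keep the most probable tokens while their cumulative probability stays within p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, running = [], 0.0
    for tok, prob in ranked:
        # Always keep the top token; after that, stop before exceeding p.
        if kept and running + prob > p:
            break
        kept.append((tok, prob))
        running += prob
    # Renormalize so the kept tokens sum to 1.
    return {tok: prob / running for tok, prob in kept}

# Hypothetical probabilities; only the first few candidates are listed.
probs = {
    "United": 0.12,
    "Netherlands": 0.027,
    "Czech": 0.019,
    "Dominican": 0.015,
    "Philippines": 0.010,
}

print(top_p_filter(probs, 0.15))  # United and Netherlands (0.147 in total)
print(top_p_filter(probs, 0.05))  # United only, since the top token is always kept
```

Unlike top-k, the number of tokens that survive depends on the shape of the distribution: a confident model keeps only one or two candidates, while an uncertain one keeps many.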
Stop Sequences
A stop sequence is a string that tells the model to stop generating more content. It is another way to control how long your output is.
So, for example, if I prompt the model with “The sky is” and I enter a full stop (.) as a stop sequence, the model stops generating text once it reaches the end of the first sentence, even if the number of tokens limit is much higher.
This pairs well with prompts where you include a couple of examples. Say you want to generate text in a certain pattern: you add a certain string after each example, and then use that string as a stop sequence.
For example, suppose we want the model to stop after generating a hashtag. We don’t want it to keep going and generate new posts and hashtags. So we split our examples with the string ‘--’ and use that as our stop sequence. You can see what I mean if you try this in the Cohere playground with and without the stop sequence.
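The effect of a stop sequence can be imitated after the fact by cutting the text at the first place the string appears. The sketch below does exactly that. The raw output is a made-up example of what a model might produce without a stop sequence, and `truncate_at_stop` is a hypothetical helper. In practice the model stops generating on its own, and whether the stop string itself is included in the output depends on the API.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

# Made-up output from a few-shot hashtag prompt with no stop sequence set:
# after the first hashtag, the model carries on inventing a new post.
raw_output = (
    " #LaunchDay\n"
    "--\n"
    "Post: Our new office is open\n"
    "Hashtag: #NewBeginnings\n"
    "--"
)

print(repr(truncate_at_stop(raw_output, ["--"])))
```

With ‘--’ as the stop sequence, only the first hashtag survives and the invented follow-up post is dropped.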
Frequency and Presence Penalties
The final set of parameters is the frequency and presence penalties.
The frequency penalty penalizes tokens that have already appeared in the preceding text (including the prompt), and scales based on how many times that token has appeared. So a token that has already appeared 10 times gets a higher penalty (which reduces its probability of appearing) than a token that has appeared only once.
The presence penalty applies the same penalty regardless of frequency. As long as the token has appeared at least once before, it gets penalized.
These settings are useful if you want to get rid of repetition in your outputs.
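The difference between the two penalties is easiest to see side by side. The sketch below assumes a simple subtractive scheme, where the penalty is subtracted from a token's raw score before sampling. That scheme, the scores, and the `apply_penalties` helper are assumptions for illustration. The exact formula the Generate model uses may differ.

```python
from collections import Counter

def apply_penalties(logits, previous_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the scores of tokens that already appear in the preceding text."""
    counts = Counter(previous_tokens)
    adjusted = {}
    for tok, score in logits.items():
        seen = counts[tok]
        score -= frequency_penalty * seen               # grows with every repeat
        score -= presence_penalty * (1 if seen else 0)  # flat, applied at most once
        adjusted[tok] = score
    return adjusted

# Hypothetical scores for three candidate tokens, all equally likely to start with.
logits = {"sky": 2.0, "blue": 2.0, "sea": 2.0}
history = ["sky", "sky", "sky", "blue"]  # tokens that have already appeared

print(apply_penalties(logits, history, frequency_penalty=0.5))
print(apply_penalties(logits, history, presence_penalty=0.5))
```

With the frequency penalty, "sky" (seen three times) drops much further than "blue" (seen once). With the presence penalty, both drop by the same amount, and "sea" is untouched in either case.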
Play With The Parameters
There is no right or wrong way to set these parameters when you’re generating text with Large Language Models. It often depends on what you’re trying to achieve with the model, and the best way to figure out what set of parameters to use is to experiment.
In many cases, trying out different Temperature settings is good enough and you won’t need to touch the other parameters. However, if you have a specific output in mind and want finer control over what the model generates, start using top-k, top-p, and the penalties to get it just right.
The Cohere Playground is the perfect place to experiment and figure this out before you deploy it to production.