Large language models (LLMs) are awesome. We all know it. But when working with LLMs, there are a few challenges to watch out for and common mistakes to avoid. So if you are considering adopting LLMs in your setup, this blog post will help you prepare for your mission. We’ve put together a list of five common LLM challenges, and we’ll discuss how best to address them.

In brief, the five common LLM challenges include:

Understanding model limitations
Choosing your model’s endpoint
Finetuning the model to your task
Choosing the right set of parameters
Designing prompts for the model

Ok, let’s dive deeper into each of these.

1. Understanding model limitations is super important.

Large language models have limitations, and understanding them is a necessary step towards successful development. Let’s go through the most common limitations that one can experience when working with LLMs.

Model bias

Language models learn the statistical relationships that are present in their training datasets, and these may include toxic language and historical biases along race, gender, sexual orientation, ability, language, cultural, and intersectional dimensions. At Cohere, we are committed to anticipating and accounting for risks during our development process and creating structures that allow us to quickly mitigate unexpected outputs when they occur. We have also partnered with OpenAI and AI21 Labs to create best practices for any organization developing or deploying large language models. However, despite our ongoing efforts to remove discriminatory, exclusionary, hateful language and the like from the training corpus, our models can generate toxic text or act as if they learned social biases. To illustrate potential biases or gaps, we provide docs on the collection and curation of our dataset.

Generate

Developers using Cohere’s Generation model that powers the Generate endpoint should take model toxicity and bias into account and design applications carefully to avoid harmful completions that reinforce historical social biases.

Despite our ongoing efforts to remove harmful text from the training corpus, models may generate toxic text. This text may include obscenities, sexually explicit content, and messages that mischaracterize or stereotype groups of people based on problematic historical biases perpetuated by internet communities.

We have put safeguards in place to avoid generating harmful text, but we highly recommend building additional guardrails to ensure that text presented to end users is not toxic or harmful.
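As a rough illustration of an additional guardrail, here is a minimal sketch that checks generated text against a blocklist before showing it to a user. The terms in BLOCKED_TERMS and the function names are hypothetical placeholders; a real application would more likely rely on a maintained word list or a trained toxicity classifier.

```python
import re

# Hypothetical placeholder terms, for illustration only. A real deployment
# would use a maintained list or a toxicity classifier instead.
BLOCKED_TERMS = {"badword", "slur"}


def is_safe(text, blocked_terms=BLOCKED_TERMS):
    """Return True if none of the blocked terms appear as whole words."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words.isdisjoint(blocked_terms)


def guard(text, fallback="Sorry, I can't help with that."):
    """Pass safe text through unchanged; replace unsafe text with a fallback."""
    return text if is_safe(text) else fallback
```

A word-level check like this is easy to bypass, so treat it as one layer among several rather than a complete safeguard.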

Language models also capture problematic associations and stereotypes prominent on the internet and in society at large. They should not be used to make decisions about individuals or the groups they belong to. For example, it is dangerous to use the Generation model outputs in CV ranking systems due to known biases (Nadeem et al., 2020).

Embed and Classify

There is extensive research demonstrating that language model embeddings learn social biases (Bolukbasi et al., 2016; Manzini et al., 2019; Kurita et al., 2019; Zhao et al., 2019). Developers using the Representation model that powers Embed and Classify endpoints should take this into account when building downstream text classification systems. Embeddings may inadvertently capture inaccurate associations between groups of people and attributes such as sentiment or toxicity. Using embeddings in downstream text classifiers may lead to biased systems that are sensitive to demographic groups mentioned in the inputs. For example, it is dangerous to use embeddings in CV ranking systems due to known gender biases in the representations (Kurita et al., 2019).

Factual knowledge

Models are trained on textual data that was scraped from the Internet and other sources at a specific point in time. Their knowledge about the world comes from this data, which inevitably limits it. You can prompt our Generate model to produce believable outputs on the artistic body of work by Monet, but it hasn’t actually seen a Monet painting, and its knowledge and experience are limited to the textual data it can access. You may also find the model lacking knowledge of the most recent events, depending on when it was trained and whether its latest release includes the updated information you’d like it to refer to. On top of that, the models can simply make up facts in their outputs as they go along if not prompted otherwise.

To make sure that your LLM has access to specific, up-to-date industry knowledge, we recommend including the relevant information in your prompt.
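One way to put relevant information into the prompt is to list the facts ahead of the question. The sketch below shows the idea; the function name, the prompt wording, and the “Facts / Question / Answer” layout are assumptions made for illustration, not a prescribed format.

```python
from typing import List


def build_grounded_prompt(question: str, facts: List[str]) -> str:
    """Prepend known facts to a question so the model can draw on them."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Answer the question using only the facts below. "
        "If the facts are not sufficient, say you don't know.\n\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The resulting string can then be sent to a text generation endpoint as the prompt.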

Common sense and logic

It may be tempting to assume that behind your LLM’s linguistic finesse, there is reasoning similar to that of humans. However, LLMs lack human logic and common sense. If you are working with a Generate model, you may find that the model’s text output does not form a coherent paragraph on the first try. This is because the model is trained to predict the most likely next word in a sequence of text, rather than to produce sentences connected in a logical way that makes sense to humans. The model may also fail to perform simple reasoning tasks involving basic arithmetic and chronology.

If you’re getting an incoherent output from the model, give it a couple more tries or experiment with different parameters or a different prompt. Recently developed methods in this field include chain-of-thought prompting, which can significantly improve the model’s ability to handle reasoning tasks.
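Chain-of-thought prompting generally means showing the model worked examples that spell out the intermediate reasoning before the final answer. Here is a minimal sketch of how such a prompt could be assembled; the function name and the “Q: / A: / The answer is” layout are assumptions chosen for illustration.

```python
def build_cot_prompt(question, worked_examples):
    """Build a few-shot prompt whose examples include step-by-step reasoning.

    worked_examples is a list of (question, reasoning, answer) tuples.
    """
    parts = []
    for ex_question, ex_reasoning, ex_answer in worked_examples:
        parts.append(
            f"Q: {ex_question}\nA: {ex_reasoning} The answer is {ex_answer}."
        )
    # Leave the final answer open so the model continues with its own reasoning.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


examples = [
    (
        "A shop has 3 boxes with 4 apples each. How many apples are there?",
        "There are 3 boxes and each holds 4 apples, so 3 times 4 is 12.",
        "12",
    )
]
prompt = build_cot_prompt("A bus has 5 rows of 2 seats. How many seats?", examples)
```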

Context window

Your model has a limited “memory span”: its context window is capped at a specific number of tokens. If you want your model to process text that is longer than the maximum length it supports, the text will be split, and the model will fail to connect its parts in a meaningful way.

Be sure to check the maximum token length supported by your model.
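If your text is longer than the limit, one common workaround is to split it into overlapping chunks yourself, so that each chunk carries a little of the preceding context. The sketch below uses whitespace-separated words as a rough stand-in for tokens, which is an assumption: real tokenizers count differently, so leave some headroom below the true limit.

```python
def split_into_chunks(text, max_tokens, overlap=0):
    """Split text into chunks of at most max_tokens words.

    Consecutive chunks share `overlap` words. Words are only a rough
    proxy for model tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        # Stop once the current chunk has reached the end of the text.
        if start + max_tokens >= len(words):
            break
    return chunks
```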

You can learn more about our Generation and Representation models in our model cards and find a more comprehensive list of model limitations in our docs.

2. Choosing your model endpoint(s) is one of the key decisions you’ll make.

How do you decide which endpoint to choose to solve your problem? Instead of choosing a single endpoint, would it be more suitable to create a chain of endpoints in order to make the most of the models? To be honest, the sky’s the limit here. The answer will depend on your use case. Here’s the scoop on Cohere endpoints:

Generate

This endpoint generates realistic text conditioned on a given input. The model behind the Generate endpoint is trained on vast amounts of text spanning all topics and industries. You can use it to solve problems like text summarization and entity extraction.

Embed

This endpoint returns text embeddings. An embedding is a list of floating point numbers that captures semantic information about the text that it represents. Embeddings can be used to create text classifiers, power semantic search, and cluster large amounts of text.
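To give a feel for how embeddings can power semantic search, here is a small sketch that compares vectors with cosine similarity, a commonly used measure. The tiny two-dimensional vectors in the usage lines are made up for illustration; real embeddings have many more dimensions and would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_vec, doc_vecs):
    """Return the index of the document vector closest to the query."""
    return max(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
    )


# Toy vectors, invented for this example.
docs = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
best = most_similar([1.0, 0.0], docs)
```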

Classify

This endpoint classifies text into one of several classes. Classify can be very helpful in organizing information for effective content moderation, analysis, and chatbot experiences.

You can dive deeper into the endpoints in our docs. Take time to consider the endpoint choice for your use case, and don’t hesitate to ask questions about it in our Discord community.
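To make the idea of chaining endpoints more concrete, here is a minimal sketch of a pipeline that classifies a message first and only generates a reply when that makes sense. The classify_fn and generate_fn arguments are hypothetical stand-ins for whatever client calls you use; they are not functions from the Cohere SDK, and the “spam” label is an assumption.

```python
def handle_message(message, classify_fn, generate_fn):
    """Classify a message, then generate a reply unless it is spam.

    classify_fn: callable taking text and returning a label string.
    generate_fn: callable taking a prompt and returning generated text.
    """
    label = classify_fn(message)
    if label == "spam":
        return None  # Skip generation entirely for unwanted messages.
    prompt = f"Customer message ({label}): {message}\nReply:"
    return generate_fn(prompt)
```

Passing the two steps in as callables keeps the pipeline easy to test with simple stubs before wiring it to real endpoints.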

3. If you need your model to learn a domain-specific language, finetune it! And do it right.

Finetuning is the process of taking a pre-trained LLM and customizing it with a dataset that will enable it to excel at a specific task. A baseline LLM already comes pre-trained with a huge amount of text data. Finetuning builds on that by taking in, and adapting to, your own training data. Finetuning can help you achieve the best performance from the model, especially if your use case involves domain-specific language and knowledge.

For example, when building a customer support chatbot for a bank, you’ll get the best results if you finetune the model using data that includes key concepts and terms from your industry. This will help the chatbot interact successfully with the bank’s customers.

Common finetuning mistakes include:

4. Your set of parameters can make or break the project. Tweak them until satisfied.

Finding the optimal set of model parameters may take some experimentation and tweaking. Depending on what task you are trying to accomplish with the model, the key parameters to consider are:

Model size

Models come in sizes ranging from small to extra large. The bigger the model, the more powerful it can be at solving your task, but also the more costly and time-consuming it is to process your query. As a user of the Cohere platform, you are charged based on the model size and the number of characters or queries, so you need to balance price against output quality to stay within your budget.
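A quick back-of-the-envelope calculation can help when weighing model size against budget. In the sketch below, the prices are hypothetical placeholders in arbitrary units per 1,000 characters; they are not real pricing, so substitute the figures from your provider’s pricing page.

```python
# Hypothetical prices in arbitrary units per 1,000 characters.
# These numbers are placeholders, NOT real pricing.
PRICE_PER_1K_CHARS = {"small": 1, "medium": 2, "large": 5, "xlarge": 10}


def estimate_cost(model_size, num_chars, prices=PRICE_PER_1K_CHARS):
    """Estimate the cost of processing num_chars with a given model size."""
    if model_size not in prices:
        raise ValueError(f"unknown model size: {model_size}")
    return prices[model_size] * num_chars / 1000
```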

Number of tokens

If you are using the Generate endpoint, the number of tokens will determine how much text the model will generate. Depending on the task that you are trying to accomplish with the model, the desired length of the generations will vary. It is common to request more tokens than required and then run additional processing to retrieve the desired output.

You can determine a good number of tokens simply by guessing and checking using our playground.
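When you request more tokens than you need, the generation often stops mid-sentence. One simple post-processing step is to cut the text back to the last complete sentence. The sketch below is one possible approach; it treats any “.”, “!” or “?” as a sentence end, which is an assumption that will misfire on things like decimals and abbreviations.

```python
def trim_to_last_sentence(text):
    """Drop any trailing partial sentence from generated text.

    If no sentence-ending punctuation is found, the text is returned
    unchanged apart from stripped whitespace.
    """
    end = max(text.rfind(mark) for mark in ".!?")
    if end == -1:
        return text.strip()
    return text[: end + 1].strip()
```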

Temperature

If you are using the Generate endpoint, temperature will help you achieve the right level of creativity in your outputs. Temperature is a parameter that tunes the degree of randomness of the model output, so that the same prompt may yield different outputs each time you hit "generate". The higher the temperature, the higher the randomness of the output.

Read more about temperature in our docs.
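To build intuition for temperature, here is a sketch of how it is commonly applied when sampling: the model’s raw scores (logits) are divided by the temperature before being turned into probabilities. This illustrates the general technique and is not a description of how any particular endpoint implements it; the logits in the usage lines are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by temperature.

    Lower temperatures sharpen the distribution (less random output);
    higher temperatures flatten it (more random output).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
focused = softmax_with_temperature(logits, 0.5)
varied = softmax_with_temperature(logits, 2.0)
```

With the lower temperature, the top-scoring token takes a larger share of the probability, so repeated runs are more likely to pick it.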

5. Designing prompts is learning how to communicate with your model.

Prompt design, also known as prompt engineering, is a key part of working with LLMs for text generation. Your model needs to understand really well what you expect from its outputs. To increase the quality of generation outputs, take time to provide it with enough context, describe your task, and include useful examples in the prompt.

Common mistakes in designing the prompt include:

Not enough context in the task description
Lack of output indicator (letting the model know what kind of output you expect it to generate)
Formatting errors. Make sure to spell check, remove unnecessary spaces, etc.

See our prompt engineering doc for instructions on how to construct the best prompts for your task.
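The mistakes listed above suggest a simple prompt structure: a task description for context, a few examples, and an output indicator at the end. Here is a minimal sketch of a helper that assembles such a prompt and strips stray whitespace along the way. The function name, the “Text:” label, the “--” separator, and the default “Sentiment” indicator are all assumptions for illustration.

```python
def build_prompt(task_description, examples, new_input, output_label="Sentiment"):
    """Assemble a few-shot prompt: description, examples, output indicator.

    examples is a list of (text, label) pairs.
    """
    lines = [task_description.strip(), ""]
    for text, label in examples:
        lines += [
            f"Text: {text.strip()}",
            f"{output_label}: {label.strip()}",
            "--",
        ]
    # End on the output indicator so the model knows what to produce next.
    lines += [f"Text: {new_input.strip()}", f"{output_label}:"]
    return "\n".join(lines)
```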

Conclusion

Whether you are at the beginning of your LLM journey or already well underway, take your time to explore the different challenges that may cross your path. Let’s recap some of the key takeaways:

Always keep an eye on the model limitations. Understanding them is essential to keeping your project healthy, and they are shifting fast with the latest research and model releases.
Take your time to think about your project architecture and the different endpoints you may use. Choosing your model endpoint(s) is key.
Finetuning can help you adjust the model to your specific use case. Use it. Wisely.
Tweak your set of parameters until you are happy with the outcome.
Learn the art of conversation with the model through smart prompt design. It will pay off.

That’s it! Best of luck with building something awesome.