Develop, test, and experiment with the industry’s first multilingual text understanding model that supports 100+ languages

Humans speak over 7100¹ languages, yet the majority of language models only support the English language. This makes it incredibly challenging to build products and projects using multilingual language understanding. Cohere’s mission is to solve that by empowering our developers with technology that possesses the power of language. That’s why today we’re introducing our first multilingual text understanding model that supports over 100 languages and delivers 3X better performance than existing open-source models. This will enable new markets, countries, and global companies to better serve their customers across the globe.

What is a Multilingual Text Understanding Model?

Embedding models translate text into numeric representations. This enables advanced language understanding capabilities like searching by meaning and categorizing text.

Multilingual text understanding models are powerful models that can derive insights from text data across languages. At Cohere, we’ve trained our model specifically to be used for search, content aggregation and recommendation, and zero-shot cross-lingual text classification.

While many of these models are available for English, similar existing multilingual models only work well for short sentences and can’t capture the meaning behind longer text. This prevents them from being used for semantic search, which typically aims to match a short query with a longer, relevant document.

In this blog post, we will cover three relevant use cases that showcase the power of Cohere’s new multilingual model:

Multilingual Semantic Search: To improve the quality of search results, Cohere’s multilingual model can produce fast, accurate results regardless of the language used in the search query or source content.
Aggregate Customer Feedback: Cohere’s multilingual model can be deployed to organize customer feedback across hundreds of languages, simplifying a major challenge for international operations.
Cross-Lingual Zero-Shot Content Moderation: Identifying harmful content in online global communities is challenging. By training a classifier on the model’s embeddings of a few English examples, developers can detect harmful content in 100+ languages.

How Does the Multilingual Text Understanding Model Work?

Cohere’s multilingual text understanding model maps text to a semantic vector space (also known as “embeddings”), positioning texts with a similar meaning in close proximity. This process unlocks a range of valuable use cases for multilingual settings. For example, one can map a query to this vector space during a search to locate relevant documents nearby. This often yields search results that are several times better than keyword search.
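As a rough illustration of "locating relevant documents nearby," the sketch below ranks documents by cosine similarity to a query vector. The three-dimensional vectors are made-up stand-ins for real embeddings, and the helper names `cosine_similarity` and `search` are hypothetical, not part of the Cohere SDK.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for real embeddings (hypothetical values).
doc_vecs = [
    [0.9, 0.1, 0.0],  # e.g. an article about Washington, D.C.
    [0.1, 0.9, 0.1],  # e.g. an article about capital punishment
    [0.0, 0.2, 0.9],  # e.g. an unrelated article
]
query_vec = [0.8, 0.2, 0.1]  # e.g. "What is the capital of the United States?"

results = search(query_vec, doc_vecs, top_k=2)
for index, score in results:
    print(index, round(score, 3))
```

In practice the vectors would come from the embedding model, and a large collection would typically use an approximate nearest-neighbor index rather than this exhaustive scan.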

To train multilingual models, you need large quantities (hundreds of millions) of suitable training pairs, like question/answer pairs. So far, such training data has been primarily available in English, and prior work tried to use machine translation to map it to other languages. However, these models don’t capture the nuances behind language usage in different countries.

Contrary to this approach, we collected a dataset of nearly 1.4 billion question/answer pairs across tens of thousands of websites in hundreds of languages. These are questions actually asked by speakers of said languages, allowing us to capture language- and country-specific nuances.

“I strongly believe that embeddings are the future of search and recommendation. Thanks to the new Cohere multilingual model and the text2vec Cohere module in Weaviate, we can bring this to developers worldwide with a single command.”
- Bob van Luijt, CEO at SeMI Technologies

So, today we are happy to release our first multilingual embeddings model: multilingual-22-12!

The multilingual-22-12 model can be used to semantically search within a single language, as well as across languages. Compared to keyword search, where you often need separate tokenizers and indices to handle different languages, the deployment of the multilingual model for search is trivial: no language-specific handling is needed — everything can be done by a single model within a single index. We’re extremely proud of the performance of our multilingual understanding model. It outperformed the industry standard (the next best model) in search tasks by more than 230%.

Benchmarks

We extensively benchmarked our new model to ensure the best performance across a wide range of applications, domains and languages. Specifically, we used:

Clustering: We benchmarked nine tasks/datasets across 12 languages using MTEB. Performance is measured using v-measure.
English Search: We benchmarked on eight datasets from BEIR, the industry standard benchmark for evaluating the search capabilities of models with a special focus on out-of-domain performance (i.e., without seeing training data for these tasks). For search, we use nDCG@10.
Multilingual Search: We benchmarked on two datasets from BEIR, 10 datasets from Mr. Tydi, and 14 datasets from MIRACL. The benchmark consists of 16 languages from various language families and alphabets: Arabic, Bengali, Finnish, French, German, Hindi, Indonesian, Japanese, Korean, Persian, Russian, Spanish, Swahili, Telugu, Thai, and Vietnamese. All of these benchmarks have been created by native speakers on original text.
Cross-Lingual Classification: We tested how well our model can learn from English training data and then apply that understanding to text classification problems in other languages. Here, we used the Amazon MASSIVE dataset, which contains utterances for 60 intents in 51 languages for Amazon Alexa. We trained with 10 English examples per intent, and then computed accuracy for the 50 other languages. The models hadn’t seen any training data in other languages.
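For readers unfamiliar with the search metric named above, here is a minimal sketch of nDCG@10 using one common formulation (linear gain with a log2 rank discount). It is illustrative only; the benchmark toolkits may differ in details, and the function names and relevance grades are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain of the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_relevances: grades of the returned documents, in ranked order.
    judged_relevances: grades of all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical query: two relevant documents exist. The system returns one
# at rank 1, a non-relevant document at rank 2, and the other at rank 3.
score = ndcg_at_k([1, 0, 1], [1, 1, 0])
print(round(score, 4))
```

A score of 1.0 means the relevant documents appear in the best possible order within the top 10; relevant documents that appear lower in the ranking pull the score down.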

We compared our results against other state-of-the-art multilingual embedding models, specifically paraphrase-multilingual-mpnet-base-v2 (the best model from Sentence-Transformers), LaBSE (from Google), and Universal Sentence Encoder cMLM (from Google). The following chart shows how they compare:

The Cohere multilingual-22-12 model performs much better in all use cases. In particular, we see a robust improvement in multilingual search. The other models we tested against perform rather poorly, in many cases worse than keyword search. The main reason is that these models were trained only at the sentence level, so they cannot produce meaningful embeddings for longer text, like paragraphs.

Use Case 1: Multilingual Semantic Search

Traditional keyword search has its limitations in that it often doesn’t find the relevant information that matches the user’s search intent. For instance, the simple search query, “What is the capital of the United States?” produced the following results:

Keyword search with Elasticsearch ranks an article about capital punishment at the top position, because it contains many instances of the words capital, united, and states. The second- and third-ranked results are not much better: they are articles about Ohio and Nevada, which are U.S. states that each have a capital.

By contrast, search quality can be greatly improved by using semantic search powered by Cohere’s multilingual model. In this example, the relevant article about Washington, D.C., is ranked at the top position of the search results. This gain in search quality holds not only for English queries but also for the wide range of languages we tested (see benchmark results).

Semantic search is not restricted to queries and documents in the same language; it also works across languages. For example, if we phrase the search query in Arabic (“ما هي عاصمة الولايات المتحدة؟”), we get the same results, while keyword search cannot retrieve any relevant documents.

The multilingual flexibility of semantic search enables interesting use cases in industries like finance, where users need to quickly find information that may be published across multiple languages.

“At ML6, we see that multilingualism remains a major challenge in an English-centric NLP landscape, especially in Europe. Naturally, we are actively on the lookout for solutions and have been impressed by what we’ve seen from Cohere thus far!”
- Matthias Feys, Co-Founder & CTO at ML6

Use Case 2: Customer Feedback Aggregation

When successful products like the iPhone launch, tens of thousands of users around the world post their feedback (in their own language) on eCommerce sites, social media, blogs, and elsewhere. Extracting insights from these reviews enables companies to quickly respond to the market, better understand their customer base, and improve their product roadmap.

However, previous methods for content aggregation have only worked well for English, and they didn’t allow users to see patterns across languages or to compare feedback from different markets.

Cohere’s multilingual model maps text in different languages to the same vector space, allowing users to derive insights across languages and find patterns for specific markets (e.g., which markets care about the picture quality of smartphones).
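One simple way to surface such patterns is to assign each piece of feedback to its nearest topic in the shared vector space and count the results per market. The sketch below assumes the feedback and topic vectors have already been produced by the embedding model; the two-dimensional values, market codes, and topic names are made up for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_topic(vec, topic_vecs):
    """Name of the topic whose vector is most similar to vec."""
    return max(topic_vecs, key=lambda name: cosine(vec, topic_vecs[name]))

# Hypothetical toy vectors; real ones would come from the embedding model.
topic_vecs = {"picture quality": [1.0, 0.0], "battery life": [0.0, 1.0]}
feedback = [
    ("DE", [0.9, 0.2]),  # German-language review
    ("DE", [0.8, 0.1]),
    ("JP", [0.1, 0.9]),  # Japanese-language review
    ("JP", [0.7, 0.3]),
]

# Count how often each topic comes up in each market.
counts = Counter((market, nearest_topic(vec, topic_vecs)) for market, vec in feedback)
for (market, topic), n in sorted(counts.items()):
    print(market, topic, n)
```

Because all languages share one vector space, the same topic vectors can be compared against feedback written in any supported language. An unsupervised clustering algorithm could replace the fixed topic list when the topics are not known in advance.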

Using thousands of commands directed at a digital assistant, we created a representation demo and video to demonstrate how you can visualize and pull insights from multilingual customer feedback.

Use Case 3: Cross-Lingual Zero-Shot Content Moderation

In today’s world, content moderation remains a major challenge. Social platforms, such as online gaming communities, are attracting a wider international audience, which increases the complexity of content moderation efforts. As hateful content spreads across multiple languages, it is more likely to slip past current content moderation tools that only catch English comments.

To tackle this challenge, platforms can now use Cohere’s multilingual embeddings model to build a content moderation tool that works across 100+ languages and only requires training data in English.

For the content moderation use case, the model just needs a handful of training examples in one language that demonstrate harmful and acceptable content. Developers can then train a classifier (in English or a language of their choice) to find the decision boundary in the vector space that helps determine which type of content — in 100+ languages — is undesirable on their platform.
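To make the idea of a decision boundary in the vector space concrete, here is a minimal sketch that uses a nearest-centroid classifier as a stand-in. The post does not specify which classifier to use, so this choice is an assumption, and the two-dimensional vectors are made-up placeholders for embeddings of English training examples.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(examples):
    """examples: list of (vector, label) pairs. Returns {label: centroid}."""
    by_label = {}
    for vec, label in examples:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def predict(model, vec):
    """Label of the centroid closest to vec."""
    return min(model, key=lambda label: squared_distance(vec, model[label]))

# Hypothetical embeddings of a handful of English training examples.
train = [
    ([0.9, 0.1], "harmful"),
    ([0.8, 0.2], "harmful"),
    ([0.1, 0.9], "acceptable"),
    ([0.2, 0.8], "acceptable"),
]
model = fit(train)

# Hypothetical embedding of a comment written in another language.
print(predict(model, [0.85, 0.2]))
```

Since texts with similar meaning land close together regardless of language, a boundary learned from English examples can be applied to embeddings of comments in other languages. Any standard classifier, such as logistic regression, could be substituted here.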

The following demo showcases Cohere’s multilingual model used to build a movie recommendation engine that returns relevant results regardless of the language used in the search query or the content source. In addition, it demonstrates multilingual sentiment analysis classification across a wide variety of languages.

Test out the multilingual search and recommendation demo, multilingual sentiment analysis classification demo, and watch a demonstration:

Getting Started with Cohere’s Multilingual Model

To get started using Cohere’s multilingual model, just create a free account and get your API key. You can then either query our REST API endpoints or install our SDK to use the model within Python.

import cohere

api_key = "YOUR_API_KEY"  # Replace with your own Cohere API key
co = cohere.Client(api_key)

# The same greeting in nine languages
texts = [
   'Hello from Cohere!', 'مرحبًا من كوهير!', 'Hallo von Cohere!',
   'Bonjour de Cohere!', '¡Hola desde Cohere!', 'Olá do Cohere!',
   'Ciao da Cohere!', '您好，来自 Cohere！', 'कोहियर से नमस्ते!'
]
response = co.embed(texts=texts, model='multilingual-22-12')
embeddings = response.embeddings  # One embedding vector per input text
print(embeddings[0][:5])  # First five dimensions of the first text's embedding

The following video navigates through the Cohere Platform to select the multilingual model, and shows how the multilingual model can embed text from multiple languages into the embedding space.

Additionally, we have added a new example to the Cohere Sandbox, a collection of experimental, open-source GitHub repositories that make building applications using large language models fast and easy with Cohere. There, you can find an example of how to use the Cohere API to build a multilingual semantic search engine. The search algorithm used in this project is fairly simple: it uses the co.embed endpoint to find the paragraph whose representation most closely matches that of the question.

Training Data

Training embedding models requires data in a specific format, such as question/answer pairs or title/document pairs. We can then learn which text pairs should be closer together in the vector space in order to enable applications like semantic search.
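The post does not say which training objective was used. As a generic illustration of how pairs can pull matching texts together, the sketch below computes a contrastive loss with in-batch negatives, a widely used objective for embedding models. The function names, the `scale` parameter, and the toy vectors are assumptions for this example only.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def in_batch_contrastive_loss(question_vecs, answer_vecs, scale=1.0):
    """Mean cross-entropy where answer i is the positive for question i
    and every other answer in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(question_vecs):
        scores = [scale * dot(q, a) for a in answer_vecs]
        max_score = max(scores)  # subtract the max for numerical stability
        log_sum = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum - scores[i])
    return sum(losses) / len(losses)

# Hypothetical toy batch: each question vector should match its own answer.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_answers = [[1.0, 0.0], [0.0, 1.0]]
mismatched_answers = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = in_batch_contrastive_loss(questions, aligned_answers)
mismatched_loss = in_batch_contrastive_loss(questions, mismatched_answers)
print(aligned_loss < mismatched_loss)
```

The loss is lower when each question is closest to its own answer, so minimizing it during training moves paired texts together and unpaired texts apart.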

To train Cohere’s new multilingual model, we processed and carefully cleaned terabytes of data from various sources: Wikipedia, news publications, scientific articles, and online communities across hundreds of languages. This resulted in a large training corpus of more than 900 million training pairs for English and 450 million training pairs for other languages.

Other multilingual embedding models often rely on machine translation for training dataset creation, which creates an awkward bias for these models. A lot of existing English training data has a focus on topics that are primarily interesting for U.S. citizens, for example, how to fill out specific U.S. tax forms. If these question/answer pairs are then translated into another language, e.g., Korean, the model learns in Korean how to file taxes in the U.S., but it doesn’t learn how to file taxes in Korea, a topic that is likely more relevant for Korean citizens. This makes prior models rather suboptimal for multilingual semantic search, as they don’t capture country-specific interests well.

Our training process included authentic question/answer pairs from users across hundreds of languages, drawn from tens of thousands of websites in hundreds of countries. This is what makes the Cohere multilingual model so powerful: it has seen thousands of topics in each language.

Final Thoughts

At Cohere, we are committed to breaking down barriers and expanding access to cutting-edge NLP technologies that power projects across the globe. By making our innovative multilingual language model available to all developers, we continue to move toward our goal of empowering developers, researchers, and innovators with state-of-the-art NLP technologies that push the boundaries of language AI.

Sign up and try our new multilingual model for free. If you would like to discuss your multilingual use case, please don’t hesitate to contact us.

1. “Ethnologue: Languages of the World,” Ethnologue, 2022 (accessed Nov. 25, 2022).