Nils Reimers on the Future of Semantic Search

Nils Reimers, Director of Machine Learning at Cohere, shares his vision for search in the enterprise.

All businesses use text to transfer and preserve internal company knowledge. A major challenge is that this knowledge is represented as unstructured data in emails, Word documents, meeting notes, or internal wiki systems. In most cases, this unstructured format makes it difficult (or sometimes impossible) for employees to use common search technologies like keyword search to access valuable sources of internal knowledge. The result is often a highly inefficient use of time, and work may even be unknowingly duplicated.

Imagine if companies and employees could more easily use internal company knowledge — how much more productive each could be and how much time could be saved. A big cornerstone for this is search, where we attempt to retrieve information relevant to our needs. A second cornerstone is content discovery, where intelligent systems recommend content that complements our knowledge related to a current task.

I’m extremely passionate about increasing the accessibility of search technologies and the ability to get good search results with minimal effort, especially with the use of large language models based on the Transformer architecture. This has been a long-standing research focus of mine, where I helped the field discover the pros and cons of this new technology and its application to search.

What caught my eye about Cohere was its focus on making language technology more accessible to developers. Deploying semantic search solutions still requires a lot of expert knowledge, especially once the data gets a bit more complex. My goal at Cohere is to make this exciting new technology as accessible as possible, which requires addressing some key challenges.

A big issue with lexical search is the lexical gap. If a user searches for “United States,” then relevant documents mentioning “U.S.” will not be found. This leads to a lot of frustration for users, as they have to remember the exact phrasing to find the information they’re looking for. Imagine a lawyer who works at a firm with hundreds of thousands of contracts and wants to scan them for information about non-disclosure agreements (NDAs). It would be impossible to remember the exact phrases needed to find the relevant information. Luckily, with semantic search, it is easy to find such information.
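The lexical gap is easy to demonstrate. The following sketch (with hypothetical example documents) implements the simplest possible keyword search, matching on shared tokens only, and shows how a document that says “U.S.” is invisible to a query for “united states”:

```python
# Minimal sketch of keyword (lexical) search: a document matches only if it
# shares at least one token with the query. Documents are illustrative.
def keyword_search(query, documents):
    query_tokens = set(query.lower().split())
    return [doc for doc in documents
            if query_tokens & set(doc.lower().split())]

documents = [
    "The U.S. economy grew last quarter.",
    "United States trade policy changed in 2020.",
]

# Only the document containing the literal words "United States" is returned;
# the "U.S." document is missed -- the lexical gap in action.
results = keyword_search("united states", documents)
```

Real lexical engines add stemming, stop-word removal, and scoring (e.g. BM25), but none of that bridges the gap between two different surface forms of the same concept.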

Semantic search is a search technique that is focused on understanding the intent of a user’s query and trying to find the most relevant documents for this intent. So, instead of retrieving just the documents that have some word overlap with the search query, semantic search retrieves results that are actually useful for the user. In many cases, this leads to far better search results, and users find information more quickly.

The most common technique for semantic search uses vector spaces. Here, documents are mapped to a semantic vector space to represent the knowledge stored in the text. When a query is made, semantic search maps the query to the same vector space, and it can then retrieve the relevant documents.

In contrast to keyword search, the focus is on the actual semantic content of documents. For example, if the user searches for “United States,” semantic search will also find documents talking about “U.S.” and “U.S.A.” Similarly, when the lawyer searches for “NDA,” the language model that powers semantic search will find all contracts and paragraphs with relevant information on this topic.
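The retrieval step described above can be sketched in a few lines. The 3-dimensional vectors below are hand-made toy embeddings, purely for illustration; in practice, an embedding model maps each text to a high-dimensional vector, and semantically related texts land close together regardless of exact wording:

```python
import math

# Hypothetical toy embeddings standing in for the output of an embedding
# model. U.S.-related documents are placed near each other; the unrelated
# document is placed far away.
embeddings = {
    "The U.S. economy grew last quarter.":         [0.9, 0.1, 0.0],
    "United States trade policy changed in 2020.": [0.8, 0.2, 0.1],
    "The recipe calls for two eggs.":              [0.0, 0.1, 0.9],
}

def cosine(a, b):
    # Cosine similarity: how closely two vectors point in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vector, embeddings):
    # Rank documents by the similarity of their vectors to the query vector.
    return sorted(embeddings,
                  key=lambda doc: cosine(query_vector, embeddings[doc]),
                  reverse=True)

# A query like "United States" would be embedded near both U.S.-related
# documents, so both rank above the unrelated one.
query_vector = [0.85, 0.15, 0.05]
ranked = semantic_search(query_vector, embeddings)
```

At enterprise scale, the sorted scan is replaced by an approximate nearest-neighbor index, but the principle is the same: the query and documents live in one vector space, and proximity stands in for relevance.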

The Massive Opportunity

Until recently, there had been little progress in search quality over the past 20 years, and keyword search was the best system available. Larger search engines like Google were able to generate a lot of training data from click interactions from millions of users, which powered their search systems. But for applications without millions of users, we saw little to no progress in improving search quality.

This completely changed with the introduction of the Transformer architecture in 2017. For the first time, we saw substantial quality improvements across many search tasks using little to no training data. Now, you can have a search engine that works as well as Google on your internal data.

This opens a huge untapped opportunity for semantic search for the enterprise. Semantic search can prove to be extremely beneficial across various industries, from manufacturing to finance. Any field that deals with internal documents can find significant value in better search capabilities.

However, applying these technologies to get a performant semantic search engine can still be challenging, especially when documents become longer and more complex. So far, existing methods only work well on relatively short documents of around 200 words. Also, current methods work only on text, but a lot of documents like financial reports also contain images and graphs. Furthermore, information can also be stored in semi-structured formats like spreadsheets.

What Can We Expect?

How can we efficiently contextualize these long, multimodal complex documents to provide relevant information for the user? How can we take user preferences and prior knowledge into account? How can machine learning help to discover new knowledge and present it to the user in the right format?

While we expect to see real movement and uptake as early as 2023, semantic search is something that will not only change the way we work, but also how we find information. With search, the focus is on helping people get the best results without providing a lot of training data and investment. This is a fantastic real-world, commercial application. Very soon, people will start to realize the power of semantic search, and I’m excited to help them get there.