In the #grounded-qa-bot channel in the Cohere co:mmunity on Discord, there’s a bot that will answer your questions. Ask any question in the chat, add the question mark emoji, and the co_search bot will attempt to answer the question. This answer, however, is not simply a GPT model’s output to an input prompt. It is informed by a web search.

There have been recent large language models (LLMs) from research labs that are trained on using a database or have the ability to search the web for information. Unlike those custom models, this is a bot that you, a developer who does not necessarily work at a giant tech company, can build today.

This post is the recipe.

a user asks a question, the system searches the web, then it uses web search results to generate answer. The answer is then returned to the user. — Language models are better at answering questions when augmented by web search

In this article, we go over the components that make up the co_search bot. It’s a great showcase for building Language AI systems because different steps are tackled by different prompts, as well as the combination of generation and embedding models. It also involves web search via an external search API.

To answer a question, four steps are involved: 1- specify the question and its context using Generate. 2- Search the web for an answer. 3- Identify the most relevant paragraphs in search results. 4- Ask a generative model given the relevant paragraphs. — Language AI systems can be pipelines of steps spanning text generation, embedding, similarity comparison, and other API calls (e.g., search).

Specify the Question and Its Context

By the nature of conversations, the questions we ask tend to reference previous context. A question might be impossible to answer without knowing the context.

Context is vital in conversational settings. Generation models can incorporate the relevant context before searching the web for an answer.

This is why we need to consider a number of the messages preceding the question. But how do we best search for the information we need? It’s not ideal, for example, to use the whole conversation as a search query.

We can, however, have a language model rewrite the question to incorporate the context.

A chat log with a question, a generative model then turns it into a contextualized question. — When a question refers to previous information in the conversation, the generation model can paraphrase the question to incorporate that context.

In conversational settings, a question often references context from previous messages. The first step in the pipeline is for a generation model to create the contextualized question out of the last few messages of the dialog.

The prompt used here uses examples like this:


Nick: How much does a watermelon weigh?
Jim: The average watermelon weighs about 20 - 25 pounds
Nick: where do they grow?
- Contextual search query: where do watermelons grow?

Five examples are in the prompt before presenting the current conversation and question to the model.

Search the Web for an Answer

The bot then hits a search API to search Google with the contextualized question as the query. Possible options for search APIs include duckduckgo-search and SerpApi (which is what we use).

A search query and the top three result pages — The bot does a web search and retrieves the top search result pages.

The bot doesn’t present the search results to the user. Rather, it tries to find the actual answer by grabbing the text of the pages, and using semantic search to try to find the answer. That is done by breaking each result page into multiple sections to find the relevant section.

Three search result pages, each one contains multiple sections. — Each search result contains multiple sections or paragraphs. We'll break those up to zoom-in on the relevant passage answering the question.

Identifying the Most Relevant Paragraphs from Search Results

The bot then embeds the sections of the pages and compares their embeddings to the embedding of an example answer created by the model.

For example, we set the system to look at the top three pages in the search result. The system splits the HTML page into paragraphs (e.g., by using BeautifulSoup to parse the HTML code and get the text inside <p> elements). In this example, we’d have eight paragraphs.

Each of the three pages containing a couple of paragraphs. — Think of the each of the three pages containing a couple of paragraphs.

The system can then embed each paragraph — we can then use those embeddings to find the most relevant paragraph for the question.

The paragraphs are replaced with embedding vectors. — Each paragraph is embedded with the Embed endpoint.

The most intuitive way to find the relevant passages is to compare the embedding of the search query and the embeddings of the paragraphs. This process raises a common question for semantic search systems. Are questions and their answers always semantically similar?

So, instead of comparing the similarity of the question with the paragraphs, the system generates a fake answer, and the embedding of that answer is compared against the paragraphs to identify the most relevant paragraph. Aside from using it for this comparison, the system ignores the fake answer.

Comparing the embeddings of the query and each paragraph — We find the most relevant paragraphs by comparing the similarities between the query and each paragraph.

From this comparison, we arrive at the most relevant paragraphs produced by the search. How many paragraphs to use is a design option for the system. We can, for example, use the top two paragraphs.

We select the most relevant paragraphs (based on similarity scores).

In the next step, we use their text to inform the answer.

Ask a Generative Model

Now that we have relevant results, we can add them to a prompt and pose the question to the generative model.

To answer the question, we add the relevant paragraphs as context to a prompt. They are followed by the question we need to answer.

Because we’re aiming for factual information, we set a low temperature value (0.3) to reduce the creativity of the model in generating this answer.

The response generated by the model is posted as an answer alongside the URLs of the web pages that informed it.

A look back at the four steps of answering the question using a search engine.

System Recap

The previous visual is also a good recap of the components of the system and its steps:

Specifying the question and its context using Generate
Searching the web using a search engine API
Identifying the most relevant paragraphs (using Embed for semantic similarity)
Asking the generative model to answer the question given the context provided in the prompt

An approach like this is expected to produce better factual generations than would be the case when asking a generative model directly. It still will not perfectly answer everything you ask it. It’s also advisable to always be mindful of the responsible use of LLMs.

The Design Space

Notice that this system is not one monolithic model that answers a question. It’s a multi-step pipeline that surgically applies Language AI where it can be useful. That vastly expands the design space of possible intelligent conversational agents that one can build using such pipelines.

What Triggers the Bot to Activate or Answer a Question?

Co-search is triggered when anyone adds the question mark emoji to a message in the Discord channel. Other possible triggers one can set to trigger such a system include:

When the bot is mentioned
When a certain Discord action is typed (e.g., /ask )
Having the bot listen to a channel, and detect questions it can answer (step 1: question classification, step 2: semantic search against an FAQ archive)

Customizing Search Parameters to Limit Results to Certain Pages

Web search APIs have all the options that search engines provide. So, a bot can be restricted to only retrieve from Wikipedia, or university pages, or to a certain specific domain (e.g., Harry Potter Fandom portal for a bot specialized in specific lore).

Verify Answers with Another Prompt

Even supplied with relevant information, a model can provide wrong answers to a question. This is another area where a prompt can determine if a generated answer is supported by the context or not. This brings up the tradeoff of latency and performance. How much is a user willing to wait for an answer in different UI scenarios? Discord users might be a little more forgiving than Siri users who are waiting for an immediate answer. The Discord user experience also allows for another bot to be the fact checker.

Conclusion

This has been a demonstration of a Language AI system that shows some of the possibilities of pipelines using generation and embedding in successive steps. It’s an area we’re super excited about, and we would love to see what you build with it. Be sure to drop in to the Cohere co:mmunity on Discord to ask us questions or to show us anything cool you build with such systems.