In this talk, Dr. Rachael Tatman introduced us to Cohere LLMs and walked us through the reasons for using them for data augmentation. She explained several strategies and the problems we may face while augmenting our data, and she offered advice and recommendations for performing data augmentation with a focus on data diversity, demonstrating a few examples of how the Cohere LLM gives our models a warm start.
This article’s title and TL;DR have been generated with Cohere.
Get started with text generation
Discover some tips and techniques for creating chatbots with LLMs. In this video, Rachael Tatman, a language technology educator, offers some advice and ideas for developing chatbots with LLMs. She details the what, why, and how of data augmentation while illustrating how we may validate the created data and add diversity to our data. One of her goals is for everyone interested in NLP to be able to build reliable, useful language technology tools that genuinely improve people’s lives.
Advice and Recommendations for Using LLMs:
Dr. Rachael Tatman started her talk by strongly advising against serving raw generated text to users, for both UX and security reasons, because of its unpredictability. She also mentions that most adversarial attacks on LLMs require access to the raw text output, so if we don't expose the raw output, we avoid a whole class of adversarial attacks. She recommends human-in-the-loop data augmentation for a warmer start when training or fine-tuning a chatbot.
When Is Data Augmentation Helpful?
Data augmentation is beneficial when we don’t have representative data for all our targeted user personas. She also states that data augmentation is fruitful when we have representative data but are missing examples for specific intents; this happens, for instance, when a new topic suddenly becomes relevant. In addition, data augmentation is crucial when working with research data, which is typically very clean and does not represent user-generated text.
Why Do We Use LLMs Over Other Strategies?
We can avoid repetition and unintended errors by using LLMs over other techniques like templatic, rule-based data augmentation, which struggles to generate data with varied syntax. She also says that LLMs are more reliable than translation-based approaches, which are usually full of errors. In addition, using LLMs is faster, cheaper, and more dependable than creating all new data from scratch.
LLMs also reproduce noisy, user-generated text, so our model will undoubtedly benefit from training on such a diverse range of data.
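To make the repetition problem concrete, here is a minimal sketch of a templatic, rule-based augmenter (the templates and slot values are illustrative, not from the talk). Every output it produces shares one of a handful of fixed syntactic frames, which is exactly the lack of syntactic variety described above:

```python
from itertools import product

# Illustrative templates and slot values; every generated utterance
# reuses one of these two fixed syntactic frames.
TEMPLATES = ["play some {genre} by {artist}", "put on {genre} music by {artist}"]
SLOTS = {"genre": ["jazz", "rock"], "artist": ["Miles Davis", "Queen"]}

def templatic_augment(templates, slots):
    """Fill every template with every combination of slot values."""
    utterances = []
    for template in templates:
        for genre, artist in product(slots["genre"], slots["artist"]):
            utterances.append(template.format(genre=genre, artist=artist))
    return utterances

examples = templatic_augment(TEMPLATES, SLOTS)
```

With two templates and four slot combinations this yields eight utterances, but all of them are syntactically near-identical — an LLM, by contrast, can rephrase the same intent in many different sentence shapes.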
How Did Cohere Collect the Data for Training the Model?
Cohere’s Generation Large Language Model is trained on the Google Books dataset, CommonCrawl, and other text from the internet scraped by the Cohere infrastructure team [see: Generation Model Card]. The top ten domains scraped by the infrastructure team at Cohere include: wordpress.com, medium.com, stackexchange.com, tumblr.com, elsevier.com, genius.com, bbc.co.uk, libsyn.com, yahoo.com, nytimes.com. Based on this, Cohere LLMs have used various data to train the model, including noisy data.
How to Build Chatbots Faster With LLMs?
Now, depending on your specific case, you may have variations in what you actually choose to do, but here are some recommendations.
It’s recommended to start with the data you have and use prompt engineering to generate new data based on it. Dr. Tatman walks us through an example she created using the SLURP dataset (Bastianelli et al. 2020), which she chose because it is very clean and fairly formal.
Example: We have some training data; how can we use it to generate more data? The following image is a screenshot of Cohere's Playground. In this example, we give it the intent "Play Music" along with a handful of example utterances. When we click the Generate button, it generates relevant new text.
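The same few-shot setup can be sketched outside the Playground. The helper below and its example utterances are illustrative (not the exact prompt format from the talk), and the commented-out call follows the Cohere Python SDK's generate endpoint:

```python
def build_intent_prompt(intent, examples):
    """Build a few-shot prompt: an intent label followed by example utterances."""
    lines = [f"Intent: {intent}"]
    lines += [f"Example: {utt}" for utt in examples]
    lines.append("Example:")  # the model completes this line with a new utterance
    return "\n".join(lines)

prompt = build_intent_prompt(
    "Play Music",
    ["play some jazz for me", "put on my workout playlist", "I want to hear the Beatles"],
)

# Calling the model requires an API key, so the call is only sketched here:
# import cohere
# co = cohere.Client("YOUR_API_KEY")
# response = co.generate(prompt=prompt, max_tokens=30)
# print(response.generations[0].text)
```

Treat this as a sketch of the technique: the model sees the intent plus a few examples and completes the trailing "Example:" line with a new, relevant utterance.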
The following image demonstrates another example of using the Cohere Playground to generate text. Here, we feed it an intent, for instance, setting an alarm or a reminder.
How to Add Diversity to Our Data?
So far, we have seen the ways of increasing the data we have using data augmentation techniques. But, the generated data was similar to the existing data. What if we want to add data diversity? Dr. Tatman divides the approaches to adding diversity into two parts.
Prompts Based on Emotions or User Personas (More Risky)
Prompts Based on Emotions:
- Prompts based on emotions have several problems. She gives the following examples to explain it.
- For instance, when asking a chatbot to play music angrily, she found that the intent of playing the music was changed: the chatbot generated text suggesting turning the music off, precisely the opposite of our intent. So, she says that emotional context and intent are not IID (Independent and Identically Distributed). However, she says this approach may be suitable for generating a dataset of negative emotions.
Prompt Based on Specific User Persona
- When using prompts based on a specific user persona, she found the generated text relies mainly on stereotypes. She adds that people are unlikely to introduce themselves by demographics unless they want to invoke some stereotype associated with them. However, she mentions that one possible exception is generating multilingual data; even then, we should proceed cautiously with this approach.
- The following two screenshots, generated using the Cohere Playground, demonstrate this approach.
Prompts Based on Website Demographics (Less Risky)
Prompt by Referencing Specific Websites
- In this talk, she offers a less risky approach: creating prompts using the demographics of social media sites as a proxy for personas. She also adds that this approach captures the effect of topic as well.
- The following example shows the generated prompts when someone on StackOverflow asks a bot to play some music.
- The following example shows the generated prompts when someone on Facebook asks a bot to play some music.
- The following example shows the generated prompts when someone on YouTube asks a bot to play some music.
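This prompting pattern can be sketched as follows; the wording of the prompt template is illustrative, not the exact phrasing used in the slides:

```python
# Sites used as a persona proxy, as in the examples above.
SITES = ["StackOverflow", "Facebook", "YouTube"]

def site_prompt(site, request="asks a bot to play some music"):
    """Build a prompt that conditions generation on a website's audience."""
    return f"A user on {site} {request}. They say:"

prompts = [site_prompt(site) for site in SITES]
```

Feeding each of these prompts to the model should yield utterances flavored by that site's typical audience and register, without naming any demographic attribute directly.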
Some General Notes and Advice for Using LLMs:
The above approaches won’t work very well if our intents are too specific or unique. They will work best if your target users are a large chunk of existing social media users. Also, note that adding data diversity in this way does not represent your actual users; it is basically a stopgap that gives us a warmer start.
How to Validate the Generated Data?
It’s recommended to do hand validation in the first pass, and she suggests keeping a human in the loop for better performance. Other than that, she recommends using embedding visualizations to ensure real and generated data mix across the distribution. You can also use embedding visualizations to figure out whether you’re happy with your new clusters.
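The embedding check can be sketched as follows. The random vectors below are placeholders for real utterance embeddings (e.g. from an embed endpoint), and the 2D projection is a PCA done with plain NumPy so the sketch stays self-contained:

```python
import numpy as np

# Placeholder embeddings standing in for real and LLM-generated utterances.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(50, 8))
generated = rng.normal(0.2, 1.0, size=(50, 8))

# 2-component PCA via SVD of the centered data.
X = np.vstack([real, generated])
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
points_2d = X_centered @ Vt[:2].T

labels = ["real"] * len(real) + ["generated"] * len(generated)
# Plot points_2d colored by labels (e.g. with matplotlib) and check that
# generated points land inside the real data's clusters rather than off
# in their own region of the plot.
```

If the generated points form a separate blob, the augmented data is drifting away from what real users say and deserves another hand-validation pass.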
To summarize this talk, LLMs can help us with data augmentation by volume and diversity until we get some actual data. It will give us a warm start to make our system more usable. She adds that we can prompt with our existing and newly generated data. Finally, she recommends hand-verifying the generated data in the first pass to ensure it is up to the standards and quality we are looking for.