Classification is one of the most common use cases of Large Language Models (LLMs). However, it often requires a large quantity of labeled data, which may be difficult to collect. What can we do in that scenario? Can we create a classifier using only a small number of labeled examples?

In comes the problem of few-shot classification. How can we get the most performance out of only a few examples (shots)? In this post, we’ll uncover the qualitative characteristics of both good and bad examples.

Data Quality Matters a Lot

Before we dive deeper into the qualities of good data, we should understand why we’re choosing data as our point of focus. Data is the universal knob we can turn to affect model performance. When you only have a limited number of data points to represent something complex, the examples that you use make a big difference in how the model will perform.

Let’s look at a couple of important datasets that shape this analysis: Stanford Sentiment Treebank v2 (SST2) and Surge AI’s Toxicity Dataset. SST2 is a common academic dataset of movie reviews that are labeled with either positive or negative sentiment. Similarly, Surge AI’s Toxicity Dataset contains text labeled as either toxic or non-toxic. To isolate the effect of each individual example, we train each model with only two examples per class. The figures below show the results of 100 trained models, each trained on a different sample of examples:

How good is a model that’s only trained with two examples? It depends on the examples. Sampling different examples can lead to accuracies ranging from 45% to 72% for SST2, and 55% to 85% for Surge.

By only tweaking the training data, we see accuracy swing by roughly 30 percentage points on both SST2 and Surge. These results suggest that we should think carefully about the data that we feed into a model when we don’t have very much of it.
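To make the experiment above concrete, here is a minimal Python sketch of its shape: repeatedly sample two examples per class, train, and record the range of accuracies. The function names and the word-overlap classifier are assumptions made for illustration. The classifier is a toy stand-in for a real LLM-based model, not the setup that produced the numbers above.

```python
import random
from collections import Counter


def sample_few_shot(pool, shots_per_class, rng):
    """Pick `shots_per_class` random examples for every label in the pool."""
    by_label = {}
    for text, label in pool:
        by_label.setdefault(label, []).append(text)
    return [
        (text, label)
        for label, texts in sorted(by_label.items())
        for text in rng.sample(texts, shots_per_class)
    ]


def train_word_overlap(train_set):
    """Toy stand-in for a real model: keep one bag of words per class."""
    bags = {}
    for text, label in train_set:
        bags.setdefault(label, Counter()).update(text.lower().split())

    def predict(text):
        words = text.lower().split()
        # Score each class by how often it has seen these words.
        # Ties go to the first label in sorted order.
        return max(sorted(bags), key=lambda label: sum(bags[label][w] for w in words))

    return predict


def accuracy_spread(pool, test_set, trials=100, shots_per_class=2, seed=0):
    """Train `trials` models on different samples; return (worst, best) accuracy."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        predict = train_word_overlap(sample_few_shot(pool, shots_per_class, rng))
        correct = sum(predict(text) == label for text, label in test_set)
        scores.append(correct / len(test_set))
    return min(scores), max(scores)
```

Calling `accuracy_spread(pool, test_set)` with lists of `(text, label)` pairs returns the lowest and highest accuracy seen across the trials. The gap between them is the kind of spread discussed above.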

What’s Good Data, and What’s Bad Data?

Since data is so important, let’s examine what makes samples good or bad for few-shot learning. The following recommendations are based on a manual analysis of the examples that produced the results above.

⚠️ Disclaimer! These suggestions are tailored to few-shot learning and aren’t necessarily true when more data is available.

The Good ✅

Good Examples Are: Simple 🚗

Examples that perform well are simple and straightforward in their descriptions. They should be similar to the first thing that comes to mind when imagining a class. Take this negative sentiment example from SST2:

“clumsy cliché”

It’s quick, to the point, and as a result, it helps to train a well-performing classifier.

Good Examples Are: Consistent 🚉

While it might sound helpful to give nuanced data points, in practice it harms performance. If you only had two examples to show someone what makes a positive review, would you pick the following snippet?

“that ‘alabama’ manages to be pleasant in spite of its predictability and occasional slowness is due primarily to the perkiness of witherspoon (who is always a joy to watch, even when her material is not first-rate) ...”

Of course not! It makes a positive review too challenging to decipher. Does “positive” mean the use of parentheses, nuanced language, or kind statements?
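One rough way to act on the two qualities above is to sort candidate examples so the shortest, least punctuated ones come first. The sketch below is only a heuristic. The scoring rule (word count plus a penalty for parentheses, commas, semicolons, and colons) and both function names are assumptions for illustration, not something measured in this analysis.

```python
import re


def complexity_score(text):
    """Crude proxy for reading difficulty: longer and more punctuated scores higher."""
    words = text.split()
    # Parentheses, commas, semicolons, and colons often signal asides or qualifications.
    asides = len(re.findall(r"[(),;:]", text))
    return len(words) + 2 * asides


def simplest_first(candidates):
    """Return candidate examples ordered from simplest to most complex."""
    return sorted(candidates, key=complexity_score)
```

A score like this is useful for shortlisting only. A human should still read the top candidates before using them as shots.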

The Bad 🛑

Bad Examples Contain: Words That Can Be Misconstrued 🤬

Although powerful, these models can sometimes be thrown off by individual words that read one way in isolation but differently in context:

“after one gets the feeling that the typical hollywood disregard for historical truth and realism is at work here”

If a human read this, they would easily predict negative sentiment. However, models might get tripped up by words such as “historical” and “truth” during classification.

Bad Examples Are: Idiomatic or Rely on Knowing the Task 📚

Examples that contain words used non-literally are poor choices for few-shot examples. So are words whose connotation changes with the task. Take “tediously” as an example. When describing the quality of work done by a laborer, it can be a positive word. However, when used to describe a film, it means boring or drab.

Bad Examples Use: Negation 🙅

Negation is a common failure point in most natural language processing (NLP) systems. LLMs understand negation significantly better than their predecessors, but with only a few examples to learn from, we should still refrain from giving examples that contain negation. For example, see this sentence from SST2:

“state property doesn't end up being very inspiring or insightful.”

If you miss the fact that it says “doesn’t” instead of “does,” then you could entirely misinterpret the sentiment.
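A simple screen can flag candidate examples that contain obvious negation before they are used as shots. The sketch below is a minimal heuristic in Python. The cue list and the function name are assumptions for illustration, and the list is far from exhaustive.

```python
import re

# Illustrative, hand-picked cue words; not an exhaustive list.
NEGATION_CUES = {"not", "no", "never", "neither", "nor", "without", "cannot", "hardly", "barely"}


def has_negation(text):
    """Return True if the text contains an obvious negation cue."""
    # Normalize curly apostrophes so "doesn’t" and "doesn't" are treated alike.
    normalized = text.lower().replace("’", "'")
    tokens = re.findall(r"[a-z']+", normalized)
    return any(tok in NEGATION_CUES or tok.endswith("n't") for tok in tokens)
```

Filtering a candidate list with `[t for t in candidates if not has_negation(t)]` keeps only the examples the screen considers negation-free. Subtler forms of negation will slip through.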

Bad Datasets Have: Classes That Are Too Similar 🤖

Choosing good classes or a good class structure can make a big difference. If your classes are: “🙁sad,” “😞disappointed,” “😌content,” and “😄joyous,” the downstream classifier will have to pick up on a lot of nuances. To remedy this, one could enforce a class hierarchy and train sub-classifiers to distinguish between challenging classes.
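The class hierarchy idea can be sketched as a two-stage classifier: a coarse model picks a group, then a sub-classifier separates the similar classes within that group. In the Python sketch below, the grouping into “negative” and “positive” and all function names are hypothetical. The keyword rules are toy stand-ins for trained few-shot classifiers.

```python
def make_hierarchical_classifier(coarse, fine_by_group):
    """Chain a coarse classifier with one sub-classifier per coarse group."""

    def classify(text):
        group = coarse(text)
        fine = fine_by_group.get(group)
        # Fall back to the coarse label when a group has no sub-classifier.
        return fine(text) if fine else group

    return classify


# Toy stand-ins for trained classifiers (keyword rules, for illustration only).
def coarse(text):
    lowered = text.lower()
    return "negative" if any(w in lowered for w in ("sad", "let down", "cry")) else "positive"


def fine_negative(text):
    return "disappointed" if "let down" in text.lower() else "sad"


def fine_positive(text):
    return "joyous" if "!" in text else "content"


classify = make_hierarchical_classifier(
    coarse, {"negative": fine_negative, "positive": fine_positive}
)
```

Each stage now only has to separate two classes, so its few examples can focus on one distinction at a time.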

Conclusion

The data we feed in has a profound effect on performance. By collecting and choosing samples carefully, we can get a healthy boost at test time. The data that tends to train the models well is simple, consistent, and clear. To put these findings to the test, try them out in our classification playground!