Best NLP Papers — October 2022

If you work in NLP, it's important to keep up to date with the latest research. In this post, we look at some of the best NLP papers published in October 2022.


This roundup highlights some interesting NLP papers from October 2022 around language model capabilities.

This article's title and TL;DR have been generated with Cohere.

NLP is evolving at a rapid pace, and every month we discover new capabilities. Large language models, like those built by Cohere, are being applied to use cases that we couldn't have imagined even just a few months ago.

In this roundup, we highlight some interesting NLP papers on language model capabilities that were published in October 2022. Topics include recent work from Cohere For AI, different prompting methods for understanding dialogue and humor, use cases like summarization and essay scoring, and what language models learn beyond language.  

Have fun reading these!

Recent Work from Cohere For AI

Improving Intrinsic Exploration with Language Abstractions

Authors: Jesse Mu, Victor Zhong, Roberta Raileanu, Minqi Jiang, Noah Goodman, Tim Rocktäschel, Edward Grefenstette

Reinforcement learning agents have a hard time learning when rewards are few and far between. To solve this, we often use intrinsic rewards, which act as encouragement for the agent to explore its environment.

However, many intrinsic exploration methods rely on state-based novelty measures, which can end up rewarding low-level exploration instead of more abstract skills. In this paper, the authors explore the use of natural language as a way to highlight relevant abstractions in an environment.

Unlike previous work, they test whether language can improve on existing exploration methods by directly extending (and comparing to) competitive intrinsic exploration baselines. The language-based variants outperform their non-linguistic counterparts by 45-85% across 13 challenging tasks from the MiniGrid and MiniHack environment suites.
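
To make the idea of rewarding abstractions more concrete, here is a minimal sketch of a count-based novelty bonus computed over language descriptions instead of raw states. It is an illustration under stated assumptions, not the authors' implementation: the class name `LanguageNoveltyBonus`, the `1 / sqrt(count)` form of the bonus, and the example messages are all hypothetical.

```python
import math
from collections import Counter


class LanguageNoveltyBonus:
    """Count-based intrinsic reward keyed on language descriptions.

    Hypothetical sketch: the bonus decays as 1 / sqrt(N(message)), where
    N(message) is how many times the description has been seen so far.
    """

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale
        self.counts: Counter = Counter()

    def reward(self, message: str) -> float:
        # Count this visit first so the first sighting earns the full bonus.
        self.counts[message] += 1
        return self.scale / math.sqrt(self.counts[message])


bonus = LanguageNoveltyBonus()
# Many distinct low-level states can share one description, so a repeated
# description earns a shrinking bonus while a new one earns the full bonus.
messages = ["open the door", "open the door", "pick up the key"]
rewards = [bonus.reward(m) for m in messages]
```

Because the counter is keyed on the description, wandering through many slightly different states that all read "open the door" is not rewarded as novel, which is the intuition behind using language to pick out relevant abstractions.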

Improving Policy Learning via Language Dynamics Distillation

Authors: Victor Zhong, Jesse Mu, Luke Zettlemoyer, Edward Grefenstette, Tim Rocktäschel

Recent work has shown that providing language descriptions can help an agent learn a policy in a new environment. However, in environments where the language descriptions are complex, it can be difficult to learn how to ground the language in what the agent observes. This is because rewards are sparse and often delayed, so the agent gets few learning signals.

In this paper, the authors propose a method called Language Dynamics Distillation (LDD) to address this problem. With LDD, they first train a model to predict environment dynamics based on demonstrations that include language descriptions. Then, they fine-tune these language-aware pretrained representations using reinforcement learning (RL). This allows the model to learn not only how to maximize expected reward, but also how to retain knowledge about how language relates to environment dynamics.

They evaluated LDD on a benchmark of five tasks with language descriptions that present different challenges in generalizing to unseen environments: NetHack, ALFWorld, RTFM, Messenger, and Touchdown. Across all of these tasks, LDD outperformed tabula-rasa RL, VAE pretraining, and other methods that learn from demonstrations, either with or without language descriptions.

Large language models are not zero-shot communicators

Authors: Laura Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, Edward Grefenstette

It's important to be able to understand language in context in order to communicate effectively. Humans are able to do this by using their beliefs and prior knowledge about the world. For example, if someone asks "Did you leave fingerprints?" and we respond, "I wore gloves," they will understand that this means "No."

The authors wanted to see if large language models (LLMs) could also make this type of inference, known as an implicature. They designed a simple task and evaluated different LLMs on it. They found that most LLMs performed close to random on this task. Models that were adapted to be "aligned with human intent" did better, but there was still a significant gap between their performance and human performance.

This research provides a starting point for further investigation into how LLMs interpret language in context. It can also help guide the development of more pragmatic and useful models of human discourse.
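
One way to picture such an evaluation is the sketch below, which checks whether a model scores the correct reading of a response higher than the incorrect one. It is a rough sketch rather than the paper's exact protocol: the prompt wording, the second example, and the `toy_score` stand-in (used here in place of a real model's log-likelihood) are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# (utterance, response, implied meaning). The first triple is the example
# from the text; the second is made up for illustration.
EXAMPLES: List[Tuple[str, str, str]] = [
    ("Did you leave fingerprints?", "I wore gloves.", "no"),
    ("Are you coming to the party?", "I already bought a gift.", "yes"),
]


def build_prompt(utterance: str, response: str, meaning: str) -> str:
    # Hypothetical template; the exact wording is an assumption.
    return f'Question: "{utterance}" Response: "{response}" Meaning: {meaning}.'


def implicature_accuracy(score: Callable[[str], float]) -> float:
    """Fraction of examples where the correct reading scores higher."""
    correct = 0
    for utterance, response, meaning in EXAMPLES:
        other = "no" if meaning == "yes" else "yes"
        good = score(build_prompt(utterance, response, meaning))
        bad = score(build_prompt(utterance, response, other))
        correct += int(good > bad)
    return correct / len(EXAMPLES)


def toy_score(text: str) -> float:
    # Stand-in for a language model's log-likelihood. It always prefers
    # "yes", so it ignores the context entirely.
    return 1.0 if text.endswith("yes.") else 0.0


accuracy = implicature_accuracy(toy_score)
```

A scorer that ignores context, like `toy_score`, lands at chance on a balanced set of yes/no examples, which is the kind of near-random behavior the authors report for most models.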

MTEB: Massive Text Embedding Benchmark

Authors: Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers

There's a problem with the way text embeddings are usually evaluated: they are tested on a small set of datasets from a single task. This makes it hard to know whether the embeddings will work well for other tasks, like clustering or reranking. To solve this problem, the authors created the Massive Text Embedding Benchmark (MTEB).

MTEB spans 8 embedding tasks covering a total of 56 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, they were able to establish the most comprehensive benchmark of text embeddings to date. They found that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks.

Other Exciting Papers

Weakly Supervised Data Augmentation Through Prompting for Dialogue Understanding

Authors: Maximillian Chen, Alexandros Papangelis, Chenyang Tao, Andy Rosenbaum, Seokhwan Kim, Yang Liu, Zhou Yu, Dilek Hakkani-Tur

Dialogue understanding tasks typically need a lot of annotated data to achieve good performance, which makes them difficult when little data is available.

In this paper, the authors prompt large, pre-trained language models to generate additional training data, then iteratively apply weakly-supervised filters to improve augmentation quality. They put their methods to the test on emotion and act classification tasks in the DailyDialog dataset, and the intent classification task in the Facebook Multilingual Task-Oriented Dialogue dataset.

Results showed that models fine-tuned on their augmented data mixed with a small amount of ground truth data outperform existing state-of-the-art models on both datasets. In fact, for DailyDialog specifically, using only 10% of the ground truth data, they were still able to outperform the current state-of-the-art model, which uses 100% of the data.

This joke is [MASK]: Recognizing Humor and Offense with Prompting

Authors: Junze Li, Mengjie Zhao, Yubo Xie, Antonis Maronikolakis, Pearl Pu, Hinrich Schütze

Humor is subjective: what one person finds funny, another may not. Ancient Greek philosophers observed that people laugh during comedies as a way of mocking or belittling others. The resulting superiority theory of humor suggests that laughter is a way of showing superiority over other people, by making fun of either their physical defects or their shortcomings.

However, this theory also suggests that some humor recognition datasets may include content that is offensive to certain groups of people. This is undesirable because a machine learning-based NLP system, such as a virtual assistant, should never respond to a user query with offensive content. Therefore, it is crucial to identify and mitigate offensive content when modeling humor computationally.

In this paper, the authors found that prompting performs just as well as fine-tuning when there are numerous annotations available. However, prompting enables much better performance in low-resource humor recognition, which is when there are fewer annotations available. The authors also looked at the relationship between humor and offense by applying influence functions to prompting. They found that models could rely on offense to determine humor during transfer.
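
The prompting setup can be pictured as a cloze task: the input is wrapped in a template ending in a masked slot, and the label is read off from which word the model prefers in that slot. The sketch below assumes a template based on the paper's title; the label words in `VERBALIZER` and the `toy_mask_scores` function (a stand-in for a masked language model) are hypothetical.

```python
from typing import Callable, Dict

# Template modeled on the paper's title; the exact wording is an assumption.
TEMPLATE = "{text} This joke is [MASK]."

# Hypothetical verbalizer: candidate words for the masked slot -> class label.
VERBALIZER: Dict[str, str] = {"funny": "humorous", "boring": "not humorous"}


def classify(text: str, mask_scores: Callable[[str], Dict[str, float]]) -> str:
    """Pick the label whose verbalizer word scores highest at [MASK]."""
    prompt = TEMPLATE.format(text=text)
    scores = mask_scores(prompt)
    best_word = max(VERBALIZER, key=lambda w: scores.get(w, float("-inf")))
    return VERBALIZER[best_word]


def toy_mask_scores(prompt: str) -> Dict[str, float]:
    # Stand-in for a masked language model: it treats any text containing
    # "walks into a bar" as a joke and everything else as not funny.
    if "walks into a bar" in prompt:
        return {"funny": 0.9, "boring": 0.1}
    return {"funny": 0.2, "boring": 0.8}


label_joke = classify("A horse walks into a bar.", toy_mask_scores)
label_plain = classify("The meeting starts at noon.", toy_mask_scores)
```

Since the task is phrased the way the model saw text during pretraining, only a handful of labeled examples may be needed to adapt it, which is one intuition for why prompting helps in the low-resource setting.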

Mutual Information Alleviates Hallucinations in Abstractive Summarization

Authors: Liam van der Poel, Ryan Cotterell, Clara Meister

Although there has been some progress in the quality of language generated from abstractive summarization models, these models still tend to hallucinate and output content that is not supported by the source document. A number of methods have tried to fix this problem but with limited success.

In this paper, the authors identify a simple criterion under which models are significantly more likely to assign higher probability to hallucinated content during generation: high model uncertainty. This finding offers a potential explanation for hallucinations: when models are uncertain about a continuation, they default to favoring text with high marginal probability, i.e., high-frequency occurrences in the training set.

The authors propose a decoding strategy that switches to optimizing for pointwise mutual information of the source and target token when the model exhibits uncertainty. Experiments on the XSum dataset show that this method decreases the probability of hallucinated tokens while maintaining the ROUGE and BERTScore results of top-performing decoding strategies.
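
The switching rule described above can be sketched for a single decoding step as follows. This is a simplified illustration, not the authors' code: it assumes the marginal probabilities come from a separate language model that does not see the source, and the `threshold` and `lam` parameters, the function names, and the toy distributions are all made up for the example.

```python
import math
from typing import Dict


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def pick_token(
    cond: Dict[str, float],      # p(token | source, prefix)
    marginal: Dict[str, float],  # p(token | prefix), no source (assumed)
    threshold: float,            # hypothetical entropy threshold
    lam: float = 1.0,            # hypothetical weight on the marginal term
) -> str:
    uncertain = entropy(cond) >= threshold

    def score(token: str) -> float:
        s = math.log(cond[token])
        if uncertain:
            # Pointwise mutual information: penalize tokens that are likely
            # regardless of what the source document says.
            s -= lam * math.log(marginal[token])
        return s

    return max(cond, key=score)


# Toy distributions (made up). "London" is slightly more likely given the
# source but is also very frequent on its own; "Leeds" is mostly supported
# by the source.
cond = {"London": 0.40, "Leeds": 0.35, "Paris": 0.25}
marginal = {"London": 0.70, "Leeds": 0.05, "Paris": 0.25}

# The entropy of `cond` is about 1.08 nats.
confident_choice = pick_token(cond, marginal, threshold=2.0)  # plain likelihood
uncertain_choice = pick_token(cond, marginal, threshold=0.5)  # PMI scoring
```

With the high threshold the step behaves like ordinary likelihood-based decoding and picks "London". With the low threshold the model counts as uncertain, the frequent-but-unsupported token is penalized, and "Leeds" wins.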

Automated Essay Scoring Using Transformers

Author: Kshitij Gupta

Automated essay scoring has been a long-standing focus in the natural language processing (NLP) community because of its potential applications in both education and business. Recent advances in large, pre-trained models and data augmentation have led to significant progress in this area, but many challenges remain.

This work demonstrates the effectiveness of transformer models and data augmentation for automated essay grading across a variety of topics. The findings show that transformer models are a promising approach for automated essay scoring, and they suggest avenues for further research.

What Do Large Language Models Learn Beyond Language?

Authors: Avinash Madasu, Shashank Srivastava

Large language models play an important role in natural language processing. These models are trained on large amounts of text, and they are known to acquire rich linguistic knowledge from this training.

In this paper, the authors consider whether pretraining on text also gives these models helpful "inductive biases" for non-linguistic reasoning. They test this by training models on a set of 19 diverse non-linguistic tasks involving quantitative computations, recognizing regular expressions, and reasoning over strings.

The authors found that pre-trained models significantly outperform comparable non-pre-trained neural models. This holds even when the non-pre-trained models are trained with fewer parameters to account for model regularization effects.

They further explore the effect of text domain on LLMs by pretraining models using text from different domains and provenances. The experiments surprisingly reveal that the positive effects of pretraining persist even when pre-training on multilingual text or computer code, and even on text generated from synthetic languages. This suggests an unexplored deep connection between pretraining and the inductive learning abilities of language models.

Final Thoughts

As we've seen, language models are evolving rapidly. Large language models are being used for a vast array of use cases beyond natural language generation (NLG). We still have a lot to learn about how these models work and how they should be built. If you're working with large volumes of text, you may benefit greatly from incorporating large language models into your workflow. It may take some experimentation and tweaking to get the model to do exactly what you want, but these papers should give you an idea of how others go about it.

Is there a paper we should include in our next issue? Let us know on our Discord community. To get started with Cohere, try out our playground and start building.