The Best of NLP: February 2023's Top NLP Papers

Stay ahead of the game: Get a sneak peek of the coolest natural language processing (NLP) research of February 2023! Our handpicked selection of the best NLP papers will keep you up-to-date on the latest advancements in language models, text generation, and summarization.
TL;DR:
- For all you NLP enthusiasts out there, here is a list of awesome papers from February 2023 highlighted by C4AI’s research community.
This article’s title and TL;DR have been generated with Cohere.
As NLP enthusiasts, we know that this technology is constantly pushing the boundaries of what's possible. That's why it's crucial to stay up-to-date with the latest breakthroughs and advancements. In this post, we've curated a selection of the top NLP papers for February 2023, covering a wide range of topics, including the most recent developments in language models, text generation, and summarization.
Our team at Cohere has done the heavy lifting by scouring the web and consulting with our research community to bring you the most current and relevant information on NLP research. We're thrilled about the progress that NLP has made in recent years, and we can't wait to see what the future holds. The advancements in this field are enabling us to do more with language than ever before, and this list of top NLP papers will keep you informed and prepared to take advantage of these developments.
At Cohere, our goal is to make NLP technology more accessible to developers and organizations. We believe that the democratization of NLP is key to unlocking its full potential. That's why we are always looking for new community members to join us on this exciting journey. If you're passionate about NLP and want to be part of a community that is driving the future of this technology, we would love to have you. Don't hesitate to apply and be a part of this exciting journey.

Top NLP Papers of February 2023 Highlighted by Our Research Discord Community
These papers were highlighted by members of the C4AI research Discord community. A big thank you to Ujan#3046, bhavnicksm#8949, EIFY#4102, cvf#1006, MajorMelancholy#1836, cakiki#9145, hails#6601, Mike-RsrchRabbit#9843, and the rest of the Cohere For AI NLP research community for participating.
Toolformer: Language Models Can Teach Themselves to Use Tools
Authors: Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom
Let's talk about language models (LMs), which are pretty cool because they can solve new tasks with just a few examples or textual instructions. However, as amazing as they are, LMs sometimes struggle with basic functionality, like doing simple math or finding facts, where smaller models excel. But what if NLP folks could have the best of both worlds? Enter Toolformer!
Toolformer is a model that can teach itself to use external tools via simple APIs. It's trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. Best of all, it learns this in a self-supervised way, requiring nothing more than a handful of demonstrations for each API.
Toolformer incorporates a range of tools, including a calculator, a Q&A system, two different search engines, a translation system, and a calendar. And the best part is that it achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities. So, with Toolformer, we get the best of both worlds, making life a whole lot easier for us NLP, machine learning, AI, and software engineering enthusiasts.
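To make the self-supervised recipe a bit more concrete, here's a minimal Python sketch of the filtering step: a candidate API call is kept only if conditioning on the call and its result lowers the model's loss on the following tokens. The `lm_loss` stub and the exact call format are our own illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of Toolformer-style self-supervised filtering.
# `lm_loss` stands in for a real language model's loss on a continuation;
# here it's just a stub so the example runs end to end.

def lm_loss(prefix: str, continuation: str) -> float:
    # Stub: a real implementation would score `continuation` under an LM
    # conditioned on `prefix`. We fake a loss that drops when the prefix
    # already contains an API result ("→").
    return float(len(continuation)) / (1 + prefix.count("→"))

def keep_api_call(text: str, position: int, call: str, result: str,
                  tau: float = 0.1) -> bool:
    """Keep a candidate API call only if conditioning on the call *and* its
    result lowers the loss on the rest of the text by at least tau."""
    prefix, rest = text[:position], text[position:]
    loss_plain = lm_loss(prefix, rest)                            # no call at all
    loss_call_only = lm_loss(prefix + f"[{call}]", rest)          # call, no result
    loss_with_result = lm_loss(prefix + f"[{call} → {result}]", rest)
    return min(loss_plain, loss_call_only) - loss_with_result >= tau

text = "Out of 1400 participants, 400 passed the test, i.e. 29%."
print(keep_api_call(text, text.index("29%"), "Calculator(400 / 1400)", "0.29"))
```

In the paper, calls that survive this kind of filter are spliced back into the corpus, and the model is finetuned on the augmented text, which is how tool use becomes a learned behavior.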
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
Authors: Max Ryabinin, Tim Dettmers, Michael Diskin, Alexander Borzunov
In this paper, the authors tackle the challenge of training large deep learning models with billions of parameters, which is known to require specialized HPC clusters that come with a hefty price tag. To work around this limitation, they explore alternative setups for training these large models, such as using cheap "preemptible" instances or pooling resources from multiple regions.
The paper then analyzes the performance of existing model-parallel algorithms under these conditions and identifies configurations where training larger models becomes less communication-intensive. Building on this analysis, the authors introduce SWARM parallelism, a novel model-parallel training algorithm specifically designed for poorly connected, heterogeneous, and unreliable devices.
SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure, which is a significant improvement over existing large-scale training approaches. The authors empirically validate their findings and compare SWARM parallelism with existing methods.
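To give a feel for the idea, here's a toy Python sketch (not the authors' code) of SWARM-style stochastic routing: each pipeline stage is served by several interchangeable peers, a microbatch goes to a random live peer, and traffic is rerouted when a peer is preempted.

```python
# Illustrative toy of SWARM-style stochastic routing between pipeline stages.
import random

class Peer:
    def __init__(self, name: str, fail_rate: float = 0.0):
        self.name, self.fail_rate, self.alive = name, fail_rate, True

    def forward(self, activations: list) -> list:
        if random.random() < self.fail_rate:      # simulate preemption
            self.alive = False
            raise ConnectionError(f"{self.name} was preempted")
        return activations + [self.name]          # fake "computation"

def route_through_stage(peers: list, activations: list) -> list:
    """Send a microbatch through one stage, rerouting around dead peers."""
    candidates = [p for p in peers if p.alive]
    while candidates:
        peer = random.choice(candidates)          # temporary randomized link
        try:
            return peer.forward(activations)
        except ConnectionError:
            candidates.remove(peer)               # rebalance around the failure
    raise RuntimeError("every peer in this stage is down")

stages = [[Peer("A0"), Peer("A1", fail_rate=0.5)],
          [Peer("B0", fail_rate=0.5), Peer("B1")]]
activations = ["input"]
for stage in stages:
    activations = route_through_stage(stage, activations)
print(activations)  # e.g. ['input', 'A0', 'B1']
```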
To further demonstrate their approach, they combine these insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200 Mb/s of network bandwidth. These promising results show that SWARM parallelism has the potential to revolutionize the way large models are trained, making training more accessible and cost-effective for researchers and practitioners alike.
Pretraining Language Models with Human Preferences
Authors: Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Samuel R. Bowman, Ethan Perez
In this paper, the authors delve into the exciting world of language models (LMs) and how they can be trained to generate text that aligns with human preferences. LMs are conventionally pretrained to imitate internet text, which can lead to some undesirable outcomes. But what if LMs could be taught to generate text that's not only coherent and informative but also aligned with human preferences?
To explore this, the authors benchmarked five objectives for pretraining LMs with human feedback across three tasks. They studied how these objectives affect the balance between the alignment and capabilities of pretrained LMs. And what they found was a Pareto-optimal approach: conditional training.
Conditional training involves teaching the LM to learn the distribution over tokens conditional on their human preference scores, given by a reward model. And the results were impressive! Conditional training reduced the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt.
Moreover, conditional training maintained the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback resulted in much better preference satisfaction than standard LM pretraining, followed by finetuning with feedback.
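Here's a minimal sketch of what conditional training can look like in practice, assuming a reward model and two control tokens; the token names and threshold below are illustrative, not the paper's exact setup.

```python
# A minimal sketch of conditional training with reward-derived control tokens.

GOOD, BAD = "<|good|>", "<|bad|>"

def reward_model(text: str) -> float:
    # Stub standing in for a learned reward model; returns a score in [0, 1].
    return 0.0 if "offensive" in text else 1.0

def annotate(corpus: list, threshold: float = 0.5) -> list:
    """Prefix each document with a control token chosen from its reward."""
    return [(GOOD if reward_model(doc) >= threshold else BAD) + " " + doc
            for doc in corpus]

corpus = ["a helpful, polite reply", "an offensive rant"]
for example in annotate(corpus):
    print(example)
# The LM is then pretrained on the annotated text, learning the token
# distribution conditional on the tag; at inference time, conditioning on
# GOOD steers generation toward preferred text.
```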
Overall, the results suggest that the field should move beyond pure imitation learning when pretraining LMs and incorporate human preferences from the start of training. This is a huge step forward in ensuring that language models generate text that aligns with human preferences, and it's exciting to see where this technology will go in the future!
Multimodal Chain-of-Thought Reasoning in Language Models
Authors: Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola
In this paper, the authors introduce a groundbreaking new approach for large language models (LLMs) that combines text and vision to achieve even better reasoning performance. The new model, called Multimodal-CoT, builds on the chain-of-thought (CoT) approach to generate intermediate reasoning chains as the rationale to infer the answer. The big difference is that this time, the model incorporates both language and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference.
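As a rough illustration of the two-stage framework (with stubs standing in for the paper's fine-tuned vision-and-language models), the flow looks something like this:

```python
# A rough sketch of the two-stage Multimodal-CoT flow.

def stage1_rationale(question: str, image_features: list) -> str:
    # Stage 1: generate an intermediate reasoning chain from text + vision.
    return f"rationale for {question!r} using {len(image_features)} visual features"

def stage2_answer(question: str, rationale: str, image_features: list) -> str:
    # Stage 2: infer the final answer, conditioned on the generated rationale.
    return f"answer derived from: {rationale}"

def multimodal_cot(question: str, image_features: list) -> str:
    rationale = stage1_rationale(question, image_features)
    return stage2_answer(question, rationale, image_features)

print(multimodal_cot("Which force moves the sled?", [0.1, 0.7, 0.2]))
```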
The Multimodal-CoT model is designed to leverage better-generated rationales that are based on multimodal information, improving the accuracy of answer inference. The results speak for themselves: the model with just under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by a whopping 16 percentage points (75.17% to 91.68% accuracy) on the ScienceQA benchmark and even surpasses human performance.
The code for Multimodal-CoT has been publicly released by the authors at Amazon, so if you're interested in exploring this cutting-edge technology, it's just a click away. With this new model, the authors have taken an important step forward in the development of large language models and multimodal reasoning, paving the way for even more exciting advances in the field of AI and machine learning.
Poisoning Web-Scale Training Datasets is Practical
Authors: Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr
In this paper, the authors dive deep into the dangers of dataset poisoning attacks on deep learning models. These attacks inject malicious examples into a model's training data, which can have serious consequences for its behavior. The authors introduce two new, practical attacks capable of poisoning ten popular datasets.
The first attack, split-view poisoning, takes advantage of the mutable nature of internet content: the data an annotator saw when the dataset was curated can differ from the data that subsequent clients download, so an attacker who gains control of a referenced URL can serve malicious examples that go unnoticed. This attack is particularly insidious because it exploits invalid trust assumptions. Shockingly, the authors found they could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD.
The second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content, like Wikipedia. The attacker only needs a time-limited window, such as the moments just before a page is snapshotted, to inject malicious examples into the dataset.
In light of these attacks, the authors notified the maintainers of each affected dataset and recommended several low-overhead defenses. These defenses will help mitigate the risks of dataset poisoning and protect deep learning models from malicious attacks.
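One such low-overhead defense boils down to integrity checking: if a dataset distributes a cryptographic hash alongside each URL, clients can detect split-view poisoning whenever the downloaded content no longer matches what the curator originally saw. Here's a minimal sketch (the entry format and data are illustrative):

```python
# Minimal sketch of hash-based integrity checking for a URL-based dataset.
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Hash recorded at curation time, while the curator could still see the
# original content behind the URL.
original = b"original image bytes seen by the curator"
dataset_entry = {"url": "https://example.com/cat.jpg", "sha256": sha256(original)}

def verify_download(entry: dict, downloaded: bytes) -> bool:
    """Reject the example if the content changed since curation."""
    return sha256(downloaded) == entry["sha256"]

print(verify_download(dataset_entry, original))                 # True: keep
print(verify_download(dataset_entry, b"poisoned replacement"))  # False: drop
```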
Symbolic Discovery of Optimization Algorithms
Authors: Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le
In this paper, the authors introduce a novel approach to algorithm discovery by framing it as program search. They apply this method to discover optimization algorithms for deep neural network training and demonstrate how it can bridge the generalization gap between proxy and target tasks.
Their approach utilizes efficient search techniques to explore an infinite and sparse program space. To simplify the process, they also introduce program selection and simplification strategies. The result of their method is the discovery of a new optimization algorithm, Lion (EvoLved Sign Momentum).
Compared to widely used optimizers such as Adam and Adafactor, Lion is more memory-efficient since it only keeps track of the momentum. It also differs from adaptive optimizers in that its update, computed through the sign operation, has the same magnitude for every parameter.
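For the curious, here's a small NumPy sketch of the published Lion update rule; it's a toy re-implementation for illustration, not the authors' released code, and the test problem below is our own.

```python
# A NumPy sketch of the Lion update rule (betas follow the paper's defaults).
import numpy as np

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion step: the sign of an interpolated momentum drives the update."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)  # same magnitude per weight
    param = param - lr * (update + wd * param)        # decoupled weight decay
    m = beta2 * m + (1 - beta2) * grad                # momentum is the only state
    return param, m

# Toy problem: minimize f(x) = x^2 from x = 3.
x, m = np.array([3.0]), np.zeros(1)
for _ in range(1000):
    x, m = lion_step(x, 2 * x, m, lr=0.01)
print(x)  # close to 0 (oscillating within one step size)
```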
The authors test Lion on various models and tasks and show that it outperforms Adam in several areas, including image classification and diffusion models. Lion also typically requires a smaller learning rate due to the larger norm of the update produced by the sign function.
However, the authors also acknowledge the limitations of Lion and identify scenarios where its improvements are small or not statistically significant. They make the implementation of Lion publicly available for others to use and build upon.
The Wisdom of Hindsight Makes Language Models Better Instruction Followers
Authors: Tianjun Zhang, Fangchen Liu, Justin Wong, Pieter Abbeel, Joseph E. Gonzalez
In this paper, the authors delve into the complex world of reinforcement learning and its application in fine-tuning language models. Specifically, they explore the “Reinforcement Learning with Human Feedback (RLHF)” algorithm, which has demonstrated remarkable success in aligning GPT series models with instructions through human feedback.
However, the authors point out that the underlying RL algorithm is not a walk in the park and requires an additional training pipeline for reward and value networks. So, they propose an alternative approach: relabeling the original feedback and training the model for better alignment in a supervised manner. This algorithm doesn't require any additional parameters except for the original language model and maximally reuses the pretraining pipeline.
To accomplish this, the authors formulate the instruction alignment problem for language models as a goal-reaching problem in decision-making. They present a novel algorithm called Hindsight Instruction Relabeling (HIR), which aligns language models with instructions based on feedback that has been relabeled with hindsight.
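Here's a toy sketch of the hindsight relabeling idea: when the model's output doesn't satisfy the original instruction, find an instruction the output does satisfy and train on that pair instead. The instruction set, checker functions, and rollouts below are all illustrative assumptions, far simpler than the paper's setup.

```python
# A toy sketch of hindsight relabeling for instruction alignment.

INSTRUCTIONS = {
    "give a positive review": lambda out: "great" in out,
    "give a negative review": lambda out: "terrible" in out,
}

def relabel(instruction: str, output: str):
    """Return an instruction the output actually satisfies, if any."""
    if INSTRUCTIONS[instruction](output):
        return instruction                  # already aligned, keep as-is
    for alt, satisfied in INSTRUCTIONS.items():
        if satisfied(output):
            return alt                      # hindsight: relabel the goal
    return None                             # unusable sample, drop it

rollouts = [("give a positive review", "this movie was terrible"),
            ("give a positive review", "this movie was great")]

training_pairs = [(relabel(i, o), o) for i, o in rollouts if relabel(i, o)]
print(training_pairs)
# [('give a negative review', 'this movie was terrible'),
#  ('give a positive review', 'this movie was great')]
# These relabeled pairs then feed ordinary supervised fine-tuning.
```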
The resulting two-stage algorithm sheds light on a family of reward-free approaches that utilize the relabeled feedback as a substitute for reward. The authors evaluate the performance of HIR on 12 challenging BigBench reasoning tasks and show that it outperforms the baseline algorithms and is comparable to, or even surpasses, supervised fine-tuning.
In conclusion, the paper offers an intriguing new approach to fine-tuning language models that has the potential to reduce the complexity of the reinforcement learning algorithm and streamline the training process.
Hyena Hierarchy: Towards Larger Convolutional Language Models
Authors: Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré
In this paper, the authors introduce us to Hyena, a subquadratic replacement for the attention operator in Transformers. While attention has been the core building block of Transformers, it suffers from quadratic cost in sequence length, which makes it difficult to access large amounts of context. To bridge this gap, the authors propose Hyena, which is constructed by interleaving implicitly parametrized long convolutions and data-controlled gating.
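To make this concrete, here's a simplified NumPy sketch of a Hyena-style operator: an FFT-based long convolution (cost O(L log L) rather than attention's O(L^2)) interleaved with elementwise, data-controlled gating. The real Hyena learns its filters implicitly with a small network and uses learned input projections; this toy version uses fixed random filters and scalar projections.

```python
# A simplified sketch of a Hyena-style operator: long convolution + gating.
import numpy as np

def fft_conv(signal: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Circular convolution of two length-L sequences via the FFT."""
    L = len(signal)
    return np.fft.irfft(np.fft.rfft(signal) * np.fft.rfft(kernel), n=L)

def hyena_operator(u: np.ndarray, n_layers: int = 2, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    L = len(u)
    v = u * rng.standard_normal()            # "value" projection of the input
    for _ in range(n_layers):
        h = rng.standard_normal(L) / L       # stand-in for an implicit filter
        gate = u * rng.standard_normal()     # data-controlled gate from input
        v = gate * fft_conv(v, h)            # gate ⊙ (filter * value)
    return v

u = np.sin(np.linspace(0, 4 * np.pi, 1024))
print(hyena_operator(u).shape)  # (1024,)
```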
What's exciting about Hyena is that it can significantly improve accuracy in recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens. In fact, it improves accuracy by more than 50 points over operators relying on state spaces and other implicit and explicit methods. Not only that, but Hyena can match attention-based models, setting a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText-103 and The Pile).
In addition to its accuracy, Hyena can reduce training compute required at sequence length 2K by 20%. Its operators are also twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K. This means that not only is Hyena powerful, but it's also efficient. Overall, Hyena presents a promising new approach to subquadratic methods in deep learning that could have wide-ranging implications for the field.
Crawling the Internal Knowledge-Base of Language Models
Authors: Roi Cohen, Mor Geva, Jonathan Berant, Amir Globerson
Language models are becoming increasingly sophisticated, and in the process they absorb a significant body of factual knowledge from the vast amount of text they are trained on. This wealth of knowledge can then be used to enhance downstream NLP tasks. But how can this knowledge be represented in an interpretable way? That's where the authors' proposal comes in.
The authors present a novel approach to extract a knowledge-graph of facts from a given language model. They start by "crawling" the internal knowledge-base of the language model and expanding a knowledge-graph around a seed entity. The crawling procedure is broken down into sub-tasks, which are achieved through specially designed prompts that ensure high precision and recall rates.
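Conceptually, the crawling loop looks something like the sketch below, where `ask_lm` stands in for the paper's carefully designed prompts and just returns canned answers here so the example runs:

```python
# A hypothetical sketch of crawling an LM's internal knowledge-base.
from collections import deque

def ask_lm(prompt: str) -> list:
    fake_answers = {
        "relations of Alan Turing": ["field", "born in"],
        "Alan Turing, field": ["computer science"],
        "Alan Turing, born in": ["London"],
        "relations of London": ["country"],
        "London, country": ["United Kingdom"],
    }
    return fake_answers.get(prompt, [])

def crawl(seed: str, max_entities: int = 10) -> list:
    """Expand a knowledge-graph breadth-first around a seed entity."""
    graph, frontier, seen = [], deque([seed]), {seed}
    while frontier and len(seen) <= max_entities:
        entity = frontier.popleft()
        for relation in ask_lm(f"relations of {entity}"):
            for obj in ask_lm(f"{entity}, {relation}"):
                graph.append((entity, relation, obj))
                if obj not in seen:
                    seen.add(obj)
                    frontier.append(obj)
    return graph

for triple in crawl("Alan Turing"):
    print(triple)
# ('Alan Turing', 'field', 'computer science')
# ('Alan Turing', 'born in', 'London')
# ('London', 'country', 'United Kingdom')
```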
The authors evaluated their approach on graphs crawled from dozens of seed entities and found that it yielded graphs with high precision, ranging from 82% to 92%. The procedure also extracted a reasonable number of facts per entity, which is important for practical applications. This work is an important step towards building more interpretable language models that can provide a structured representation of the knowledge they acquire from text.
DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature
Authors: Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, Chelsea Finn
In this paper, the authors tackle the problem of detecting machine-generated text, which has become increasingly difficult with the advancement of large language models (LLMs). These models are so good at generating text that it's becoming harder to tell whether a piece of writing is human-written or machine-generated. For instance, students could use these models to complete their writing assignments, making it harder for instructors to assess their work.
To solve this issue, the authors propose a new approach called DetectGPT, which uses the curvature of the model's log probability function to identify whether a given passage was generated by the LLM in question. This new method doesn't require a separate classifier or a dataset of real or generated passages, and it doesn't explicitly watermark the generated text.
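The test itself is simple to sketch: perturb the passage a number of times, and flag it as machine-generated when its log-probability sits noticeably above the average log-probability of the perturbations (i.e., it lies near a local maximum of log p). Both stubs below are illustrative; the paper perturbs text with a mask-filling model such as T5 and scores with the LM under suspicion.

```python
# A minimal sketch of the DetectGPT probability-curvature test.
import random

def log_prob(text: str) -> float:
    # Stub: pretend the LM strongly favors the bigram "the text".
    return 4.0 * text.count("the text") - 0.01 * len(text)

def perturb(text: str) -> str:
    # Stub perturbation: swap two random words.
    words = text.split()
    i, j = random.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

def detectgpt_score(text: str, n_perturbations: int = 100) -> float:
    """A large positive score suggests the passage is machine-generated."""
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    return log_prob(text) - sum(perturbed) / n_perturbations

sample = "the model wrote the text and then the text was revised"
print(detectgpt_score(sample))  # positive: the sample sits at a likelihood peak
```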
To test the effectiveness of DetectGPT, the authors use it to detect fake news articles generated by the massive 20B parameter GPT-NeoX model. The results are impressive, with DetectGPT significantly outperforming existing zero-shot methods for detecting model samples. The strongest zero-shot baseline achieved a 0.81 AUROC, while DetectGPT achieved an impressive 0.95 AUROC.
If you're interested in this exciting new approach to detecting machine-generated text, check out the code, data, and other project information.
Final Thoughts
Are you ready to revolutionize the way you work with large volumes of text? Look no further than incorporating large language models into your workflow. This list of cutting-edge research on NLP serves as your guide to unlocking the full potential of this powerful technology. But don't just take our word for it—experiment and tweak to find the perfect model for your specific needs. And the journey doesn't have to be a solitary one—join our Discord community to share your discoveries and collaborate with like-minded individuals. Ready to dive in? Try out our NLP API on the Cohere playground and start building the future of natural language processing today.