Top Natural Language Processing (NLP) Papers of January 2023
Get ready for cutting-edge NLP research! Our top NLP papers for January 2023 cover language models, text generation, and summarization. Discover the latest advancements in language processing with Cohere's selection of the best research.

If you work in NLP, it's important to keep up to date with the latest research. In this post, we look at some of the best papers on NLP for January 2023!
TL;DR:
- For all you NLP enthusiasts out there, here is a list of awesome papers from January 2023 highlighted by C4AI’s research community.
This article’s title and TL;DR have been generated with Cohere.
NLP is an ever-evolving field that is constantly pushing the boundaries of what's possible. For anyone working with this technology, it's crucial to stay up to date with the latest breakthroughs and advancements. In this post, we've curated a selection of the top NLP papers for January 2023, covering a wide range of topics, including the most recent developments in language models, text generation, and summarization.
Our team at Cohere has scoured the web and consulted with our research community to bring you the most current and relevant NLP research. We're thrilled about the progress that NLP has made in recent years and can't wait to see what the future holds. The advancements in this field are enabling us to do more with language than ever before, and this list of top NLP papers will keep you informed and ready to take advantage of these developments.
If you're passionate about NLP and want to join our research community, we would love to have you. At Cohere, our goal is to make NLP technology more accessible to developers and organizations, and we're always looking for new community members to help us achieve that. Don't hesitate to apply and be a part of this exciting journey.
Top Papers of January 2023 Highlighted by Our Research Discord Community
These papers were highlighted by members of the C4AI research Discord community. A big thank you to domenicrosati#3567, bun#9632, MajorMelancholy#1836, KILLSHOT#0287, bhavnicksm#8949, and the rest of the Cohere For AI NLP research community for participating.
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
Authors: Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, et al.
This paper introduces OPT-IML Bench, a large benchmark for instruction meta-learning (IML) covering 2,000 NLP tasks consolidated from the task categories of 8 existing benchmarks. The authors evaluate the performance trade-offs of different decisions made during the instruction-tuning process, such as the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training with specialized datasets for reasoning and dialogue, and different fine-tuning objectives. They use this benchmark to train two instruction-tuned versions of OPT, called OPT-IML 30B and 175B, which are shown to have better generalization abilities than OPT on four different evaluation benchmarks. The authors also release the OPT-IML Bench evaluation framework and the trained models to the public.
Hungry Hungry Hippos: Towards Language Modeling with State Space Models
Authors: Tri Dao, Daniel Y. Fu, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré
This paper discusses the use of state space models (SSMs) in language modeling and compares their performance to attention-based models. The authors find that SSMs struggle with recalling earlier tokens in a sequence and comparing tokens across a sequence. They propose a new SSM layer, H3, which addresses these issues and matches attention on synthetic languages and comes close to the performance of Transformers on OpenWebText. They also propose FlashConv, a method for improving the efficiency of training SSMs on modern hardware, which allows them to scale to longer sequences. Overall, the paper aims to bridge the expressivity gap between SSMs and attention models, and improve the efficiency of SSMs for language modeling.
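To make the contrast with attention concrete, here is a minimal, illustrative sketch of the basic linear state space recurrence that SSM layers build on. This is not the H3 layer or FlashConv from the paper, just the underlying recurrence with made-up parameter shapes:

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Minimal linear SSM: x_t = A x_{t-1} + B u_t, y_t = C x_t.

    u: (seq_len, d_in) input sequence; A: (d_state, d_state);
    B: (d_state, d_in); C: (d_out, d_state). Returns y: (seq_len, d_out).
    The hidden state x is fixed-size regardless of sequence length, which
    keeps inference cheap but makes exact recall of earlier tokens hard.
    """
    x = np.zeros(A.shape[0])
    outputs = []
    for u_t in u:
        x = A @ x + B @ u_t
        outputs.append(C @ x)
    return np.stack(outputs)

# Toy usage with random, purely illustrative parameters.
rng = np.random.default_rng(0)
d_state, d_in, d_out, seq_len = 8, 4, 4, 16
A = 0.9 * np.eye(d_state)               # stable, slowly decaying state transition
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
u = rng.normal(size=(seq_len, d_in))
print(ssm_scan(u, A, B, C).shape)        # (16, 4)
```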
A Watermark for Large Language Models
Authors: John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein
This paper proposes a framework for watermarking proprietary language models to mitigate potential harms. The watermark is embedded into generated text in a way that is invisible to humans but can be detected algorithmically. The proposed method has a negligible impact on text quality and can be detected using an open-source algorithm without access to the model API or parameters. The watermark works by selecting a randomized set of whitelist tokens and promoting their use during sampling. The authors also propose a statistical test for detecting the watermark and provide an information-theoretic framework for analyzing its sensitivity. They test the watermark using a multi-billion-parameter model and discuss its robustness and security.
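The core mechanism is simple enough to sketch. Below is a simplified, illustrative version of a "soft" whitelist watermark, assuming a hash of the previous token seeds the whitelist selection; the function names, hashing scheme, and default parameters are ours, not the paper's exact implementation:

```python
import hashlib
import numpy as np

def whitelist_for(prev_token_id, vocab_size, gamma=0.5):
    """Pseudo-randomly pick a fraction gamma of the vocabulary as the
    whitelist, seeded by the previous token (simplified illustration)."""
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    ids = rng.permutation(vocab_size)
    return set(ids[: int(gamma * vocab_size)].tolist())

def watermarked_sample(logits, prev_token_id, delta=2.0, gamma=0.5):
    """Promote whitelist tokens by adding delta to their logits before sampling."""
    vocab_size = logits.shape[0]
    boosted = logits.copy()
    for t in whitelist_for(prev_token_id, vocab_size, gamma):
        boosted[t] += delta
    probs = np.exp(boosted - boosted.max())
    probs /= probs.sum()
    return int(np.random.default_rng().choice(vocab_size, p=probs))

def detect(token_ids, vocab_size, gamma=0.5):
    """One-proportion z-test on the count of whitelist tokens; a large z-score
    suggests the text was generated with the watermark."""
    hits = sum(
        1 for prev, cur in zip(token_ids[:-1], token_ids[1:])
        if cur in whitelist_for(prev, vocab_size, gamma)
    )
    n = len(token_ids) - 1
    return (hits - gamma * n) / np.sqrt(n * gamma * (1 - gamma))
```

In this setup, the whitelist fraction and the size of the logit boost are hyperparameters that trade off detectability against generation quality.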
CiT: Curation in Training for Effective Vision-Language Data
Authors: Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
This paper presents a method called Curation in Training (CiT) that aims to make large vision-language models more efficient to train, in order to be more accessible to a wider range of institutions. CiT automatically selects high-quality training data to speed up contrastive image-text training, and does not require an offline data filtering pipeline, which allows for a broader range of data sources. The algorithm is composed of two loops: an outer loop that curates the training data and an inner loop that consumes the curated training data. The text encoder connects the two loops. CiT uses metadata for tasks of interest, such as class names, and a large pool of image-text pairs to select relevant training data by measuring the similarity of their text embeddings and embeddings of the metadata. The experiments showed that CiT can significantly speed up training, especially when the raw data size is large.
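As a rough illustration of the curation (outer) loop, here is a minimal sketch that keeps only the image-text pairs whose caption embedding is close to some task metadata embedding. The `text_encoder` callable and the threshold are placeholders, not the paper's exact procedure:

```python
import numpy as np

def curate_batch(candidate_texts, metadata_texts, text_encoder, threshold=0.3):
    """Simplified outer-loop curation: keep image-text pairs whose caption
    embedding is similar to at least one metadata embedding.

    text_encoder: callable mapping list[str] -> (n, d) array of embeddings.
    Returns indices of candidates to feed to the training (inner) loop.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    cand = normalize(text_encoder(candidate_texts))   # (n, d)
    meta = normalize(text_encoder(metadata_texts))    # (m, d)
    sims = cand @ meta.T                              # cosine similarities
    best = sims.max(axis=1)                           # best-matching metadata entry
    return [i for i, s in enumerate(best) if s >= threshold]

# Usage sketch: metadata could be class names for the downstream task.
# kept = curate_batch(captions, ["golden retriever", "tabby cat"], encode_fn)
# train_step(images[kept], captions[kept])
```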
Teaching Small Language Models to Reason
Authors: Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, Aliaksei Severyn
This paper explores a method for transferring reasoning capabilities from large language models to smaller models through knowledge distillation. The authors show that finetuning a smaller "student" model on the output of a larger "teacher" model (using a technique called chain of thought prompting) can improve performance on a range of reasoning tasks, such as arithmetic, commonsense, and symbolic reasoning. The experiments in the paper demonstrate that this approach can significantly improve task performance, for example, increasing the accuracy of a smaller model on a dataset called GSM8K from 8.11% to 21.99% when finetuned on PaLM-540B generated chains of thought.
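A hypothetical sketch of the distillation recipe looks like this: prompt the teacher for step-by-step rationales, keep only those that reach the gold answer, and use the results as fine-tuning targets for the student. The helper names and answer-checking heuristic below are ours, not the paper's exact pipeline:

```python
def build_distillation_set(problems, teacher_generate):
    """problems: list of dicts with 'question' and 'answer'.
    teacher_generate: callable(prompt) -> generated text (e.g. an LLM API)."""
    examples = []
    for p in problems:
        prompt = f"Q: {p['question']}\nA: Let's think step by step."
        rationale = teacher_generate(prompt)
        lines = rationale.strip().splitlines()
        # Keep only rationales whose final line reaches the gold answer, so the
        # student is not fine-tuned on incorrect reasoning chains.
        if lines and lines[-1].strip().endswith(str(p["answer"])):
            examples.append({
                "input": f"Q: {p['question']}\nA:",
                "target": rationale,
            })
    return examples

# The resulting (input, target) pairs can then be used for ordinary
# sequence-to-sequence fine-tuning of the smaller student model.
```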
Large Language Models Are Reasoning Teachers
Authors: Namgyu Ho, Laura Schmid, Se-Young Yun
This paper explores a method for transferring reasoning capabilities from large language models to smaller models through fine-tuning. The authors propose "Fine-tune-CoT," a method that leverages the capabilities of very large language models (such as GPT-3) to generate reasoning samples and teach smaller models. They evaluate their method on publicly available language models across a wide range of complex tasks and model sizes and find that Fine-tune-CoT enables substantial reasoning capability in small models, whereas previous prompt-based baselines exhibit near-random performance. The student models can even outperform the teacher on some tasks while reducing model size requirements by several orders of magnitude. They conduct extensive ablation and sample studies to understand the reasoning capabilities of student models, and identify several important nuances that have been overlooked in concurrent work on CoT fine-tuning.
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
Authors: Elias Frantar, Dan Alistarh
This paper presents a new pruning method called SparseGPT that can reduce the number of weights in large-scale generative pre-trained transformer (GPT) models by at least 50% in one shot, without any retraining and with minimal loss of accuracy. The authors demonstrate this by applying SparseGPT to the largest open-source models, OPT-175B and BLOOM-176B, and achieving 60% sparsity with little increase in perplexity. The method is also compatible with weight quantization approaches and generalizes to semi-structured sparsity patterns such as 2:4 and 4:8.
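SparseGPT itself relies on an approximate second-order reconstruction to decide which weights to remove and how to update the remaining ones, which is beyond a blog snippet. For intuition, here is a sketch of the much simpler baseline such methods improve on: one-shot unstructured magnitude pruning to a target sparsity:

```python
import numpy as np

def magnitude_prune(weight, sparsity=0.5):
    """One-shot unstructured pruning: zero out the smallest-magnitude weights.
    This is the simple magnitude baseline, not SparseGPT's Hessian-based update."""
    flat = np.abs(weight).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weight.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    pruned = weight.copy()
    pruned[np.abs(weight) <= threshold] = 0.0
    return pruned

W = np.random.randn(1024, 1024)
W_sparse = magnitude_prune(W, sparsity=0.5)
print((W_sparse == 0).mean())   # ~0.5
```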
Does compressing activations help model parallel training?
Authors: Song Bian, Dacheng Li, Hongyi Wang, Eric P. Xing, Shivaram Venkataraman
This paper examines the effectiveness of different compression methods for model parallelism in large-scale Transformer models. The authors conduct an empirical study using three types of compression algorithms: pruning-based, learning-based, and quantization-based, on a popular Transformer training framework. They evaluate these methods across over 160 settings and 8 popular datasets, taking into account different hyperparameters, hardware, and both fine-tuning and pre-training stages. The paper provides insight into the differences between model parallelism and data parallelism and offers recommendations for the future development of compression algorithms for model parallelism.
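As a concrete example of the quantization-based family of methods in the study, the sketch below compresses activations to int8 before they cross a model-parallel boundary and reconstructs them on the other side. This is a generic illustration, not the specific algorithms benchmarked in the paper:

```python
import numpy as np

def quantize_activations(x):
    """Per-tensor symmetric int8 quantization of activations before sending
    them across a model-parallel boundary."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_activations(q, scale):
    """Reconstruct approximate activations on the receiving device."""
    return q.astype(np.float32) * scale

# Sketch of use at a model-parallel boundary:
# q, s = quantize_activations(layer_output)      # send 1 byte per value instead of 4
# next_stage_input = dequantize_activations(q, s)
```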
Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model
Authors: Yeskendir Koishekenov, Vassilina Nikoulina, Alexandre Berard
This paper presents a pruning method for a massively multilingual machine translation model called NLLB-200. The method allows the removal of up to 80% of experts with minimal loss of translation quality, reducing the inference cost of running the model and making it possible to run it on a single 32GB GPU. The authors also show that the pruning method can identify language-specific experts and prune experts that are not relevant for a given language pair. The paper aims to address the curse of multilinguality in massively multilingual models by reducing their size without compromising performance.
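A simplified sketch of the idea: run the model on data for one language pair, count how often the router selects each expert, and keep only the most-used experts. The selection criterion below is a placeholder for illustration, not the paper's exact pruning metric:

```python
import numpy as np

def select_experts_to_keep(gate_counts, keep_fraction=0.2):
    """Given per-expert routing counts collected while translating a specific
    language pair, keep only the most-used experts.

    gate_counts: (num_experts,) array of how often each expert was selected.
    Returns the indices of experts to keep; the rest can be dropped from the
    checkpoint to shrink memory use at inference time.
    """
    num_keep = max(1, int(keep_fraction * gate_counts.size))
    return np.argsort(gate_counts)[::-1][:num_keep]

# Example: keep the top 20% of 128 experts for, say, French -> English.
counts = np.random.poisson(5.0, size=128)
kept = select_experts_to_keep(counts, keep_fraction=0.2)
print(len(kept))   # 25, about 20% of 128
```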
Guiding the Release of Safer E2E Conversational AI through Value Sensitive Design
Authors: A. Stevie Bergman, Gavin Abercrombie, Shannon Spruit, Dirk Hovy, Emily Dinan, Y-Lan Boureau, Verena Rieser
This work presents a framework for practitioners to decide on the release of end-to-end neural conversational agents. The authors are motivated by the recent progress in the field of conversational AI and the potential harms that might arise from releasing models trained on large datasets from the Internet. They survey recent and related work to highlight the tension between values, potential positive impact, and potential harms. They propose a framework based on the principle of value-sensitive design to help practitioners weigh the pros and cons and make ethical decisions about the release of these models.
Final Thoughts
Are you ready to revolutionize the way you work with large volumes of text? Look no further than incorporating large language models into your workflow. This list of cutting-edge research on NLP serves as your guide to unlocking the full potential of this powerful technology. But don't just take our word for it—experiment and tweak to find the perfect model for your specific needs. And the journey doesn't have to be a solitary one—join our Discord community to share your discoveries and collaborate with like-minded individuals. Ready to dive in? Try out our NLP API on the Cohere playground and start building the future of natural language processing today.