In the second episode in our series on applied NLP, I sat down with Vincent Warmerdam, a machine learning engineer at Explosion, the company behind spaCy and the data labeling tool Prodigy. Vincent builds a lot of NLP tools, many of which target the scikit-learn ecosystem, and his recent focus has been on improving training data. During our discussion, Vincent walks us through a few of these tools and shows us how they work together.

View the full episode (also embedded below). Feel free to post questions or comments in the thread on this episode in the Cohere Discord channel.

The theme of our discussion is data quality. There are plenty of examples of badly labeled data sets out in the world, some funny and others alarming. When you pre-process such data and push it through a model, you may get good accuracy metrics but bad predictions. Inspired by this problem, Vincent set out to build simple tools that make every step in the process more visible and better support the humans involved.

In this session, Vincent demos several tools that he designed to help improve data quality in NLP use cases.

Human-learn

A toolkit to build human-based scikit-learn components. See the overview and GitHub repo.
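To make "human-based component" concrete, here is a minimal sketch of the idea: a hand-written rule wrapped in a scikit-learn-style `fit`/`predict` interface. It does not use the human-learn API; `RuleClassifier`, `mentions_refund`, and the example texts are hypothetical names and data invented for illustration.

```python
class RuleClassifier:
    """Wraps a hand-written rule in a scikit-learn-style fit/predict interface."""

    def __init__(self, rule, **rule_kwargs):
        self.rule = rule
        self.rule_kwargs = rule_kwargs

    def fit(self, X, y=None):
        # Nothing is learned here: the "model" is the human-written rule.
        return self

    def predict(self, X):
        # Apply the rule to every input independently.
        return [self.rule(row, **self.rule_kwargs) for row in X]


def mentions_refund(text, keywords=("refund", "money back")):
    """Hypothetical rule: 1 if the message looks like a refund request, else 0."""
    lowered = text.lower()
    return int(any(keyword in lowered for keyword in keywords))


# Made-up example messages.
texts = [
    "I want a refund for my order",
    "Where is my package?",
    "Can I get my money back?",
]

clf = RuleClassifier(mentions_refund).fit(texts)
predictions = clf.predict(texts)
print(predictions)
```

Because the rule exposes the same interface as a trained model, it can serve as a readable baseline or be compared directly against a learned classifier.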

Doubtlab

A toolkit to help find doubtful labels in data. See the overview and GitHub repo.
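One way to picture "doubtful labels" is as rows where a model's output gives a reason to re-check the assigned label. The sketch below illustrates that idea in plain Python and is not Doubtlab's API; the function name `find_doubtful`, the `min_confidence` threshold, and the probabilities are all assumptions made up for the example.

```python
def find_doubtful(labels, probas, min_confidence=0.6):
    """Return (row index, reasons) pairs worth re-checking, most doubtful first.

    labels: the assigned class index for each row
    probas: for each row, a list of predicted class probabilities
    """
    doubtful = []
    for i, (label, row) in enumerate(zip(labels, probas)):
        predicted = max(range(len(row)), key=row.__getitem__)
        reasons = []
        if predicted != label:
            reasons.append("model disagrees with label")
        if max(row) < min_confidence:
            reasons.append("model is unsure")
        if reasons:
            doubtful.append((i, reasons))
    # Rows with more reasons attached come first; ties keep their original order.
    return sorted(doubtful, key=lambda item: -len(item[1]))


# Made-up labels and model probabilities for four rows and two classes.
labels = [0, 1, 1, 1]
probas = [
    [0.95, 0.05],  # confident and agrees with the label
    [0.80, 0.20],  # confident but disagrees with the label
    [0.45, 0.55],  # agrees with the label but is unsure
    [0.55, 0.45],  # disagrees and is unsure
]

flagged = find_doubtful(labels, probas)
for index, reasons in flagged:
    print(index, reasons)
```

Sorting by the number of reasons puts the rows most likely to be mislabeled at the top of the review queue, so a human can spend their time where it matters most.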

Embetter

A library that makes it very easy to use embeddings in scikit-learn. See the GitHub repo.

Bulk

A library that uses embeddings to enable bulk labeling. See the GitHub repo.
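The core idea behind embedding-based bulk labeling is that texts with nearby embeddings tend to belong together, so a whole neighborhood can be labeled in one action. The sketch below illustrates that idea only: it is not Bulk's interface, and `bulk_label`, the radius, and the 2D coordinates are hypothetical values standing in for real embeddings.

```python
import math


def bulk_label(embeddings, seed_index, radius, label, labels=None):
    """Assign `label` to every point within `radius` of the seed point."""
    labels = dict(labels or {})  # copy so the caller's labels are not mutated
    seed = embeddings[seed_index]
    for i, vector in enumerate(embeddings):
        if math.dist(seed, vector) <= radius:
            labels[i] = label
    return labels


# Made-up 2D coordinates standing in for reduced text embeddings.
embeddings = [
    (0.10, 0.20),
    (0.15, 0.25),
    (0.90, 0.80),
    (0.12, 0.18),
    (0.85, 0.90),
]

# Label everything close to point 0 in a single step.
labels = bulk_label(embeddings, seed_index=0, radius=0.1, label="refund")
print(labels)
```

Points outside the radius stay unlabeled, so the same function can be called again with a different seed and label to work through the remaining clusters.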

To go deeper into these tools, and other concepts around data quality, watch the video and join the conversation on Discord. Stay tuned for more episodes in our Talking Language AI series!
