In the second episode of our series on applied NLP, I sat down with Vincent Warmerdam, a machine learning engineer at Explosion, the company behind spaCy and the data-labeling tool Prodigy. Vincent builds many NLP tools, several of which target the scikit-learn ecosystem, and his recent focus has been improving training data. During our discussion, Vincent walks us through a few of these tools and shows how they work together.
The theme of our discussion is data quality. There are plenty of examples of badly labeled datasets out in the world — some funny, others alarming. Preprocess such data and push it through a model, and you may get good accuracy metrics but bad predictions. Inspired by this problem, Vincent set out to build simple tools that make every step in the process more visible and better support the humans involved.
In this session, Vincent demos several tools that he designed to help improve data quality in NLP use cases.
A library that makes it very easy to use embeddings in scikit-learn. See the GitHub repo.
A library that uses embeddings to enable bulk labeling. See the GitHub repo.
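To make the first idea concrete, here is a minimal sketch of exposing an embedding step as a scikit-learn transformer so it can sit inside a regular Pipeline. This is not the library's actual API — the `ToyEmbedder` below uses made-up surface features as a stand-in for a real embedding model — but it shows the pattern of dropping embeddings into the scikit-learn ecosystem.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

class ToyEmbedder(BaseEstimator, TransformerMixin):
    """Maps each text to a small dense vector (a hypothetical stand-in
    for a real embedding model)."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One row per text: length, word count, vowel count, digit count.
        return np.array(
            [[len(t),
              len(t.split()),
              sum(c in "aeiou" for c in t.lower()),
              sum(c.isdigit() for c in t)]
             for t in X],
            dtype=float,
        )

texts = ["great product", "terrible service", "loved it", "awful, broken"]
labels = [1, 0, 1, 0]

# The embedder slots in like any other scikit-learn step.
pipe = make_pipeline(ToyEmbedder(), LogisticRegression())
pipe.fit(texts, labels)
print(pipe.predict(["great service"]))
```

Because the transformer implements `fit`/`transform`, everything downstream — grid search, cross-validation, model persistence — works unchanged.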
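The second idea — bulk labeling — can be sketched as follows. In practice the tool projects embeddings to 2D and lets a human lasso a cluster in an interactive plot; the toy version below replaces the interactive selection with a hypothetical rectangular region and assigns one label to every point inside it. The data, the `bulk_label` helper, and the rectangle are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two toy clusters standing in for 2D projections of text embeddings.
cluster_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(10, 2))
cluster_b = rng.normal(loc=(3.0, 3.0), scale=0.3, size=(10, 2))
points = np.vstack([cluster_a, cluster_b])
labels = [None] * len(points)

def bulk_label(points, labels, xmin, xmax, ymin, ymax, tag):
    """Assign `tag` to every point inside the rectangle (the 'lasso')."""
    inside = ((points[:, 0] >= xmin) & (points[:, 0] <= xmax) &
              (points[:, 1] >= ymin) & (points[:, 1] <= ymax))
    for i in np.flatnonzero(inside):
        labels[i] = tag
    return labels

# One selection labels a whole cluster at once instead of one example
# at a time.
labels = bulk_label(points, labels, 2.0, 4.0, 2.0, 4.0, "positive")
print(sum(l == "positive" for l in labels))
```

The payoff is speed: because similar texts land near each other in embedding space, one selection can label dozens of examples that would otherwise be annotated individually.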
P.S. Also, join the community conversation on Discord.