Topic Modeling with BERTopic - Talking Language AI Ep#1
In the first episode of the Talking Language AI series, I spoke with Maarten Grootendorst, author and maintainer of the BERTopic open-source package (over 3,000 stars on GitHub). BERTopic is used to explore collections of text to spot trends and identify the topics they contain, an NLP task called topic modeling.
View the full episode here. It's also embedded below. Feel free to post questions or comments in this thread in the Cohere Discord.
Topic modeling and BERTopic overview
Maarten started by giving an overview of BERTopic and of topic modeling itself, using an awesome visual to explain the task:
A visual way to describe BERTopic is as a pipeline of the following steps:
Maarten discussed three central pillars of BERTopic:
1- Modularity. To demonstrate modularity, Maarten showed the following awesome visual of the building blocks in the BERTopic pipeline, along with alternative options for constructing it.
BERTopic allows for a number of ways to visualize the created topics. This includes the topic word scores plot:
Another commonly used visual is the Documents and Topics visual:
Different scenarios call for different topic modeling approaches; to address that, Maarten says BERTopic is built to be flexible enough for many different use cases.
Maarten also went through a demo of how to use BERTopic to explore a dataset of research papers. I then proceeded to ask Maarten a few questions about his experience building NLP tools like BERTopic and others.
Questions about creating NLP software and topic modeling
After Maarten's overview of BERTopic, I asked him the following questions. The links should lead you to that section of the video.
Q: How do you think about evaluating topic modeling tasks?
Q: BERTopic assigns a single topic to each document. Is that a limitation, or is it good enough for many use cases?
Q: How differently should long texts and short texts be treated when using BERTopic?
Q: How do you think about API design philosophy for tasks like this?
Q: You have built a library called KeyBERT. What does KeyBERT do?
Q: Another package you've built is PolyFuzz. What is PolyFuzz?
Q: When using BERTopic in a language other than English, what should BERTopic users change in the pipeline?
Q: When using the default BERTopic pipeline, HDBSCAN clustering often produces a large outlier cluster (labeled -1). How do you suggest users deal with that?
Q: How does BERTopic compare to LDA and Top2Vec?
Q: What happens after topic modeling? Is it only used to generate reports? Have you seen it being used to create online systems?
Q: What do you think of using GPT language models in the topic modeling pipeline?
Q: By creating and maintaining BERTopic, you have created value for a lot of people. How can people contribute back?