
Topic Modeling with BERTopic - Talking Language AI Ep#1


In the first episode of the Talking Language AI series, I spoke with Maarten Grootendorst, author and maintainer of the BERTopic open-source package (over 3,000 stars on GitHub). BERTopic is used to explore collections of text to spot trends and identify the topics they contain, an NLP task called topic modeling.

View the full episode here. It's also embedded below. Feel free to post questions or comments in this thread in the Cohere Discord.

Topic modeling and BERTopic overview

Maarten started by giving an overview of BERTopic and of topic modeling in general. In this overview, he used this visual to explain what topic modeling is:

A collection of documents grouped into three topics: one with keywords "christian" and "faith", another with keywords "space" and "launch", and a third with keywords "key" and "encryption"
When you have a collection of documents, topic modeling can group them into topics and show the main keywords in each topic

A visual way to describe BERTopic is as a pipeline of the following steps:

By default, BERTopic's topic modeling is done through a pipeline of SBERT embeddings, dimensionality reduction with UMAP, clustering with HDBSCAN, bag-of-words extraction, then topic representation with the c-TF-IDF and MMR methods
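To make the pipeline concrete, here is a minimal sketch of composing those same steps explicitly in Python. The dataset, model names, and parameter values are illustrative assumptions, not required defaults:

```python
# A sketch of a BERTopic pipeline with every step spelled out.
# Assumes bertopic, sentence-transformers, umap-learn, and hdbscan are installed.
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Example corpus: the 20 newsgroups dataset, a common topic modeling demo.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")       # 1. SBERT embeddings
umap_model = UMAP(n_neighbors=15, n_components=5)               # 2. dimensionality reduction
hdbscan_model = HDBSCAN(min_cluster_size=10)                    # 3. clustering
vectorizer_model = CountVectorizer(stop_words="english")        # 4. bag-of-words
representation_model = MaximalMarginalRelevance(diversity=0.3)  # 5. MMR on top of c-TF-IDF

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
)
topics, probs = topic_model.fit_transform(docs)
```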

Maarten discussed three central pillars of BERTopic:

1- Modularity. To demonstrate modularity, Maarten showed the following visual of the building blocks in the BERTopic pipeline and the alternative components available at each step.

BERTopic's design allows each component to be swapped for another based on the situation.
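As a quick sketch of that modularity, here is what swapping two components might look like, with PCA in place of UMAP and k-means in place of HDBSCAN (the components and parameter values here are illustrative choices, not recommendations):

```python
# Sketch: swapping pipeline components while keeping the rest of BERTopic intact.
from bertopic import BERTopic
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

topic_model = BERTopic(
    umap_model=PCA(n_components=5),       # dimensionality reduction without UMAP
    hdbscan_model=KMeans(n_clusters=20),  # clustering without HDBSCAN
)
```

Note that with k-means every document is assigned to a cluster, so this swap also removes the outlier topic that HDBSCAN produces.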

2- Visualization

BERTopic allows for a number of ways to visualize the created topics. This includes the topic word scores plot:

After processing the text archive, BERTopic shows the list of topics and the relevant keywords for each topic.
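In code, this plot comes from the model's built-in visualization methods; a minimal sketch, assuming the fitted `topic_model` from the pipeline above (the `top_n_topics` value is an arbitrary choice):

```python
# Sketch: topic word score bar charts for the eight largest topics.
fig = topic_model.visualize_barchart(top_n_topics=8)
fig.show()  # a Plotly figure; fig.write_html("barchart.html") saves it
```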

Another commonly used visual is the Documents and Topics visual:

Each dot is a document. Documents close together are similar in meaning and topic (because their position is based on text embeddings). Choosing the topic on the right highlights where that topic is on the plot.
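A sketch of producing this view, again assuming the fitted `topic_model` and `docs` from above:

```python
# Sketch: plot each document as a point in 2D, colored by its topic.
fig = topic_model.visualize_documents(docs)
fig.show()
```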

3- Variations

Topic modeling calls for different approaches in different scenarios. To address that, Maarten says, BERTopic is built to be flexible enough to cover many of them.

Variations of how BERTopic can be used.

Maarten also went through a demo of how to use BERTopic to explore a dataset of research papers. I then asked Maarten a few questions about his experience building NLP tools like BERTopic.
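For a flavor of what that exploration looks like in code, here is a sketch of inspecting a fitted model (the exact topics and counts depend entirely on the dataset):

```python
# Sketch: inspecting the topics a fitted model found.
print(topic_model.get_topic_info().head())  # one row per topic: id, size, name
print(topic_model.get_topic(0))             # top keywords and scores for topic 0
```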

Questions about creating NLP software and topic modeling

After Maarten's overview of BERTopic, I asked him the following questions. Each link leads to the relevant section of the video.

Q: How do you think about evaluating topic modeling tasks?

Q: BERTopic assigns a single topic to each document. Is that a limitation, or is it good enough for many use cases?

Q: How differently should long texts and short texts be treated when using BERTopic?

Q: How do you think about API design philosophy for tasks like this?

Q: You have built a library called KeyBERT. What does KeyBERT do?
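For context, KeyBERT extracts keywords from individual documents using BERT-style embeddings. A minimal sketch, using the package's default model:

```python
# Sketch: keyword extraction with KeyBERT.
from keybert import KeyBERT

kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
    "BERTopic is a topic modeling technique that leverages transformer embeddings."
)
print(keywords)  # a list of (keyword, relevance score) pairs
```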

Q: Another package you've built is PolyFuzz. What is PolyFuzz?
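For context, PolyFuzz matches and groups strings across lists. A minimal sketch, using the package's "TF-IDF" matcher (one of several documented options):

```python
# Sketch: fuzzy string matching with PolyFuzz.
from polyfuzz import PolyFuzz

model = PolyFuzz("TF-IDF")
model.match(["apple", "appel", "bananna"], ["apple", "banana"])
print(model.get_matches())  # a DataFrame with From, To, and Similarity columns
```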

Q: When using BERTopic in a language other than English, what should BERTopic users change in the pipeline?
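One hedged pointer here: BERTopic exposes a language parameter that swaps in a multilingual embedding model, and any multilingual SBERT model can also be passed in directly:

```python
# Sketch: a multilingual setup. The language flag selects a multilingual
# sentence-transformers model under the hood.
from bertopic import BERTopic

topic_model = BERTopic(language="multilingual")
```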

Q: When using the default BERTopic pipeline, HDBSCAN clustering often results in a large outlier cluster (topic -1). How do you suggest users deal with that?
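One documented option, alongside tuning HDBSCAN's min_cluster_size, is BERTopic's reduce_outliers method, which reassigns outlier documents to their closest topics. A sketch, assuming a fitted model:

```python
# Sketch: reassign outlier documents (topic -1) to their nearest topics.
new_topics = topic_model.reduce_outliers(docs, topics)
topic_model.update_topics(docs, topics=new_topics)  # refresh topic representations
```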

Q: How does BERTopic compare to LDA and Top2Vec?

Q: What happens after topic modeling? Is it only used to generate reports? Have you seen it being used to create online systems?

Q: What do you think of using GPT language models in the topic modeling pipeline?

Q: By creating and maintaining BERTopic, you have created value for a lot of people. How can people contribute back?