Combing For Insight in 10,000 Hacker News Posts With Text Clustering

The orange and purple clusters contain highly insightful Hacker News topics

Hacker News is one of the leading online communities for discussing software and startup topics. I’ve frequented the site for over ten years and constantly admire its signal-to-noise ratio. It houses a wealth of knowledge and insightful discussions accumulated over the years. That invaluable archive is hard to explore, however, beyond keyword search. I wanted a better way to explore it: the ability to browse by topic, to zoom in on interesting areas and life experiences, and to extract more of the personal and professional insights I’ve been coming across over the years.

So I built a map of the top 10,000 Hacker News posts of all time. I visualized it using the embeddings of the titles and kept browsing it and slicing till I zoomed in on some of my favorite areas of Hacker News (turns out 'Ask HN' contained the majority of what I was aiming to find):

Life experiences and advice threads

This cluster (Ask HN cluster #6) had the most personal life insight. The articles here tend to be from people asking for personal insight. Keywords in this cluster (extracted with an algorithm called cTFIDF, more on that below) include advice, deal, work, and life. Example posts include:

Technical and personal development

This cluster (Ask HN cluster #7) groups discussions about advancing technical knowledge with some overlap on personal development. Keywords in this cluster include: self, cs, courses, design. Example posts include:

Software career insights, advice, and discussions

This cluster (Ask HN cluster #5) groups discussions about software engineering careers, interviews, and negotiations. Its keywords include: career, tech, developer, remote, software, jobs. Example posts include:

General content recommendations (blogs/podcasts)

This cluster (Ask HN cluster #3) revolves around content recommendations. A lot of the posts asked for recommended blogs, books, podcasts, and talks. Example posts include:

These are the clusters I personally found the most interesting. The other clusters in the group are:

Note: Click on the name of the cluster to see the posts it contains.


Browse the maps yourself

I have two maps for you. They are best viewed on desktop/laptop. The rest of the article goes into how they were created:

1) Top 10,000 Hacker News articles of all time

2) Top 3,000 posts in Ask HN

In these figures:

  • Each dot is an article
  • The closer two dots are, the closer their meanings are
  • Hover to read the title
  • Click a dot to go to the HN article (or shift+click so it opens in a different tab)
  • Raw data: We're making the data of the top 3K "Ask HN" posts available for you to experiment with. This includes both the CSV file containing the posts and their metadata, as well as the embeddings vectors of the titles.
  • See this Notebook/Colab for all the code and how to load the embeddings, visualize them, cluster them, and use them for semantic search.

Exploring large amounts of text with clustering

An image of a group of documents, then after the process of embedding and clustering, the documents are clustered into three different groups.

This post demonstrates a common NLP use case often referred to as document clustering or topic modeling. It's the exercise of analyzing a large amount of text and sorting documents (or headlines in our case) into groups. Use cases include:

  • Analyzing news articles to group similar articles (like how Google News groups articles about the same event)
  • Analyzing customer emails to identify common requests and themes
  • Exploring financial filings and earning

How it was built

The map was built using language models and a collection of NLP tools. It revolves around embedding post titles and turning each one into a vector embedding using Cohere’s Embed endpoint. I’ve uploaded the embeddings so you can download and experiment with them.

So how do you make sense of ten thousand pieces of text without reading them individually?

There has got to be a better way to explore thousands of article headlines

Getting the dataset

The Hacker News archive is available as a public dataset on BigQuery. Once you configure your credentials, you can retrieve posts or comments with a SQL query. This is the query I’ve used to get the top scoring HN posts of all time:

SELECT *
FROM `bigquery-public-data.hacker_news.full`
WHERE TYPE = 'story'
AND score > 10
ORDER BY score DESC 
LIMIT 10000

Getting meaningful text representations with Embed()

The next step was to embed these titles so we can examine the dataset based on the meanings of the titles and not just the tokens they contain.

Cohere’s embed endpoint gives us vector representations from a large embedding language model specifically tuned for text embedding (as opposed to word embedding or text generation).

  
embeds = co.embed(texts=list_of_posts,
                  model="small",
                  truncate="LEFT").embeddings
  

This gives us a matrix where each post title has a 1024-dimensional vector that numerically captures its meaning.
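To get a feel for what such a matrix makes possible, here is a minimal sketch of a nearest-neighbour lookup using cosine similarity. It runs on a tiny hand-made matrix that stands in for the real embeddings, and the helper name `most_similar` is hypothetical (it is not part of the Cohere SDK or the notebook).

```python
import numpy as np

def most_similar(embeds, query_index, top_k=3):
    """Return indices of the top_k rows most similar to row query_index (cosine)."""
    matrix = np.asarray(embeds, dtype=float)
    # Normalize rows so that a dot product equals cosine similarity
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # exclude the query itself
    return np.argsort(-scores)[:top_k]

# Stand-in data (an assumption): 5 fake "titles" with 4-dimensional embeddings
toy_embeds = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
neighbors = most_similar(toy_embeds, query_index=0, top_k=2)
```

With the real data, `embeds` would take the place of `toy_embeds`, and the returned indices would point back into the list of post titles.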

Plotting

Next, we’ll reduce the embeddings down to two dimensions so we can plot them and explore which posts are similar to each other. We use UMAP for this dimensionality reduction.

The UMAP call looks like this:

  
import umap

reducer = umap.UMAP(n_neighbors=100)
umap_embeds = reducer.fit_transform(embeds)
  

Which creates a map that looks like this:

The semantic search guide walks you through how to build such a figure for your data. From my examination, here are some of the clear regions of this map.

Ask HN

Titles that contain 'Ask HN' highlighted on the plot
Titles that contain the term "Ask HN"

Show HN

Titles that contain 'Show HN' highlighted on the plot
Titles that contain the term "Show HN"

Startup

Titles that contain 'Startup' highlighted on the plot
Titles that contain the term "Startup"

Google

Titles that contain 'Google' highlighted on the plot
Titles that contain the term "Google"

Covid

Titles that contain 'covid' highlighted on the plot
Titles that contain the term "covid"

Database

Titles that contain 'database' highlighted on the plot
Titles that contain the term "database"

Postgres

Titles that contain 'postgres' highlighted on the plot
Titles that contain the term "postgres"

It’s easy to spend quite a bit of time exploring such a figure. But it was clear to me that Ask HN is where a lot of insightful discussions are likely to take place.

Zooming into Ask HN

To zoom closer into Ask HN, I made a new query to get the top 3K Ask HN posts of all time. I removed the “Ask HN:” string from the titles before embedding these titles in an attempt (heuristic) to focus the model on the important part of the title (whatever comes after 'ask hn:').

SELECT *
FROM `bigquery-public-data.hacker_news.full`
WHERE TYPE = 'story'
AND score > 10
AND CONTAINS_SUBSTR(title, "ask hn")
ORDER BY score DESC
LIMIT 3000
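The prefix removal mentioned above could be sketched roughly as follows. This is one possible approach rather than the exact code used for the post; the helper name `strip_ask_hn` and the sample titles are made up for illustration.

```python
import re

# Matches a leading "Ask HN" with an optional colon, case-insensitively
ASK_HN_PREFIX = re.compile(r"^\s*ask hn\s*:?\s*", flags=re.IGNORECASE)

def strip_ask_hn(title):
    """Remove a leading 'Ask HN:' marker and surrounding whitespace."""
    return ASK_HN_PREFIX.sub("", title).strip()

# Made-up titles, only to show the behaviour
titles = [
    "Ask HN: What books changed the way you think?",
    "ask hn: How do you deal with burnout?",
    "Tell HN: Something else",
]
cleaned = [strip_ask_hn(t) for t in titles]
```

Titles that do not start with the marker pass through unchanged, so the same function can be applied to the whole result set before embedding.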

Plotting this batch absolutely validates that this is where a lot of the most fascinating threads reside.

Showing more information with clustering

Let’s now cluster these posts to understand their overall hierarchy. The goal here is to add more visual information instead of relying on hovering over points to reveal their contents.

We can use KMeans clustering on the original embeddings to create eight clusters.

  
from sklearn.cluster import KMeans

# Pick the number of clusters
n_clusters = 8

# Cluster the embeddings
kmeans_model = KMeans(n_clusters=n_clusters)
classes = kmeans_model.fit_predict(embeds)
  

Which can then be plotted to look like this:

Conceptually, this is what we've done so far:

To extract the main keywords for each cluster, we can use the cTFIDF algorithm from the awesome BERTopic package. That results in these being the main keywords for each cluster:
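For readers curious about what cTFIDF computes, here is a simplified sketch of the idea: treat every cluster as one big document, then score each word by how frequent it is inside the cluster relative to how frequent it is overall. The scoring formula below (term frequency times log(1 + average words per cluster / total word frequency)) follows my understanding of the BERTopic documentation; the function `ctfidf_keywords` and the toy titles are hypothetical, and BERTopic's own implementation differs in its details (tokenization, stop words, sparse matrices).

```python
import math
from collections import Counter

def ctfidf_keywords(docs, labels, top_n=3):
    """Rank words per cluster with a simplified class-based TF-IDF."""
    # Treat each cluster as one big document and count its words
    class_counts = {}
    for doc, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(doc.lower().split())

    # Frequency of each word across all clusters
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # Average number of words per cluster
    avg_words = sum(total_counts.values()) / len(class_counts)

    keywords = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores = {
            word: (freq / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, freq in counts.items()
        }
        keywords[label] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return keywords

# Made-up titles and cluster assignments, only for illustration
toy_docs = ["career advice for developer", "remote developer career",
            "books to read", "podcasts and books"]
toy_labels = [0, 0, 1, 1]
toy_keywords = ctfidf_keywords(toy_docs, toy_labels, top_n=2)
```

In the pipeline described here, `docs` would be the post titles and `labels` the `classes` array returned by KMeans.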

Understanding the hierarchy of the topics

We can use a hierarchical plot to better understand the hierarchy of the clusters. It’s useful that KMeans produces a centroid for each cluster - a point in 1024-dimensional space that we can use to represent the cluster.

For this plot, we use the hierarchy package from scipy:

  
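from scipy.cluster import hierarchy

# Single-linkage tree over the KMeans centroids; label_list holds one label per cluster
Z = hierarchy.linkage(kmeans_model.cluster_centers_, 'single')
dn = hierarchy.dendrogram(Z, orientation='right',
                         labels=label_list)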
  

Here’s how we can read this hierarchy (scanning it from the right to left):

  • The first main branching is between the posts for hiring/seeking posts and the remainder of Ask HN. If we’re to break Ask HN into only two clusters, those would be the clearest two clusters.
  • From the large cluster (everything except hiring/seeking posts), cluster #3 (life, reading, hn, blogs, books) is the next most distinct cluster we can peel off the big group.
  • Imagine a vertical line slicing the tree at a certain value on the X-axis, and that would show you the resulting clusters.
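The "vertical line" idea from the last bullet can be tried out in code with scipy's `fcluster`, which cuts a linkage tree at a given distance. The sketch below uses four made-up 2D centroids instead of the real KMeans centroids, and the threshold of 5.0 is an arbitrary value chosen to suit that toy data.

```python
import numpy as np
from scipy.cluster import hierarchy

# Stand-in centroids (an assumption): two tight pairs that are far apart
toy_centroids = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [10.0, 0.0],
    [10.0, 1.0],
])

Z = hierarchy.linkage(toy_centroids, "single")

# "Slice" the tree at distance 5.0: centroids that merge below it share a group
groups = hierarchy.fcluster(Z, t=5.0, criterion="distance")
```

Lowering the threshold splits the tree into more groups, and raising it merges them, which mirrors moving the imaginary line along the X-axis of the dendrogram.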

Back to the 10K HN dataset

Let us go back to the bigger dataset and see how it would look if we cluster the entire top 10K HN posts in the same way. Because it's a bigger dataset, let's break it down into 15 clusters:

If you know how KMeans works, you might be surprised to see one cluster appear in two different places on the map. That happens because we are clustering the embeddings (1024 dimensions), not the coordinates (2 dimensions obtained with UMAP). The embeddings contain much more information about each post than can be held in two dimensions. In fact, for industrial use cases, you can go further with larger embedding models (at the time of writing, large-20220425 is a much larger model that produces 4,096-dimensional embeddings, capturing even more information from the text).

We can also examine the hierarchy of the clusters:

Next steps after topic modeling

Topic modeling reveals a lot about a text archive. The insights revealed by topic modeling can often be built into a system in ways such as:

  • Classifying new pieces of text into the topics developed during the exploration (if the topics or clusters are broad like "Sports" / "Technology")
  • Updating the topic model periodically (so different articles about a news event can start to have their own topic/cluster)
  • Building a classifier to identify a certain topic or type of document/text. Examples here are content recommendation (I'm definitely building a classifier for Hacker News topics about life experiences and advice threads based on cluster #6 above) and content moderation (where the dataset is the archive of a Discord channel, for example, and the community guidelines don't allow for a specific type of comment). See the Slack bot tutorial for a guide that can get you started on such a use case.
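As a rough illustration of the first bullet, a fitted KMeans model can assign a new embedding to its nearest existing cluster with `predict`. The sketch below uses random stand-in vectors rather than real embeddings; in practice the new text would need to be embedded with the same model as the original posts.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in embeddings (an assumption): two well-separated blobs in 8 dimensions
rng = np.random.default_rng(0)
toy_embeds = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 8)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 8)),
])

kmeans_model = KMeans(n_clusters=2, n_init=10, random_state=0)
classes = kmeans_model.fit_predict(toy_embeds)

# A "new post" whose embedding sits near the second blob
new_embed = np.full((1, 8), 5.0)
new_class = kmeans_model.predict(new_embed)[0]
```

This is nearest-centroid assignment, so it works best when the clusters are broad and stable; a dedicated classifier, as described in the last bullet, is usually the better fit for narrower topics.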

Takeaways

Here are some of my takeaways from looking more closely at this process:

  • Text is computable: Language models enable truly fascinating applications of computing, organizing, and retrieving text. We haven't yet scratched the surface in terms of what's possible. We have a visual introduction to language models to help you catch up to this kind of technology. You can get started and access these models right now.
  • NLP Methods: There's a lot of value in complementing language models with other NLP methods like TF-IDF, not only as feature extractors as they've traditionally been used, but in other places in the pipeline as well. Clusters like Cluster #0 lend themselves to entity extraction to identify the companies mentioned.
  • Clustering Methods: There are many clustering options to consider and each with many parameters to play with. What we used here was KMeans. We also used hierarchical clustering of the KMeans centroids, but it can be applied on the whole dataset as well. BERTopic uses HDBSCAN which allows clustering without choosing a certain number of clusters (the flip side is sometimes a large "non-clustered" cluster, likely addressable with certain parameter experimentation).
  • Number of Clusters: When exploring a dataset, I find that it’s more intuitive to break the dataset into a small number of clusters at first (say five to eight clusters) then increase the number as you become more familiar with the space.
  • Zooming into a Cluster: In addition to simply increasing the number of clusters, it's often handy to zoom into a specific cluster alone exclusive of the remainder of the dataset. With UMAP, running the dimensionality reduction again on that cluster alone produces a better plot for that cluster.
  • Topic Modeling: A common way of doing topic modeling is using Latent Dirichlet Allocation (LDA). One property of that approach is that it assigns each document a certain percentage of membership in each topic. If that's a desirable property for a use case, then it would be interesting to experiment with soft-clustering methods like Gaussian Mixture Models, for example.
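To make the soft-clustering idea from the last takeaway concrete, here is a minimal sketch using scikit-learn's `GaussianMixture`, which returns a membership probability per cluster for each point. The data is random stand-in vectors in two dimensions; this was not part of the original analysis, and for 1024-dimensional embeddings you would likely need a diagonal covariance or a dimensionality reduction step first.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in data (an assumption): two blobs in 2 dimensions
rng = np.random.default_rng(0)
toy_embeds = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 2)),
    rng.normal(loc=4.0, scale=0.5, size=(30, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(toy_embeds)

# Each row holds one point's membership probabilities across the clusters
memberships = gmm.predict_proba(toy_embeds)
```

Each row of `memberships` sums to one, which gives the same "percentage of membership" reading that LDA offers for topics.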

Finally

Be sure to check out the maps linked above and the Notebook/Colab if you want to plot your own data. I'm interested in extending this work to have generative models name the clusters. You can follow and contribute to that experiment in this forum post.

Acknowledgements
Thanks to Aidan Gomez, Almond Au, Carlos Timoteo, Jim Wu, and João Araújo for feedback on earlier versions of this post.