Easy, Fast, and Effective Topic Modeling For Beginners with FASTopic

Community Article Published August 23, 2024

Author: Xiaobao Wu


Introduction

What is Topic Modeling?

Topic modeling is a technique used in NLP and machine learning to automatically discover the latent topics that occur within a large collection of documents or text data. It works by analyzing the patterns of word co-occurrence across documents and grouping words that frequently appear together into topics.

Each topic is typically represented as a probability distribution over words, and each document is represented as a mixture of these topics, with some topics more prominent in a given document than others. This makes it possible to categorize and summarize large volumes of text, helping users understand the underlying themes and trends in the data.
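To make the two representations concrete, here is a minimal sketch with hand-written numbers. The vocabulary, the two topics, and the document mixtures are all made up for illustration and do not come from any trained model.

```python
import numpy as np

# Toy vocabulary and two hand-written topics (illustrative values only).
vocab = ["game", "team", "score", "market", "stock", "price"]

# Each row is one topic: a probability distribution over the vocabulary.
topic_word = np.array([
    [0.40, 0.35, 0.20, 0.02, 0.02, 0.01],  # a "sports"-like topic
    [0.01, 0.02, 0.02, 0.36, 0.32, 0.27],  # a "finance"-like topic
])

# Each row is one document: a mixture over the two topics.
doc_topic = np.array([
    [0.9, 0.1],  # mostly the first topic
    [0.2, 0.8],  # mostly the second topic
])

# Word distribution implied for each document by its topic mixture.
doc_word = doc_topic @ topic_word

# Show the three most probable words of each topic.
for k, row in enumerate(topic_word):
    top_words = [vocab[i] for i in np.argsort(row)[::-1][:3]]
    print(f"topic {k}: {top_words}")
print(doc_word.round(3))
```

Because every topic row and every document row sums to 1, each row of doc_word is again a valid distribution over the vocabulary.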

Topic modeling is widely used in applications such as document classification, text summarization, information retrieval, and sentiment analysis, making it a valuable tool for extracting meaningful information from unstructured text data.

What is FASTopic?

Previous topic models can be classified into three types: (1) Conventional Topic Models like LDA. They usually use Gibbs Sampling or Variational Inference to learn topics. (2) VAE-based Neural Topic Models like ProdLDA and ETM. They leverage the Variational AutoEncoder (VAE) to model topics. (3) Clustering-based Neural Topic Models like Top2Vec and BERTopic. They cluster document embeddings and extract significant words from document clusters.

Unlike these approaches, FASTopic models the optimal transport plans between documents, topics, and words. It only needs document embeddings from a pretrained Transformer, such as those provided by sentence-transformers. It then uses the optimal transport plans between the document, topic, and word embeddings to model the topics and the topic distributions of documents.
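For readers new to optimal transport, the sketch below computes an entropic-regularized transport plan between two small sets of embeddings with plain Sinkhorn iterations. It is a generic illustration, not FASTopic's actual implementation: the random embeddings, the squared Euclidean cost, the uniform marginals, and the values of epsilon and num_iters are all assumptions chosen for the example.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, epsilon=0.2, num_iters=500):
    """Entropic-regularized transport plan with row marginals a and column marginals b."""
    kernel = np.exp(-cost / epsilon)      # Gibbs kernel derived from the cost matrix
    u = np.ones_like(a)
    for _ in range(num_iters):
        v = b / (kernel.T @ u)            # rescale to match the column marginals
        u = a / (kernel @ v)              # rescale to match the row marginals
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
topic_emb = rng.normal(size=(3, 8))       # stand-in topic embeddings (hypothetical)
word_emb = rng.normal(size=(5, 8))        # stand-in word embeddings (hypothetical)

# Squared Euclidean distance as the transport cost, scaled to [0, 1].
cost = ((topic_emb[:, None, :] - word_emb[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()

a = np.full(3, 1 / 3)                     # uniform mass over topics
b = np.full(5, 1 / 5)                     # uniform mass over words
plan = sinkhorn_plan(cost, a, b)
print(plan.round(3))
```

Each entry of plan says how much mass moves between a topic and a word, so cheaper (closer) pairs receive more mass. The same routine could be applied to a document-topic cost matrix.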

FASTopic gives users a practical tool for understanding documents. It is user-friendly, fast, effective, stable, and transferable. Users can apply FASTopic in diverse fields, such as business intelligence, academic research, news and media, healthcare, legal, and marketing.

Why FASTopic?

  1. Extremely Fast. FASTopic doesn't need the Gibbs Sampling of LDA, the complicated VAE structures of neural topic models, or the dimensionality reduction and clustering steps of BERTopic. FASTopic directly uses the fast Sinkhorn algorithm to solve the optimal transport between document, topic, and word embeddings.

  2. High Effectiveness. FASTopic shows strong performance on topic coherence, topic diversity, and inference of the topic distributions of documents.

  3. Simple Architecture. FASTopic has a simple architecture with few hyperparameters, so users can avoid complicated and frustrating hyperparameter tuning.

  4. High Transferability. A FASTopic model trained on one dataset transfers well to another dataset.

Quickstart: How to use FASTopic

This section shows how to quickly apply FASTopic to your own dataset.

  1. Install FASTopic with pip.
pip install fastopic
  1. Pass your dataset.
from fastopic import FASTopic
from topmost.preprocessing import Preprocessing

# Prepare your dataset.
docs = [
    'doc 1',
    'doc 2', # ...
]

# Preprocess the dataset. This step tokenizes docs, removes stopwords, and sets max vocabulary size, etc.
# Pass your tokenizer as:
#   preprocessing = Preprocessing(vocab_size=your_vocab_size, tokenizer=your_tokenizer, stopwords=your_stopwords_set)
preprocessing = Preprocessing(stopwords='English')

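# Pass the number of topics and the preprocessing object positionally
# (a positional argument cannot follow a keyword argument in Python).
model = FASTopic(50, preprocessing)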
topic_top_words, doc_topic_dist = model.fit_transform(docs)

topic_top_words is a list containing the top words of the discovered topics. doc_topic_dist contains the topic distributions of the documents (doc-topic distributions) as a numpy array of shape N × K, where N is the number of documents and K is the number of topics.
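As a hedged illustration of how these two outputs can be read together, the sketch below finds the dominant topic of each document. The arrays are small hand-written stand-ins with the shapes described above, not real FASTopic output, and the string format of the entries in topic_top_words is an assumption.

```python
import numpy as np

# Stand-in outputs (hypothetical values): K = 3 topics, N = 4 documents.
topic_top_words = [
    "game team season",
    "market stock price",
    "health virus vaccine",
]
doc_topic_dist = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
])

# The dominant topic of a document is the column with the largest weight.
dominant_topic = doc_topic_dist.argmax(axis=1)

for doc_idx, k in enumerate(dominant_topic):
    weight = doc_topic_dist[doc_idx, k]
    print(f"doc {doc_idx}: topic {k} (weight {weight:.2f}) -> {topic_top_words[k]}")
```

With real output, replace the stand-ins with the values returned by model.fit_transform(docs).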

Tutorial: Use FASTopic to analyze New York Times news

The code of this tutorial is available at Colab:

  1. Prepare a Dataset.

We download the preprocessed NYT dataset, which contains news articles from the New York Times.

import topmost
from topmost.data import download_dataset
from fastopic import FASTopic
download_dataset("NYT", cache_path="./datasets")
dataset = topmost.data.DynamicDataset("./datasets/NYT", as_tensor=False)
docs = dataset.train_texts
  1. Train FASTopic.
model = FASTopic(num_topics=50, verbose=True)
topic_top_words, doc_topic_dist = model.fit_transform(docs)
  1. Topic information.

We can retrieve the top words of a topic along with their probabilities.

model.get_topic(topic_idx=36)

(('cancer', 0.004797671),
 ('monkeypox', 0.0044828397),
 ('certificates', 0.004410268),
 ('redfield', 0.004407463),
 ('administering', 0.0043857736))
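The returned value is a tuple of (word, probability) pairs, so it can be unpacked with ordinary Python. The sketch below reuses the output shown above. Rescaling the five probabilities so they sum to 1 is only a display convenience added here, not something FASTopic does.

```python
# Output copied from the model.get_topic(topic_idx=36) call above.
topic_36 = (
    ("cancer", 0.004797671),
    ("monkeypox", 0.0044828397),
    ("certificates", 0.004410268),
    ("redfield", 0.004407463),
    ("administering", 0.0043857736),
)

# Split the pairs into words and probabilities.
words = [word for word, _ in topic_36]
probs = [prob for _, prob in topic_36]

# Rescale the shown probabilities to sum to 1, for display only.
total = sum(probs)
relative = {word: prob / total for word, prob in topic_36}

for word in words:
    print(f"{word:<15} {relative[word]:.3f}")
```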
  1. Visualize these topics.
fig = model.visualize_topic(top_n=5)
fig.show()

  1. Topic hierarchy.

We use the learned topic embeddings and scipy.cluster.hierarchy to build a hierarchy of discovered topics.

fig = model.visualize_topic_hierarchy()
fig.show()
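To show what building such a hierarchy involves, here is a minimal sketch using scipy.cluster.hierarchy directly. The topic embeddings are random stand-ins, and the Ward linkage and the cut into at most three groups are assumptions for illustration. The settings FASTopic uses internally may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
topic_embeddings = rng.normal(size=(10, 16))  # stand-in: 10 topics, 16 dimensions

# Agglomerative clustering: each row of Z records one merge of two clusters.
Z = linkage(topic_embeddings, method="ward")

# Cut the tree so that at most 3 groups of topics remain.
groups = fcluster(Z, t=3, criterion="maxclust")

for topic_idx, group in enumerate(groups):
    print(f"topic {topic_idx} -> group {group}")
```

The linkage matrix Z can also be passed to scipy.cluster.hierarchy.dendrogram to draw the tree.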

  1. Topic weights.

We plot the weights of topics in the given dataset.

fig = model.visualize_topic_weights(top_n=20, height=500)
fig.show()

  1. Topic activity over time.

Topic activity refers to the weight of a topic at a time slice. We additionally pass the time slices of the documents, time_slices, to compute and plot topic activity over time.

time_slices = dataset.train_times
act = model.topic_activity_over_time(time_slices)
fig = model.visualize_topic_activity(top_n=6, topic_activity=act, time_slices=time_slices)
fig.show()
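One simple way to think about topic activity is to aggregate the doc-topic distributions of the documents that fall in each time slice. The sketch below averages them per slice. This is an assumed definition for illustration, and the exact computation in topic_activity_over_time may differ. The arrays are small hand-written stand-ins.

```python
import numpy as np

# Stand-in doc-topic distributions (6 documents, 2 topics) and their time slices.
doc_topic_dist = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.5, 0.5],
    [0.3, 0.7],
    [0.2, 0.8],
    [0.0, 1.0],
])
time_slices = np.array([0, 0, 1, 1, 2, 2])

num_topics = doc_topic_dist.shape[1]
num_slices = int(time_slices.max()) + 1

# activity[k, t] = mean weight of topic k over the documents in time slice t.
activity = np.zeros((num_topics, num_slices))
for t in range(num_slices):
    activity[:, t] = doc_topic_dist[time_slices == t].mean(axis=0)

print(activity)
```

In this toy data the first topic fades over time while the second one grows, which is the kind of trend the activity plot is meant to reveal.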

References:

FASTopic GitHub repo: https://github.com/bobxwu/FASTopic
FASTopic paper: https://arxiv.org/abs/2405.17978
TopMost GitHub repo: https://github.com/bobxwu/topmost

Contact: Xiaobao Wu