Description
Latent Dirichlet Allocation (LDA) is a generative statistical model in which sets of observations are explained by unobserved groups, accounting for why some parts of the data are similar.
How it Works
- LDA represents documents as mixtures of topics.
- Each topic is modeled as a distribution over words.
- LDA assumes that the words of each document are generated from a random mixture of latent topics.
- Concretely, the words of each document are produced by a two-level generative process: first a per-document mixture of topics is drawn, then each word is generated by choosing a topic from that mixture and a word from that topic's distribution.
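The two-level process above can be sketched in plain Python. The vocabulary and the two topic-word distributions below are made-up toy values, not part of any fitted model; the Dirichlet draw is built from gamma samples so no external libraries are needed.

```python
import random

random.seed(0)

def sample_dirichlet(alpha, k):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs):
    """Draw an index according to the given probability vector."""
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# Hypothetical topics: each is a distribution over a tiny vocabulary.
vocab = ["gene", "dna", "ball", "goal"]
topics = [
    [0.5, 0.5, 0.0, 0.0],   # a "biology"-like topic
    [0.0, 0.0, 0.5, 0.5],   # a "sports"-like topic
]

def generate_document(n_words, alpha=0.5):
    # Level 1: draw this document's topic mixture from a Dirichlet prior.
    theta = sample_dirichlet(alpha, len(topics))
    doc = []
    for _ in range(n_words):
        # Level 2: pick a topic for this word, then a word from that topic.
        z = sample_categorical(theta)
        w = sample_categorical(topics[z])
        doc.append(vocab[w])
    return doc

doc = generate_document(8)
print(doc)
```

Documents generated this way tend to be dominated by one or two topics, because a small Dirichlet alpha concentrates the mixture on few components.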
Benefits
- LDA is a powerful tool for the automatic discovery and recognition of topics in text corpora.
- It provides a simple way to analyze large volumes of unlabeled text.
- LDA can be used to assign the text of a document to its most probable topic.
- It can also be used to estimate the full topic distribution of a document.
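To illustrate the last two points, here is a deliberately crude estimate of a document's topic distribution: average the per-word topic posteriors under a uniform topic prior. Real LDA inference is iterative (variational Bayes or Gibbs sampling), and the topic-word probabilities below are hypothetical values standing in for a fitted model.

```python
# Hypothetical topic-word probabilities; in practice these come from
# a fitted LDA model, not hand-written values.
topics = {
    "biology": {"gene": 0.5, "dna": 0.5, "ball": 0.0, "goal": 0.0},
    "sports":  {"gene": 0.0, "dna": 0.0, "ball": 0.5, "goal": 0.5},
}

def topic_distribution(doc_words):
    """Crude one-pass estimate: average per-word topic posteriors under a
    uniform prior over topics (real LDA inference iterates to convergence)."""
    totals = {name: 0.0 for name in topics}
    for w in doc_words:
        weights = {name: dist.get(w, 0.0) for name, dist in topics.items()}
        norm = sum(weights.values())
        if norm == 0:
            continue  # word unseen by every topic; skip it
        for name in topics:
            totals[name] += weights[name] / norm
    n = sum(totals.values())
    return {name: t / n for name, t in totals.items()} if n else totals

doc = ["gene", "dna", "gene", "ball"]
dist = topic_distribution(doc)
print(dist)  # → {'biology': 0.75, 'sports': 0.25}
```

The most probable topic for classification is then simply the argmax of this distribution.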
Limitations
- LDA assumes documents are produced from a mixture model, which may not always hold true.
- It also assumes that the order of the words in the document does not matter (bag-of-words assumption), which is not always the case.
- LDA can be quite sensitive to the number-of-topics parameter, and it is often unclear what value to choose.
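The bag-of-words assumption is easy to see concretely: two documents with the same words in a different order are indistinguishable to LDA, since it only sees word counts.

```python
from collections import Counter

# Under the bag-of-words assumption these two documents are identical,
# even though their meanings differ.
doc_a = "the dog bit the man".split()
doc_b = "the man bit the dog".split()

print(Counter(doc_a) == Counter(doc_b))  # → True: same counts, order ignored
```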
Features
- LDA is a three-level hierarchical Bayesian model where each item of a collection is modeled as a finite mixture over an underlying set of topics.
- Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.
- The number of topics is a parameter of the LDA model and must be specified a priori.