Description
SageMaker LDA is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. Each category is itself a probability distribution over the features. It is most commonly associated with topic modeling in text corpora.
How it Works
- SageMaker LDA assumes that each document is formed by sampling words from a finite set of topics.
- It is a generative model: each document first draws its own topic mixture, and is then generated word by word.
- For each word position in the document, a topic is chosen from the document's mixture, and the word is drawn from that topic's distribution over the vocabulary (a NumPy sketch of this process follows the list).
- Training searches for model parameters that maximize the probability that the observed corpus was generated by the model.
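To make the generative story concrete, here is a minimal NumPy sketch of the process described above. The vocabulary size, topic count, and symmetric Dirichlet priors are illustrative choices for this example, not SageMaker defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 8    # V: number of words in the vocabulary
num_topics = 3    # k: number of topics
doc_length = 20   # words per generated document

# beta: each row is one topic's probability distribution over the vocabulary.
beta = rng.dirichlet(np.ones(vocab_size), size=num_topics)

def generate_document():
    # 1. Draw this document's topic mixture theta from a Dirichlet prior.
    theta = rng.dirichlet(np.ones(num_topics))
    words = []
    for _ in range(doc_length):
        # 2. Choose a topic for this word position from the mixture.
        z = rng.choice(num_topics, p=theta)
        # 3. Draw the word from that topic's word distribution.
        w = rng.choice(vocab_size, p=beta[z])
        words.append(w)
    return theta, words

theta, words = generate_document()
print("topic mixture:", np.round(theta, 2))
print("word ids:", words)
```

Training inverts this process: given only the generated words, it estimates the topic-word distributions (beta) and priors that make the observed corpus most probable.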
Benefits
- SageMaker LDA's estimation method comes with theoretical guarantees on its results, unlike approximate inference methods such as Gibbs sampling.
- It is embarrassingly parallel: the work can be trivially divided across input documents during both training and inference.
- It is fast: tensor spectral decomposition has a low per-iteration cost and avoids the slow convergence that iterative sampling methods can exhibit.
Limitations
- SageMaker LDA assumes that documents are formed by sampling words from a finite set of topics, an assumption that real corpora may violate.
- It is a “bag-of-words” model: the order of words is ignored, so any information carried by word order is lost (illustrated in the snippet after this list).
- It requires a reasonable level of statistical understanding to interpret correctly.
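The bag-of-words limitation is easy to demonstrate. The snippet below uses scikit-learn's CountVectorizer (chosen here purely for illustration; it is not part of SageMaker LDA) to show two sentences with opposite meanings collapsing to the same count vector:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the model beats the baseline",
    "the baseline beats the model",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())
print(counts[0])  # identical to the next row...
print(counts[1])  # ...even though the sentences say opposite things
```

Because both sentences contain the same multiset of words, a bag-of-words model cannot distinguish them.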
Features
- SageMaker LDA is defined by two parameters: a prior estimate of the per-document topic probabilities, and a collection of topics, each of which is a probability distribution over the corpus vocabulary (a configuration sketch follows this list).
- It is a generative model: it models how observations are produced from latent variables, rather than only predicting outputs from inputs.
- It estimates the LDA model with tensor spectral decomposition, which offers several advantages over alternative methods such as Gibbs sampling and Expectation Maximization (EM).
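As a rough sketch of how these two parameters surface in practice, the example below configures the LDA estimator from the SageMaker Python SDK. The role ARN, instance type, and hyperparameter values are placeholders, and argument names (e.g. instance_type vs. the older train_instance_type) vary across SDK versions, so treat this as an assumption-laden outline rather than a verified recipe. Here alpha0 is the concentration of the Dirichlet topic prior and num_topics is the size of the topic collection.

```python
import numpy as np
import sagemaker
from sagemaker import LDA

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder ARN

lda = LDA(
    role=role,
    instance_count=1,               # LDA trains on a single instance
    instance_type="ml.c5.xlarge",
    num_topics=10,                  # k: number of topic-word distributions
    alpha0=1.0,                     # concentration of the Dirichlet topic prior
    sagemaker_session=session,
)

# Toy document-term counts: rows are documents, columns are vocabulary words.
doc_term_matrix = np.random.randint(0, 5, size=(100, 500)).astype("float32")

# record_set wraps the matrix in the RecordIO-protobuf format the algorithm
# expects and stages it in S3; LDA's fit() requires a mini_batch_size.
records = lda.record_set(doc_term_matrix)
lda.fit(records, mini_batch_size=100)
```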
Use Cases