Lemmatization is a process in natural language processing (NLP) where words are reduced to their base or dictionary form, known as the lemma. Unlike stemming, which crudely chops off the ends of words in the hope of achieving this goal often by removing common prefixes or suffixes, lemmatization considers the context and morphological analysis of words, aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
The lemma of a word is its canonical form or base form. For example:
Lemmatization is more sophisticated than stemming and requires understanding the part of speech (POS) of a word in a sentence, as well as its meaning in context. It's typically used in tasks that require high levels of accuracy and nuance in text interpretation, such as semantic reasoning tasks, where the precise meaning of text is important. This process is computationally more expensive than stemming because it involves looking up words in a vocabulary and performing a morphological analysis to accurately retrieve the correct lemma.