Lemmatization

Reduce words to their dictionary forms

Lemmatization is a text normalization technique in natural language processing that reduces words to their dictionary forms, known as lemmas. For example, “building has floors” reduces to “build have floor” upon lemmatization.

Lemmatization is often used for:

  • Information retrieval for expanding search criteria
  • Reducing dimensionality of problems in text classification, sentiment analysis, or topic modeling
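The dimensionality point can be illustrated with a minimal sketch. It assumes a tiny hand-written lookup table (`TOY_LEMMAS`, a hypothetical name) in place of a real lexicon, and only shows how mapping inflected forms to one lemma shrinks the vocabulary that a model has to represent.

```python
# Hand-written toy lookup table for illustration only; a real lemmatizer
# would use a full vocabulary and morphological analysis.
TOY_LEMMAS = {
    "applied": "apply",
    "applies": "apply",
    "applying": "apply",
    "floors": "floor",
}

def lemmatize_tokens(tokens):
    """Map each token to its toy lemma, leaving unknown tokens lowercased."""
    return [TOY_LEMMAS.get(t.lower(), t.lower()) for t in tokens]

tokens = ["apply", "applied", "applies", "applying", "floor", "floors"]

raw_vocab = set(tokens)                      # one feature per surface form
lemma_vocab = set(lemmatize_tokens(tokens))  # one feature per lemma

print(len(raw_vocab), "distinct tokens before lemmatization")
print(len(lemma_vocab), "distinct tokens after lemmatization")
```

In a bag-of-words model, each distinct token is typically one feature, so collapsing six surface forms into two lemmas reduces the feature count accordingly.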

Lemmatization is a common text preprocessing step performed before training machine learning models on text. It removes inflectional affixes from words by using a vocabulary and morphological analysis, which means the result often depends on the word's part of speech and its context.
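The dependence on part of speech can be sketched as follows. This is a simplified illustration, not how any particular toolbox implements lemmatization: the table `LEMMA_TABLE`, the tag names, and the function `lemmatize` are all hypothetical, and the part-of-speech tags are supplied by hand rather than predicted by a tagger.

```python
# Toy lemma table keyed by (word, part of speech). The entries and the tag
# names are hand-written assumptions for illustration, not a real lexicon.
LEMMA_TABLE = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}

def lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

# The same surface form maps to different lemmas depending on its tag.
for word, pos in [("meeting", "NOUN"), ("meeting", "VERB"),
                  ("saw", "NOUN"), ("saw", "VERB")]:
    print(word, pos, "->", lemmatize(word, pos))
```

Keying the lookup on the word and its tag together is what lets one spelling resolve to more than one lemma; a lookup on the word alone would have to pick a single answer.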

A related approach to lemmatization is stemming. Stemming is based on simple heuristic rules, so it is easier to implement and faster than lemmatization. However, stemming often produces roots or word fragments that are not actual words, whereas lemmatization is more accurate and returns valid dictionary words. For applications that require preserving the meanings of words, lemmatization is more useful than stemming.

The differences between lemmatization and stemming are shown below.

Actual Word    Lemmatization    Stemming
Requirement    Requirement      Requir
Applied        Apply            Appli
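The contrast between the two approaches can be reproduced with a small sketch. It assumes a handful of made-up suffix rules (`SUFFIX_RULES`) standing in for a real stemming algorithm and a two-entry lookup table (`TOY_LEMMAS`) standing in for a real lexicon; both are hypothetical and chosen only to cover the two example words.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# These rules are illustrative assumptions, not a published stemmer.
SUFFIX_RULES = [("ement", ""), ("ied", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hand-written lemma lookup for the two example words only.
TOY_LEMMAS = {"requirement": "requirement", "applied": "apply"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

def toy_lemma(word):
    """Look the word up in the toy table; unknown words pass through."""
    return TOY_LEMMAS.get(word.lower(), word.lower())

for word in ["Requirement", "Applied"]:
    print(word, "| lemma:", toy_lemma(word), "| stem:", toy_stem(word))
```

The stemmer needs no vocabulary, which is why it is simple and fast, but nothing checks that its output is a real word. The lemmatizer returns valid words only because it consults a dictionary.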

To learn more about using lemmatization and building predictive models from text data in MATLAB, see Text Analytics Toolbox™.

See also: natural language processing, sentiment analysis, word2vec, stemming, n-gram, text mining with MATLAB, data science, deep learning, Deep Learning Toolbox™, Statistics and Machine Learning Toolbox™