addLemmaDetails

Add lemma forms of tokens to documents

Use addLemmaDetails to add lemma forms to documents.

The function supports English and Japanese text.

Syntax

updatedDocuments = addLemmaDetails(documents)

Description

updatedDocuments = addLemmaDetails(documents) adds lemma details to documents and updates the token details. To get the lemma details from updatedDocuments, use tokenDetails.

Tip

Use addLemmaDetails before using the lower, upper, and normalizeWords functions, because addLemmaDetails uses information that these functions remove.
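
The ordering in this tip can be sketched as follows. This is a minimal illustration rather than a prescribed workflow: the input string is made up for the example, and lower stands in for any of the functions named above.

```matlab
% Illustrative input string (an assumption for this sketch).
documents = tokenizedDocument("The dogs ran after the cat.");

% Add lemma details first, while the original token information is intact.
documents = addLemmaDetails(documents);

% Only afterwards apply a function that alters the tokens, such as lower.
documents = lower(documents);

% Inspect the resulting token details.
tdetails = tokenDetails(documents);
```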

Examples

Create a tokenized document array.

str = [ ...
    "The dogs ran after the cat."
    "I am building a house."];
documents = tokenizedDocument(str);

Add lemma details to the documents using addLemmaDetails. This function lemmatizes the text and adds the lemma form of each token to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addLemmaDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×6 table
     Token     DocumentNumber    LineNumber       Type        Language     Lemma 
    _______    ______________    __________    ___________    ________    _______

    "The"            1               1         letters           en       "the"  
    "dogs"           1               1         letters           en       "dog"  
    "ran"            1               1         letters           en       "run"  
    "after"          1               1         letters           en       "after"
    "the"            1               1         letters           en       "the"  
    "cat"            1               1         letters           en       "cat"  
    "."              1               1         punctuation       en       "."    
    "I"              2               1         letters           en       "i"    

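Because tokenDetails returns a table, the Lemma variable can be read like any other table variable. The sketch below rebuilds the documents from this example and keeps only the rows where the lemma differs from the token; the variable name changed is a hypothetical name chosen for illustration.

```matlab
% Rebuild the documents from the example above.
str = [ ...
    "The dogs ran after the cat."
    "I am building a house."];
documents = addLemmaDetails(tokenizedDocument(str));
tdetails = tokenDetails(documents);

% Token and Lemma are string variables, so they can be compared elementwise.
changed = tdetails.Token ~= tdetails.Lemma;

% Show only the tokens whose lemma differs from the surface form.
tdetails(changed, ["Token" "Lemma"])
```

Judging by the table shown above, rows such as "dogs" (lemma "dog") and "ran" (lemma "run") would be among those selected.
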
Input Arguments

documents: Input documents, specified as a tokenizedDocument array.

Output Arguments

updatedDocuments: Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.

Introduced in R2018b