Add token type details to documents


updatedDocuments = addTypeDetails(documents)
updatedDocuments = addTypeDetails(documents,'TopLevelDomains',domains)



updatedDocuments = addTypeDetails(documents) detects the token types in documents and updates the token details. The function adds type details only to tokens whose type is unknown; tokens that already have a type are left unchanged. To get the token types from updatedDocuments, use tokenDetails.


updatedDocuments = addTypeDetails(documents,'TopLevelDomains',domains) also specifies the top-level domains to use for web address detection.
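The following is a minimal sketch of the 'TopLevelDomains' syntax, not an excerpt from the toolbox examples. The tokens and the addresses www.example.com and www.example.org are illustrative placeholders, and the domain list ["com"] is an arbitrary choice made here to narrow detection. How each token is classified depends on the function's detection rules, so inspect the Type column rather than assuming a result.

```matlab
% Hypothetical, manually tokenized text (placeholder addresses).
str = ["See" "www.example.com" "or" "www.example.org" "."];
documents = tokenizedDocument(str,'TokenizeMethod','none');

% Restrict web address detection to the listed top-level domains.
% The list ["com"] is an assumption chosen for illustration only.
updatedDocuments = addTypeDetails(documents,'TopLevelDomains',["com"]);

% Inspect the Type column to see how each token was classified.
tdetails = tokenDetails(updatedDocuments)
```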


Call addTypeDetails before calling the lower, upper, and erasePunctuation functions, because addTypeDetails relies on information that these functions remove.
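The recommended ordering can be sketched as follows. The input tokens are hypothetical; the point is only that addTypeDetails runs on the original tokens, before any function that changes case or removes punctuation.

```matlab
% Hypothetical, manually tokenized text.
str = ["Hello" "," "World" "!"];
documents = tokenizedDocument(str,'TokenizeMethod','none');

% Step 1: detect token types while case and punctuation are still intact.
documents = addTypeDetails(documents);

% Step 2: only then apply preprocessing that discards that information.
documents = erasePunctuation(documents);
documents = lower(documents);

% View the remaining tokens and their type details.
tdetails = tokenDetails(documents)
```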



Convert manually tokenized text into a tokenizedDocument object, setting the 'TokenizeMethod' option to 'none'.

str = ["For" "more" "information" "," "see" "" "."];
documents = tokenizedDocument(str,'TokenizeMethod','none')
documents = 

   7 tokens: For more information , see .

View the token details using the tokenDetails function.

tdetails = tokenDetails(documents)
tdetails=7×2 table
               Token               DocumentNumber
    ___________________________    ______________

    "For"                                1       
    "more"                               1       
    "information"                        1       
    ","                                  1       
    "see"                                1       
    ""          1       
    "."                                  1       

If you set 'TokenizeMethod' to 'none' in the call to the tokenizedDocument function, then the function does not detect the types of the tokens. To add the token type details, use the addTypeDetails function.

documents = addTypeDetails(documents);

View the updated token details.

tdetails = tokenDetails(documents)
tdetails=7×3 table
               Token               DocumentNumber       Type    
    ___________________________    ______________    ___________

    "For"                                1           letters    
    "more"                               1           letters    
    "information"                        1           letters    
    ","                                  1           punctuation
    "see"                                1           letters    
    ""          1           web-address
    "."                                  1           punctuation

Input Arguments


documents: Input documents, specified as a tokenizedDocument array.

domains: Top-level domains to use for web address detection, specified as a character vector, string array, or cell array of character vectors. Pass this value with the 'TopLevelDomains' option.

If you do not specify the 'TopLevelDomains' option, then the function uses the output of the topLevelDomains function.

Example: ["com" "net" "org"]

Data Types: char | string | cell
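As a sketch of the default behavior described for this argument, the two calls below are expected to give the same token details, because omitting the option is stated to use the output of topLevelDomains. The input tokens and the address www.example.com are hypothetical placeholders, and the comparison is a way to check this on your own installation, not a documented guarantee.

```matlab
% Hypothetical, manually tokenized text (placeholder address).
str = ["Visit" "www.example.com" "today" "."];
documents = tokenizedDocument(str,'TokenizeMethod','none');

% Default: no top-level domains specified.
documentsA = addTypeDetails(documents);

% Explicit: pass the output of topLevelDomains as the domain list.
documentsB = addTypeDetails(documents,'TopLevelDomains',topLevelDomains);

% Compare the resulting token details tables.
sameDetails = isequal(tokenDetails(documentsA),tokenDetails(documentsB))
```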

Output Arguments


updatedDocuments: Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.

Introduced in R2018b