tokenizedDocument
Array of tokenized documents for text analysis
Description
A tokenized document is a document represented as a collection of words (also known as tokens) which is used for text analysis.
Use tokenized documents to:
- Detect complex tokens in text, such as web addresses, emoticons, emoji, and hashtags.
- Remove words such as stop words using the removeWords or removeStopWords functions.
- Perform word-level preprocessing tasks such as stemming or lemmatization using the normalizeWords function.
- Analyze word and n-gram frequencies using bagOfWords and bagOfNgrams objects.
- Add sentence and part-of-speech details using the addSentenceDetails and addPartOfSpeechDetails functions.
- Add entity tags using the addEntityDetails function.
- Add grammatical dependency details using the addDependencyDetails function.
- View details about the tokens using the tokenDetails function.
The function supports English, Japanese, German, and Korean text. To learn how to use
tokenizedDocument
for other languages, see Language Considerations.
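As a rough sketch of how the tasks listed above might fit together, the following chains several of the named functions into one preprocessing pass. The sample text and variable names are illustrative assumptions, and topkwords is used only to display the result.

```matlab
% Illustrative sample text (an assumption for this sketch).
str = [
    "an example of a short sentence"
    "a second short sentence"];

% Tokenize, remove stop words, and stem the remaining words.
documents = tokenizedDocument(str);
documents = removeStopWords(documents);
documents = normalizeWords(documents);

% Count word frequencies with a bag-of-words model.
bag = bagOfWords(documents);

% Show the most frequent words.
tbl = topkwords(bag,5)
```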
Creation
Syntax
Description
documents = tokenizedDocument creates a scalar tokenized document with no tokens.
documents = tokenizedDocument(str) tokenizes the elements of a string array and returns a tokenized document array.
documents = tokenizedDocument(str,Name,Value) specifies additional options using one or more name-value pair arguments.
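The three syntaxes can be compared side by side in a short sketch. The sample strings are assumptions chosen for illustration, and doclength is used only to inspect the token counts.

```matlab
% Syntax 1: a scalar tokenized document with no tokens.
emptyDoc = tokenizedDocument;

% Syntax 2: tokenize each element of a string array.
str = [
    "an example of a short sentence"
    "a second short sentence"];
documents = tokenizedDocument(str);

% Syntax 3: add a name-value option (here, detect no complex tokens).
plainDocs = tokenizedDocument(str,'DetectPatterns','none');

% doclength reports the number of tokens in each document.
n = doclength(documents)
```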
Input Arguments
str
— Input text
string array | character vector | cell array of character vectors | cell array of string arrays
Input text, specified as a string array, character vector, cell array of character vectors, or cell array of string arrays.
If the input text has not already been split into words, then
str
must be a string array, character vector,
cell array of character vectors, or a cell array of string
scalars.
Example: ["an example of a short document";"a second short document"]
Example: 'an example of a single document'
Example: {'an example of a short document';'a second short document'}
If the input text has already been split into words, then
specify TokenizeMethod
to be "none"
. If
str
contains a single document, then it must be a string vector of
words, a row cell array of character vectors, or a cell array containing a single string vector
of words. If str
contains multiple documents, then it must be a cell array
of string arrays.
Example: ["an" "example" "document"]
Example: {'an','example','document'}
Example: {["an" "example" "of" "a" "short" "document"]}
Example: {["an" "example" "of" "a" "short" "document"];["a" "second" "short" "document"]}
Data Types: string
| char
| cell
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: DetectPatterns={'email-address','web-address'} detects email addresses and web addresses.
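A minimal sketch of the two ways to write name-value arguments follows. The sample text is an assumption, and the Name=Value form requires R2021a or later.

```matlab
% Illustrative sample text (an assumption for this sketch).
str = "Contact user@domain.com or visit https://www.mathworks.com";

% R2021a and later: Name=Value syntax.
docsA = tokenizedDocument(str,DetectPatterns={'email-address','web-address'});

% Earlier releases: comma-separated pairs with the name in quotes.
docsB = tokenizedDocument(str,'DetectPatterns',{'email-address','web-address'});

% Both calls should produce the same tokens.
sameTokens = isequal(string(docsA),string(docsB))
```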
TokenizeMethod
— Method to tokenize documents
"unicode"
| "mecab"
| mecabOptions
object | "none"
Method to tokenize documents, specified as one of these values:
- "unicode" – Tokenize input text using rules based on Unicode® Standard Annex #29 [1] and the ICU tokenizer [2]. If str is a cell array, then the elements of str must be string scalars or character vectors. If Language is "en" or "de", then "unicode" is the default.
- "mecab" – Tokenize Japanese and Korean text using the MeCab tokenizer [3]. If Language is "ja" or "ko", then "mecab" is the default.
- mecabOptions object – Tokenize Japanese and Korean text using the MeCab options specified by a mecabOptions object.
- "none" – Do not tokenize the input text.
If the input text has already been split into words, then
specify TokenizeMethod
to be "none"
. If
str
contains a single document, then it must be a string vector of
words, a row cell array of character vectors, or a cell array containing a single string vector
of words. If str
contains multiple documents, then it must be a cell array
of string arrays.
DetectPatterns
— Patterns of complex tokens to detect
"all"
(default) | character vector | string array | cell array of character vectors
Patterns of complex tokens to detect, specified as "none", "all", or a string or cell array containing one or more of these values:
- "email-address" – Detect email addresses. For example, treat "user@domain.com" as a single token.
- "web-address" – Detect web addresses. For example, treat "https://www.mathworks.com" as a single token.
- "hashtag" – Detect hashtags. For example, treat "#MATLAB" as a single token.
- "at-mention" – Detect at-mentions. For example, treat "@MathWorks" as a single token.
- "emoticon" – Detect emoticons. For example, treat ":-D" as a single token.
If DetectPatterns
is
"none"
, then the function does not detect any
complex token patterns. If DetectPatterns
is
"all"
, then the function detects all the
listed complex token patterns.
Example: DetectPatterns="hashtag"
Example: DetectPatterns={'email-address','web-address'}
Data Types: char
| string
| cell
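As a small sketch of selecting a subset of patterns, the following detects only hashtags and at-mentions and then inspects the resulting token types. The sample text is an assumption. The emoticon pattern is not requested here, so ":-D" is expected to be split.

```matlab
% Illustrative sample text (an assumption for this sketch).
str = "Thanks @MathWorks for the #MATLAB tips :-D";

% Detect only hashtags and at-mentions as complex tokens.
documents = tokenizedDocument(str,'DetectPatterns',["hashtag" "at-mention"]);

% The Type column shows which tokens matched a pattern.
tdetails = tokenDetails(documents);
tdetails(:,["Token" "Type"])
```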
CustomTokens
— Custom tokens to detect
''
(default) | string array | character vector | cell array of character vectors | table
Custom tokens to detect, specified as one of these values:
- A string array, character vector, or cell array of character vectors containing the custom tokens.
- A table containing the custom tokens in a column named Token and the corresponding token types in a column named Type.
If you specify the custom tokens as a string array, character
vector, or cell array of character vectors, then the function
assigns token type "custom"
. To specify a custom
token type, use table input. To view the token types, use the
tokenDetails
function.
When there are two or more conflicting custom tokens, the function uses the longest one. When a custom token conflicts with a regular expression, the function uses the regular expression.
Example: CustomTokens=["C++" "C#"]
Data Types: char
| string
| table
| cell
RegularExpressions
— Regular expressions to detect
''
(default) | string array | character vector | cell array of character vectors | table
Regular expressions to detect, specified as one of these values:
- A string array, character vector, or cell array of character vectors containing regular expressions.
- A table containing regular expressions in a column named Pattern and the corresponding token types in a column named Type.
If you specify the regular expressions as a string array,
character vector, or cell array of character vectors, then the
function assigns token type "custom"
. To specify
a custom token type, use table input. To view the token types, use
the tokenDetails
function.
When there are two or more conflicting regular expressions, the function uses the last match. When a custom token conflicts with a regular expression, the function uses the regular expression.
Example: RegularExpressions=["ver:\d+" "rev:\d+"]
Data Types: char
| string
| table
| cell
TopLevelDomains
— Top-level domains to use for web address detection
character vector | string array | cell array of character vectors
Top-level domains to use for web address detection, specified as a
character vector, string array, or cell array of character vectors.
By default, the function uses the output of the topLevelDomains
function.
This option only applies if DetectPatterns
is
"all"
or contains
"web-address"
.
Example: TopLevelDomains=["com" "net" "org"]
Data Types: char
| string
| cell
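One way to explore this option is to tokenize the same text with the default domain list and with a restricted list, then compare the token details. This is a sketch only: the sample text and the made-up domain "zzz" are assumptions, and the exact tokens produced may depend on the release.

```matlab
% Hypothetical sample text; "zzz" is a made-up domain for illustration.
str = "See example.com and example.zzz for details.";

% Default: use the domains returned by the topLevelDomains function.
docsDefault = tokenizedDocument(str);

% Restrict web address detection to a few top-level domains.
docsLimited = tokenizedDocument(str,'TopLevelDomains',["com" "net" "org"]);

% Compare how each call tokenized the text.
detailsDefault = tokenDetails(docsDefault)
detailsLimited = tokenDetails(docsLimited)
```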
Language
— Language
"en"
| "ja"
| "de"
| "ko"
Language, specified as one of these options:
- "en" – English. This option also sets the default value for TokenizeMethod to "unicode".
- "ja" – Japanese. This option also sets the default value for TokenizeMethod to "mecab".
- "de" – German. This option also sets the default value for TokenizeMethod to "unicode".
- "ko" – Korean. This option also sets the default value for TokenizeMethod to "mecab".
If you do not specify a value, then the function detects the language from the input text
using the corpusLanguage
function.
This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails
. These language details determine the behavior of the removeStopWords
,
addPartOfSpeechDetails
, normalizeWords
, addSentenceDetails
, and addEntityDetails
functions on the tokens.
For more information about language support in Text Analytics Toolbox™, see Language Considerations.
Example: Language="ja"
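The following sketch sets the language explicitly rather than relying on automatic detection, then checks the language details stored with the tokens. The sample text is an assumption.

```matlab
% Illustrative German text (an assumption for this sketch).
str = "Guten Morgen. Wie geht es dir?";

% Set the language explicitly instead of relying on detection.
documents = tokenizedDocument(str,'Language','de');

% The Language column of the token details reflects the setting.
tdetails = tokenDetails(documents);
languages = unique(tdetails.Language)
```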
Properties
Vocabulary
— Unique words in the documents
string array
Unique words in the documents, specified as a string array. The words do not appear in any particular order.
Data Types: string
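As a brief illustration, the Vocabulary property can be read directly from a tokenized document array. The sample documents are assumptions. Because the words appear in no particular order, the sketch checks membership rather than position.

```matlab
% Illustrative documents (an assumption for this sketch).
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);

% Unique words across all documents (order is not guaranteed).
vocab = documents.Vocabulary;
numWords = numel(vocab)

% Check whether a given word occurs anywhere in the documents.
hasSecond = ismember("second",vocab)
```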
Object Functions
Preprocessing
erasePunctuation | Erase punctuation from text and documents |
removeStopWords | Remove stop words from documents |
removeWords | Remove selected words from documents or bag-of-words model |
normalizeWords | Stem or lemmatize words |
correctSpelling | Correct spelling of words |
replaceWords | Replace words in documents |
replaceNgrams | Replace n-grams in documents |
removeEmptyDocuments | Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model |
lower | Convert documents to lowercase |
upper | Convert documents to uppercase |
Tokens Details
tokenDetails | Details of tokens in tokenized document array |
addSentenceDetails | Add sentence numbers to documents |
addPartOfSpeechDetails | Add part-of-speech tags to documents |
addLanguageDetails | Add language identifiers to documents |
addTypeDetails | Add token type details to documents |
addLemmaDetails | Add lemma forms of tokens to documents |
addEntityDetails | Add entity tags to documents |
addDependencyDetails | Add grammatical dependency details to documents |
Export
writeTextDocument | Write documents to text file |
Manipulation and Conversion
doclength | Length of documents in document array |
context | Search documents for word or n-gram occurrences in context |
contains | Check if pattern is substring in documents |
containsWords | Check if word is member of documents |
containsNgrams | Check if n-gram is member of documents |
splitSentences | Split text into sentences |
joinWords | Convert documents to string by joining words |
doc2cell | Convert documents to cell array of string vectors |
string | Convert scalar document to string vector |
plus | Append documents |
replace | Replace substrings in documents |
docfun | Apply function to words in documents |
regexprep | Replace text in words of documents using regular expression |
Display
wordcloud | Create word cloud chart from text, bag-of-words model, bag-of-n-grams model, or LDA model |
sentenceChart | Plot grammatical dependency parse tree of sentence |
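A few of the conversion functions listed above can be sketched together as follows. The sample documents are assumptions, and only a small selection of the object functions is shown.

```matlab
% Illustrative documents (an assumption for this sketch).
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);

% Number of tokens in each document.
n = doclength(documents);

% Convert to a cell array of string vectors, one cell per document.
words = doc2cell(documents);

% Join the words of each document back into a single string.
str = joinWords(documents);

% Convert the documents to uppercase, then join them for display.
upperStr = joinWords(upper(documents))
```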
Examples
Tokenize Text
Create tokenized documents from a string array.
str = [
    "an example of a short sentence"
    "a second short sentence"]
str = 2x1 string
    "an example of a short sentence"
    "a second short sentence"
documents = tokenizedDocument(str)
documents = 2x1 tokenizedDocument:
    6 tokens: an example of a short sentence
    4 tokens: a second short sentence
Detect Complex Tokens
Create a tokenized document from the string str
. By default, the function treats the hashtag "#MATLAB"
, the emoticon ":-D"
, and the web address "https://www.mathworks.com/help/"
as single tokens.
str = "Learn how to analyze text in #MATLAB! :-D see https://www.mathworks.com/help/";
document = tokenizedDocument(str)
document = tokenizedDocument: 11 tokens: Learn how to analyze text in #MATLAB ! :-D see https://www.mathworks.com/help/
To detect only hashtags as complex tokens, specify the 'DetectPatterns'
option to be 'hashtag'
only. The function then tokenizes the emoticon ":-D"
and the web address "https://www.mathworks.com/help/"
into multiple tokens.
document = tokenizedDocument(str,'DetectPatterns','hashtag')
document = tokenizedDocument: 24 tokens: Learn how to analyze text in #MATLAB ! : - D see https : / / www . mathworks . com / help /
Remove Stop Words from Documents
Remove the stop words from an array of documents using removeStopWords
. The tokenizedDocument
function detects that the documents are in English, so removeStopWords
removes English stop words.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
newDocuments = removeStopWords(documents)
newDocuments = 2x1 tokenizedDocument:
    3 tokens: example short sentence
    3 tokens: second short sentence
Stem Words in Documents
Stem the words in a document array using the Porter stemmer.
documents = tokenizedDocument([
    "a strongly worded collection of words"
    "another collection of words"]);
newDocuments = normalizeWords(documents)
newDocuments = 2x1 tokenizedDocument:
    6 tokens: a strongli word collect of word
    4 tokens: anoth collect of word
Specify Custom Tokens
The tokenizedDocument
function, by default, splits words and tokens that contain symbols. For example, the function splits "C++" and "C#" into multiple tokens.
str = "I am experienced in MATLAB, C++, and C#.";
documents = tokenizedDocument(str)
documents = tokenizedDocument: 14 tokens: I am experienced in MATLAB , C + + , and C # .
To prevent the function from splitting tokens that contain symbols, specify custom tokens using the 'CustomTokens'
option.
documents = tokenizedDocument(str,'CustomTokens',["C++" "C#"])
documents = tokenizedDocument: 11 tokens: I am experienced in MATLAB , C++ , and C# .
The custom tokens have token type "custom"
. View the token details. The column Type
contains the token types.
tdetails = tokenDetails(documents)
tdetails=11×5 table
Token DocumentNumber LineNumber Type Language
_____________ ______________ __________ ___________ ________
"I" 1 1 letters en
"am" 1 1 letters en
"experienced" 1 1 letters en
"in" 1 1 letters en
"MATLAB" 1 1 letters en
"," 1 1 punctuation en
"C++" 1 1 custom en
"," 1 1 punctuation en
"and" 1 1 letters en
"C#" 1 1 custom en
"." 1 1 punctuation en
To specify your own token types, input the custom tokens as a table with the tokens in a column named Token
, and the types in a column named Type
. To assign a custom type to a token that doesn't include symbols, include it in the table too. For example, create a table that assigns "MATLAB", "C++", and "C#" to the "programming-language"
token type.
T = table;
T.Token = ["MATLAB" "C++" "C#"]';
T.Type = ["programming-language" "programming-language" "programming-language"]'
T=3×2 table
Token Type
________ ______________________
"MATLAB" "programming-language"
"C++" "programming-language"
"C#" "programming-language"
Tokenize the text using the table of custom tokens and view the token details.
documents = tokenizedDocument(str,'CustomTokens',T);
tdetails = tokenDetails(documents)
tdetails=11×5 table
Token DocumentNumber LineNumber Type Language
_____________ ______________ __________ ____________________ ________
"I" 1 1 letters en
"am" 1 1 letters en
"experienced" 1 1 letters en
"in" 1 1 letters en
"MATLAB" 1 1 programming-language en
"," 1 1 punctuation en
"C++" 1 1 programming-language en
"," 1 1 punctuation en
"and" 1 1 letters en
"C#" 1 1 programming-language en
"." 1 1 punctuation en
Specify Custom Tokens Using Regular Expressions
The tokenizedDocument
function, by default, splits words and tokens containing symbols. For example, the function splits the text "ver:2"
into multiple tokens.
str = "Upgraded to ver:2 rev:3.";
documents = tokenizedDocument(str)
documents = tokenizedDocument: 9 tokens: Upgraded to ver : 2 rev : 3 .
To prevent the function from splitting tokens that have particular patterns, specify those patterns using the 'RegularExpressions'
option.
Specify regular expressions to detect tokens denoting version and revision numbers: strings of digits appearing after "ver:"
and "rev:"
respectively.
documents = tokenizedDocument(str,'RegularExpressions',["ver:\d+" "rev:\d+"])
documents = tokenizedDocument: 5 tokens: Upgraded to ver:2 rev:3 .
Custom tokens, by default, have token type "custom"
. View the token details. The column Type
contains the token types.
tdetails = tokenDetails(documents)
tdetails=5×5 table
Token DocumentNumber LineNumber Type Language
__________ ______________ __________ ___________ ________
"Upgraded" 1 1 letters en
"to" 1 1 letters en
"ver:2" 1 1 custom en
"rev:3" 1 1 custom en
"." 1 1 punctuation en
To specify your own token types, input the regular expressions as a table with the regular expressions in a column named Pattern
and the token types in a column named Type
.
T = table;
T.Pattern = ["ver:\d+" "rev:\d+"]';
T.Type = ["version" "revision"]'
T=2×2 table
Pattern Type
_________ __________
"ver:\d+" "version"
"rev:\d+" "revision"
Tokenize the text using the table of regular expressions and view the token details.
documents = tokenizedDocument(str,'RegularExpressions',T);
tdetails = tokenDetails(documents)
tdetails=5×5 table
Token DocumentNumber LineNumber Type Language
__________ ______________ __________ ___________ ________
"Upgraded" 1 1 letters en
"to" 1 1 letters en
"ver:2" 1 1 version en
"rev:3" 1 1 revision en
"." 1 1 punctuation en
Search Documents for Word Occurrences
Load the example data. The file sonnetsPreprocessed.txt
contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt
, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
Search for the word "life".
tbl = context(documents,"life");
head(tbl)
Context Document Word
________________________________________________________ ________ ____
"consumst thy self single life ah thou issueless shalt " 9 10
"ainted counterfeit lines life life repair times pencil" 16 35
"d counterfeit lines life life repair times pencil pupi" 16 36
" heaven knows tomb hides life shows half parts write b" 17 14
"he eyes long lives gives life thee " 18 69
"tender embassy love thee life made four two alone sink" 45 23
"ves beauty though lovers life beauty shall black lines" 63 50
"s shorn away live second life second head ere beautys " 68 27
View the occurrences in a string array.
tbl.Context
ans = 23x1 string
"consumst thy self single life ah thou issueless shalt "
"ainted counterfeit lines life life repair times pencil"
"d counterfeit lines life life repair times pencil pupi"
" heaven knows tomb hides life shows half parts write b"
"he eyes long lives gives life thee "
"tender embassy love thee life made four two alone sink"
"ves beauty though lovers life beauty shall black lines"
"s shorn away live second life second head ere beautys "
"e rehearse let love even life decay lest wise world lo"
"st bail shall carry away life hath line interest memor"
"art thou hast lost dregs life prey worms body dead cow"
" thoughts food life sweetseasond showers gro"
"tten name hence immortal life shall though once gone w"
" beauty mute others give life bring tomb lives life fa"
"ve life bring tomb lives life fair eyes poets praise d"
" steal thyself away term life thou art assured mine li"
"fe thou art assured mine life longer thy love stay dep"
" fear worst wrongs least life hath end better state be"
"anst vex inconstant mind life thy revolt doth lie o ha"
" fame faster time wastes life thou preventst scythe cr"
"ess harmful deeds better life provide public means pub"
"ate hate away threw savd life saying "
" many nymphs vowd chaste life keep came tripping maide"
Tokenize Japanese Text
Tokenize Japanese text using tokenizedDocument
. The function automatically detects Japanese text.
str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];
documents = tokenizedDocument(str)
documents = 4x1 tokenizedDocument:
    6 tokens: 恋 に 悩み 、 苦しむ 。
    6 tokens: 恋 の 悩み で 苦しむ 。
    10 tokens: 空 に 星 が 輝き 、 瞬い て いる 。
    10 tokens: 空 の 星 が 輝き を 増し て いる 。
Tokenize German Text
Tokenize German text using tokenizedDocument
. The function automatically detects German text.
str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)
documents = 2x1 tokenizedDocument:
    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .
More About
Language Considerations
The tokenizedDocument
function has built-in rules for English, Japanese, German,
and Korean only. For English and German text, the 'unicode'
tokenization
method of tokenizedDocument
detects tokens using rules based on Unicode Standard Annex #29 [1] and the ICU tokenizer [2], modified to better detect
complex tokens such as hashtags and URLs. For Japanese and Korean text, the
'mecab'
tokenization method detects tokens using rules based on the
MeCab tokenizer [3].
For other languages, you can still try using tokenizedDocument
. If
tokenizedDocument
does not produce useful
results, then try tokenizing the text manually. To create a
tokenizedDocument
array from manually tokenized
text, set the 'TokenizeMethod'
option to
'none'
.
For more information, see Language Considerations.
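The manual route described above might look like the following sketch. The sample text and the whitespace-based split are assumptions. A plain split suits only languages that separate words with spaces, and real text would usually need more careful handling of punctuation.

```matlab
% Illustrative text in a language without built-in rules (an assumption).
textData = [
    "uno dos tres"
    "cuatro cinco"];

% Tokenize manually: split each document at whitespace into a row
% vector of words, and collect the results in a cell array.
words = arrayfun(@(s) split(s)',textData,'UniformOutput',false);

% Build the tokenized documents without running the built-in tokenizer.
documents = tokenizedDocument(words,'TokenizeMethod','none');
n = doclength(documents)
```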
References
[1] Unicode Text Segmentation. https://www.unicode.org/reports/tr29/
[2] Boundary Analysis. https://unicode-org.github.io/icu/userguide/boundaryanalysis/
[3] MeCab: Yet Another Part-of-Speech and Morphological Analyzer. https://taku910.github.io/mecab/
Version History
Introduced in R2017b
R2022a: tokenizedDocument
does not split tokens containing digits and some special characters
Starting in R2022a, tokenizedDocument
does not split some tokens
where digits appear next to certain special characters, such as periods, hyphens,
colons, and slashes, or where numbers are written in scientific notation. This behavior can produce better results
when tokenizing text containing numbers, dates, and times.
In previous versions, tokenizedDocument
might split at these
characters. To reproduce the behavior, tokenize the text manually or insert
whitespace characters around special characters before using
tokenizedDocument
.
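The whitespace workaround could be sketched as follows. The sample text and the set of characters to pad are assumptions, so adjust the character class to the characters at which you want to split.

```matlab
% Illustrative sample text (an assumption for this sketch).
str = "Meeting at 10:30 on 2022-03-01";

% Current behavior: some runs of digits and special characters
% can stay together as single tokens.
docsDefault = tokenizedDocument(str);

% Workaround: pad selected special characters with spaces so that
% they become separate tokens.
padded = regexprep(str,"([.:/-])"," $1 ");
docsSplit = tokenizedDocument(padded);

% Compare the token counts of the two results.
counts = [doclength(docsDefault) doclength(docsSplit)]
```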
R2019b: tokenizedDocument
detects Korean language
Starting in R2019b, tokenizedDocument
detects the Korean language
and sets the 'Language'
option to 'ko'
. This changes the
default behavior of the addSentenceDetails
, addPartOfSpeechDetails
, removeStopWords
, and normalizeWords
functions for Korean document input. This change
allows the software to use Korean-specific rules and word lists for analysis. If
tokenizedDocument
incorrectly detects text as Korean, then you
can specify the language manually by setting the 'Language'
name-value pair of tokenizedDocument
.
In previous versions, tokenizedDocument
usually detects Korean
text as English and sets the 'Language'
option to 'en'
. To reproduce this
behavior, manually set the 'Language'
name-value pair of tokenizedDocument
to 'en'
.
R2018b: tokenizedDocument
detects emoticons
Starting in R2018b, tokenizedDocument
, by default, detects
emoticon tokens. This behavior makes it easier to analyze text containing
emoticons.
In R2017b and R2018a, tokenizedDocument
splits emoticon tokens
into multiple tokens. To reproduce this behavior, in
tokenizedDocument
, specify the
'DetectPatterns'
option to be
{'email-address','web-address','hashtag','at-mention'}
.
R2018b: tokenDetails
returns token type emoji
for emoji characters
Starting in R2018b, tokenizedDocument
detects emoji characters and the tokenDetails
function reports these tokens with type
"emoji"
. This makes it easier to analyze text containing emoji
characters.
In R2018a, tokenDetails
reports emoji characters with type "other"
.
To find the indices of the tokens with type "emoji"
or
"other"
, use the indices idx = tdetails.Type == "emoji" |
tdetails.Type == "other"
, where tdetails
is a table of
token details.
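In context, the index expression above might be used as in this sketch. The sample text is an assumption, and which type an emoji token receives depends on the release.

```matlab
% Illustrative sample text containing an emoji (an assumption).
documents = tokenizedDocument("Great job! 😀");
tdetails = tokenDetails(documents);

% Logical index of tokens reported as "emoji" or "other".
idx = tdetails.Type == "emoji" | tdetails.Type == "other";

% Select the matching rows of the token details table.
selected = tdetails(idx,:)
```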
R2018b: tokenizedDocument
does not split at slash and colon characters between digits
Starting in R2018b, tokenizedDocument
does not split at slash,
backslash, or colon characters when they appear between two digits. This behavior
can produce better results when tokenizing text containing dates and times.
In previous versions, tokenizedDocument
splits at these
characters. To reproduce the behavior, tokenize the text manually or insert
whitespace characters around slash, backslash, and colon characters before using
tokenizedDocument
.
See Also
removeWords
| removeStopWords
| normalizeWords
| removeEmptyDocuments
| addSentenceDetails
| addPartOfSpeechDetails
| tokenDetails
| context
| joinWords
| bagOfWords
| bagOfNgrams
| replaceWords
| replaceNgrams
| addEntityDetails