textrankKeywords
Description
tbl = textrankKeywords(documents) extracts keywords and respective scores using TextRank. The function supports English, Japanese, German, and Korean text. For other languages, try using the rakeKeywords function instead.
tbl = textrankKeywords(documents,Name,Value) specifies additional options using one or more name-value pair arguments.
Examples
Extract Keywords Using TextRank
Create an array of tokenized documents containing the text data.
textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful tools in MATLAB."
    "MATLAB and Simulink have many features. Use MATLAB and Simulink for engineering workflows."
    "Analyze text and images in MATLAB. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);
Extract the keywords using the textrankKeywords function.
tbl = textrankKeywords(documents)
tbl=7×3 table
Keyword DocumentNumber Score
_________________________________ ______________ ______
"many" "useful" "tools" 1 5.2174
"useful" "tools" "" 1 3.8778
"many" "features" "" 2 4.0815
"text" "" "" 3 1
"images" "" "" 3 1
"MATLAB" "" "" 3 1
"videos" "" "" 3 1
If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".
For readability, transform the multi-word keywords into a single string using the join and strip functions.
if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
tbl
tbl=7×3 table
Keyword DocumentNumber Score
___________________ ______________ ______
"many useful tools" 1 5.2174
"useful tools" 1 3.8778
"many features" 2 4.0815
"text" 3 1
"images" 3 1
"MATLAB" 3 1
"videos" 3 1
Specify Maximum Number of Keywords Per Document
Create an array of tokenized documents containing the text data.
textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful MATLAB toolboxes."
    "MATLAB and Simulink have many features. Use MATLAB and Simulink for engineering workflows."
    "Analyze text and images in MATLAB. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);
Extract the top two keywords per document using the textrankKeywords function, setting the 'MaxNumKeywords' option to 2.
tbl = textrankKeywords(documents,'MaxNumKeywords',2)
tbl=5×3 table
Keyword DocumentNumber Score
_____________________________________ ______________ ______
"useful" "MATLAB" "toolboxes" 1 4.8695
"useful" "" "" 1 2.3612
"many" "features" "" 2 4.0815
"text" "" "" 3 1
"images" "" "" 3 1
If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".
For readability, transform the multi-word keywords into a single string using the join and strip functions.
if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
tbl
tbl=5×3 table
Keyword DocumentNumber Score
_________________________ ______________ ______
"useful MATLAB toolboxes" 1 4.8695
"useful" 1 2.3612
"many features" 2 4.0815
"text" 3 1
"images" 3 1
Input Arguments
documents — Input documents
tokenizedDocument array | string array | cell array of character vectors
Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
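As a minimal sketch of the non-tokenizedDocument input form, the following passes a single document as a row vector of words. The word list is made up for illustration. The keywords and scores depend on the part-of-speech tags that the function assigns, so no output is shown.

```matlab
% A single document given as a 1-by-N string array, one word per element.
% (Illustrative words only.)
words = ["MATLAB" "provides" "really" "useful" "tools" "for" "engineers" "."];

% The input is not a tokenizedDocument array, so it represents one document.
tbl = textrankKeywords(words);
```

To process several documents in one call, build a tokenizedDocument array instead, as in the examples above.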
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: textrankKeywords(documents,'MaxNumKeywords',20) returns at most 20 keywords per document.
MaxNumKeywords — Maximum number of keywords to return per document
Inf (default) | positive integer
Maximum number of keywords to return per document, specified as a positive integer or Inf.
If MaxNumKeywords is Inf, then the function returns all identified keywords.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Window — Size of co-occurrence window
2 (default) | positive integer | Inf
Size of the co-occurrence window, specified as the comma-separated pair consisting of 'Window' and a positive integer or Inf.
When the window size is 2, the function considers a co-occurrence between two candidate keywords only when they appear consecutively in a document. When the window size is Inf, the function considers a co-occurrence between two candidate keywords whenever they both appear in the same document.
Increasing the window size enables the function to find more co-occurrences between keywords, which increases the keyword importance scores. This can result in finding more relevant keywords at the cost of potentially over-scoring less relevant keywords.
For more information, see TextRank Keyword Extraction.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
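To see the effect of the window size on your own data, one option is to run the function twice and compare the tables. The following is a sketch using a sentence from the examples above. The variable names are illustrative, and no output is shown because the scores depend on the data.

```matlab
str = "MATLAB provides really useful tools for engineers. Scientists use many useful tools in MATLAB.";
documents = tokenizedDocument(str);

% Default behavior: only consecutive candidate keywords co-occur.
tblAdjacent = textrankKeywords(documents,'Window',2);

% Any two candidate keywords in the same document co-occur.
tblWholeDocument = textrankKeywords(documents,'Window',Inf);
```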
PartOfSpeech — Part-of-speech tags
["noun" "proper-noun" "adjective"] (default) | string array | cell array of character vectors | character vector | categorical array
Part-of-speech tags to use to extract candidate keywords, specified as the comma-separated pair consisting of 'PartOfSpeech' and a string array, cell array of character vectors, character vector, or categorical array containing one or more of the following class names:
adjective — Adjective
adposition — Adposition
adverb — Adverb
auxiliary-verb — Auxiliary verb
coord-conjunction — Coordinating conjunction
determiner — Determiner
interjection — Interjection
noun — Noun
numeral — Numeral
particle — Particle
pronoun — Pronoun
proper-noun — Proper noun
punctuation — Punctuation
subord-conjunction — Subordinating conjunction
symbol — Symbol
verb — Verb
other — Other
If PartOfSpeech is a character vector, then it must correspond to a single part-of-speech tag.
For more information, see TextRank Keyword Extraction.
Data Types: char | string | cell | categorical
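For instance, to exclude adjectives from the candidate keywords, you could pass only the noun tags. This is a sketch using a sentence from the examples above; the choice of tags is illustrative, not a recommendation.

```matlab
str = "MATLAB and Simulink have many features. Use MATLAB and Simulink for engineering workflows.";
documents = tokenizedDocument(str);

% Restrict candidate keywords to nouns and proper nouns (drops the default "adjective" tag).
tbl = textrankKeywords(documents,'PartOfSpeech',["noun" "proper-noun"]);
```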
Output Arguments
tbl — Extracted keywords and scores
table
Extracted keywords and scores, returned as a table with the following variables:
Keyword – Extracted keyword, returned as a 1-by-maxNgramLength string array, where maxNgramLength is the number of words in the longest keyword.
DocumentNumber – Number of the document containing the corresponding keyword.
Score – Score of the keyword.
The function merges multiple keywords into a single keyword when they appear consecutively in the corresponding document.
If a keyword contains multiple words, then the ith element of the corresponding string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".
For more information, see TextRank Keyword Extraction.
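Because the output is an ordinary table, standard table operations apply. The sketch below rebuilds a few rows of the joined output from the first example by hand (so that it runs without recomputing the keywords) and then sorts and filters them. The threshold of 2 is an arbitrary value chosen for illustration.

```matlab
% Rows copied from the first example after joining multi-word keywords.
Keyword = ["many useful tools"; "useful tools"; "many features"; "text"];
DocumentNumber = [1; 1; 2; 3];
Score = [5.2174; 3.8778; 4.0815; 1];
tbl = table(Keyword,DocumentNumber,Score);

% Rank all keywords across documents by score.
sorted = sortrows(tbl,'Score','descend');

% Keep only keywords whose score exceeds an (arbitrary) threshold.
highScoring = tbl(tbl.Score > 2,:);
```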
More About
Language Considerations
The textrankKeywords function supports English, Japanese, German, and Korean text only.
The textrankKeywords function extracts keywords by identifying candidate keywords based on their part-of-speech tag. The function uses part-of-speech tags given by the addPartOfSpeechDetails function, which supports English, Japanese, German, and Korean text only.
For other languages, try using the rakeKeywords function instead and specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options.
Tips
You can experiment with different keyword extraction algorithms to see what works best with your data. Because the TextRank keywords algorithm uses a part-of-speech tag-based approach to extract candidate keywords, the extracted keywords can be short. Alternatively, you can try extracting keywords using the RAKE algorithm, which extracts sequences of tokens appearing between delimiters as candidate keywords. To extract keywords using RAKE, use the rakeKeywords function. To learn more, see Extract Keywords from Text Data Using RAKE.
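One way to follow this tip is to run both functions on the same documents and inspect the two tables side by side. A minimal sketch, reusing text from the examples above with default options for both functions:

```matlab
textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful tools in MATLAB."
    "Analyze text and images in MATLAB. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);

% Part-of-speech based candidates (TextRank).
tblTextRank = textrankKeywords(documents);

% Delimiter based candidates (RAKE).
tblRake = rakeKeywords(documents);
```

The scores of the two algorithms are computed differently, so compare the extracted phrases rather than the score values.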
Algorithms
TextRank Keyword Extraction
For each document, the textrankKeywords function extracts keywords independently using the following steps based on [1]:
1. Determine candidate keywords:
   - Extract tokens with part-of-speech specified by the 'PartOfSpeech' option.
2. Calculate scores for each candidate:
   - Create an undirected, unweighted graph with nodes corresponding to the candidate keywords.
   - Add edges between nodes where candidate keywords appear within a window of tokens, where the window size is given by the 'Window' option.
   - Compute the centrality of each node using the PageRank algorithm and weight the scores according to the number of candidate keywords. For more information, see centrality.
3. Extract top keywords from candidates:
   - Select the top third of the candidate keywords according to their scores.
   - If any of the candidate keywords appear consecutively in a document, then merge them into a single keyword and sum the corresponding scores.
   - Return the top k keywords, where k is given by the 'MaxNumKeywords' option.
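The scoring step above can be sketched with the MATLAB graph and centrality functions. This is an illustration only, not the implementation of textrankKeywords. The token list is hypothetical and stands in for the candidate tokens of one document after part-of-speech filtering. Multiplying by the number of candidates is an assumed reading of "weight the scores according to the number of candidate keywords". The selection and merging steps are omitted.

```matlab
% Hypothetical candidate tokens of one document, in order of appearance.
tokens = ["useful" "tools" "engineers" "useful" "tools" "MATLAB"];
windowSize = 2;   % plays the role of the 'Window' option

% Unique candidates become graph nodes; idx maps each token to its node.
[candidates,~,idx] = unique(tokens,'stable');
numCandidates = numel(candidates);

% Connect distinct candidates that appear within the same window.
A = zeros(numCandidates);
for i = 1:numel(tokens)
    lastInWindow = min(i + windowSize - 1, numel(tokens));
    for j = i+1:lastInWindow
        if idx(i) ~= idx(j)
            A(idx(i),idx(j)) = 1;
            A(idx(j),idx(i)) = 1;
        end
    end
end

% Undirected, unweighted graph and PageRank centrality of each node.
G = graph(A);
scores = centrality(G,'pagerank') * numCandidates;   % assumed weighting

ranked = sortrows(table(candidates(:),scores, ...
    'VariableNames',{'Candidate','Score'}),'Score','descend');
```

With a window size of 2, only adjacent tokens are linked, so a candidate that neighbors many different candidates ends up with the highest centrality.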
Language Details
tokenizedDocument objects contain details about the tokens, including language details. The language details of the input documents determine the behavior of textrankKeywords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the Language option of tokenizedDocument. To view the token details, use the tokenDetails function.
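As a sketch of setting the language manually, the following tokenizes a short German sentence (made up for illustration) with the Language option and then inspects the token details before extracting keywords.

```matlab
str = "Gute Werkzeuge sparen viel Zeit.";

% Set the language details explicitly instead of relying on detection.
documents = tokenizedDocument(str,'Language','de');

% Inspect the token details, including the language of each token.
tdetails = tokenDetails(documents);

tbl = textrankKeywords(documents);
```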
References
[1] Mihalcea, Rada, and Paul Tarau. "TextRank: Bringing Order into Text." In Proceedings of the 2004 conference on empirical methods in natural language processing, pp. 404-411. 2004.
Version History
Introduced in R2020b