rakeKeywords

Extract keywords using RAKE

Since R2020b

    Description

    tbl = rakeKeywords(documents) extracts keywords and respective scores using the Rapid Automatic Keyword Extraction (RAKE) algorithm. The function supports English, Japanese, German, and Korean text. To learn how to use rakeKeywords for other languages, see Language Considerations.

    tbl = rakeKeywords(documents,Name=Value) specifies additional options using one or more name-value arguments.

    Tip

    The rakeKeywords function, by default, uses stop words and punctuation characters as delimiters to identify keywords. When using the default values for the Delimiters and MergingDelimiters options, do not remove stop words or punctuation characters from the input text.

    Examples

    Create an array of tokenized documents containing the text data.

    textData = [
        "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
        "Analyze text and images. You can import text and images."
        "Analyze text and images. Analyze text, images, and videos in MATLAB."];
    documents = tokenizedDocument(textData);

    Extract the keywords using the rakeKeywords function.

    tbl = rakeKeywords(documents)
    tbl=12×3 table
                         Keyword                     DocumentNumber    Score
        _________________________________________    ______________    _____
    
        "MATLAB"        "provides"    "tools"              1             8  
        "MATLAB"        ""            ""                   1             2  
        "scientists"    "and"         "engineers"          1             2  
        "scientists"    ""            ""                   1             1  
        "engineers"     ""            ""                   1             1  
        "Analyze"       "text"        ""                   2             4  
        "import"        "text"        ""                   2             4  
        "images"        ""            ""                   2             1  
        "Analyze"       "text"        ""                   3             4  
        "images"        ""            ""                   3             1  
        "videos"        ""            ""                   3             1  
        "MATLAB"        ""            ""                   3             1  
    
    

    If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

    For readability, transform the multi-word keywords into a single string using the join and strip functions.

    if size(tbl.Keyword,2) > 1
        tbl.Keyword = strip(join(tbl.Keyword));
    end
    tbl
    tbl=12×3 table
                 Keyword              DocumentNumber    Score
        __________________________    ______________    _____
    
        "MATLAB provides tools"             1             8  
        "MATLAB"                            1             2  
        "scientists and engineers"          1             2  
        "scientists"                        1             1  
        "engineers"                         1             1  
        "Analyze text"                      2             4  
        "import text"                       2             4  
        "images"                            2             1  
        "Analyze text"                      3             4  
        "images"                            3             1  
        "videos"                            3             1  
        "MATLAB"                            3             1  
    
    

    Create an array of tokenized documents containing the text data.

    textData = [
        "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
        "Analyze text and images. You can import text and images."
        "Analyze text and images. Analyze text, images, and videos in MATLAB."];
    documents = tokenizedDocument(textData);

    Extract the top two keywords using the rakeKeywords function and setting the MaxNumKeywords option to 2.

    tbl = rakeKeywords(documents,MaxNumKeywords=2)
    tbl=6×3 table
                     Keyword                  DocumentNumber    Score
        __________________________________    ______________    _____
    
        "MATLAB"     "provides"    "tools"          1             8  
        "MATLAB"     ""            ""               1             2  
        "Analyze"    "text"        ""               2             4  
        "import"     "text"        ""               2             4  
        "Analyze"    "text"        ""               3             4  
        "images"     ""            ""               3             1  
    
    

    If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

    For readability, transform the multi-word keywords into a single string using the join and strip functions.

    if size(tbl.Keyword,2) > 1
        tbl.Keyword = strip(join(tbl.Keyword));
    end
    tbl
    tbl=6×3 table
                Keyword            DocumentNumber    Score
        _______________________    ______________    _____
    
        "MATLAB provides tools"          1             8  
        "MATLAB"                         1             2  
        "Analyze text"                   2             4  
        "import text"                    2             4  
        "Analyze text"                   3             4  
        "images"                         3             1  
    
    

    Input Arguments

    documents – Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
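
    A single document can also be passed without creating a tokenizedDocument array. The following is a minimal sketch; the word list is illustrative and does not come from the examples above, and no output is shown because it depends on the default delimiters.

```matlab
% A single document given as a row vector of words (illustrative data).
words = ["MATLAB" "provides" "tools" "for" "scientists" "and" "engineers" "."];

% Each element is one token. Punctuation and stop words are kept so that
% the default delimiters can split the text into candidate keywords.
tbl = rakeKeywords(words);
```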

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

    Example: rakeKeywords(documents,MaxNumKeywords=20) returns at most 20 keywords per document.
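
    The two calling conventions described above can be compared directly. This is a brief sketch using one sentence pair from the examples; the variable names tblNew and tblOld are illustrative.

```matlab
documents = tokenizedDocument("Analyze text and images. You can import text and images.");

% R2021a and later: Name=Value syntax.
tblNew = rakeKeywords(documents,MaxNumKeywords=20);

% Before R2021a: comma-separated name and value, with the name in quotes.
tblOld = rakeKeywords(documents,"MaxNumKeywords",20);

% Both calls request the same option, so the tables should match.
tf = isequal(tblNew,tblOld);
```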

    MaxNumKeywords – Maximum number of keywords to return per document, specified as a positive integer or Inf.

    If MaxNumKeywords is Inf, then the function returns all identified keywords.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    IgnoreKeywordCase – Option to ignore keyword case, specified as one of the following:

    • 0 (false) – extract case-sensitive keywords.

    • 1 (true) – extract keywords ignoring case. Use this option when you expect the same keywords to appear with variations in letter case and want to treat them as the same keyword, for example, the words "analytics", "Analytics", and "ANALYTICS".

    When IgnoreKeywordCase is 1, the function returns keywords with the most commonly occurring letter case pattern. When two or more patterns appear with the same frequency, then the function returns the keyword with the letter case pattern that occurs first in the input.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical
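
    To see the effect of IgnoreKeywordCase on your own data, you can run the extraction both ways and compare the tables. The sketch below uses made-up text in which one word appears with three letter-case patterns; no output is shown because the results depend on the default delimiters.

```matlab
% Illustrative text with "analytics", "Analytics", and "ANALYTICS".
str = "Text analytics is useful. Analytics tools vary. ANALYTICS matters.";
documents = tokenizedDocument(str,Language="en");

% Case-sensitive extraction: case variants are separate keywords.
tblSensitive = rakeKeywords(documents,IgnoreKeywordCase=false);

% Case-insensitive extraction: case variants are treated as one keyword.
tblInsensitive = rakeKeywords(documents,IgnoreKeywordCase=true);
```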

    Delimiters – Tokens for splitting documents into keywords, specified as a string array, a character vector, or a cell array of character vectors. If Delimiters is a character vector, then it must represent a single delimiter.

    The default list of delimiters is a list of punctuation characters.

    If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

    To specify delimiters for merging, use the MergingDelimiters option.

    Data Types: char | string | cell

    MergingDelimiters – Delimiters also used for merging keywords, specified as a string array, a character vector, or a cell array of character vectors. If MergingDelimiters is a character vector, then it must represent a single delimiter.

    The default list of merging delimiters is the list of stop words given by the stopWords function.

    If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

    To specify delimiters that should not be used for merging, use the Delimiters option.

    Data Types: char | string | cell

    IgnoreDelimiterCase – Option to ignore delimiter case, specified as one of the following:

    • 1 (true) – ignore delimiter case.

    • 0 (false) – use case-sensitive delimiters. Use this option when you expect keywords and delimiters that differ only by case, for example, the delimiter "and" and the acronym "AND".

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical
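
    The following sketch shows case-sensitive delimiters applied to text that contains the acronym "AND". It assumes that the option described above is named IgnoreDelimiterCase, by analogy with IgnoreKeywordCase; the sentence is illustrative and no output is shown.

```matlab
% Illustrative text in which "AND" is an acronym and "and" is a stop word.
str = "The AND gate and the OR gate are basic logic gates.";
documents = tokenizedDocument(str,Language="en");

% Assumed option name: IgnoreDelimiterCase. With case-sensitive delimiters,
% only the lowercase "and" is intended to act as a delimiter.
tbl = rakeKeywords(documents,IgnoreDelimiterCase=false);
```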

    Output Arguments

    Extracted keywords and scores, returned as a table with the following variables:

    • Keyword – Extracted keyword, returned as a 1-by-maxNgramLength string array, where maxNgramLength is the number of words in the longest keyword.

    • DocumentNumber – Number of the document containing the corresponding keyword.

    • Score – Score of the keyword.

    If multiple candidate keywords appear in a document separated only by merging delimiters, then the function merges those keywords and the merging delimiters into a single keyword.

    If a keyword contains multiple words, then the ith element of the corresponding string array is the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

    For more information, see Rapid Automatic Keyword Extraction.

    More About

    Language Considerations

    The rakeKeywords function supports English, Japanese, German, and Korean text only.

    The rakeKeywords function extracts keywords using a delimiter-based approach to identify candidate keywords. By default, the function uses punctuation characters and the stop words given by the stopWords function as delimiters, where the stop word language is determined by the language details of the input documents.

    For other languages, specify an appropriate set of delimiters using the Delimiters and MergingDelimiters options.
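
    As a minimal sketch of this approach, the code below passes a Spanish sentence as a row vector of words together with hand-written delimiter lists. The word list, the punctuation list, and the stop word list are illustrative assumptions and are far from complete; in practice, use a stop word list suited to your language and data.

```matlab
% A single Spanish document as a row vector of words (illustrative data).
words = ["El" "análisis" "de" "texto" "y" "el" "análisis" "de" "imágenes" ...
    "son" "útiles" "."];

% Hand-picked delimiter lists for illustration only.
punctuation = ["." "," ";" ":" "?" "!"];
spanishStopWords = ["el" "la" "los" "las" "de" "y" "son" "un" "una"];

% Punctuation only splits keywords; stop words split and can also merge.
tbl = rakeKeywords(words, ...
    Delimiters=punctuation, ...
    MergingDelimiters=spanishStopWords);
```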

    Tips

    • You can experiment with different keyword extraction algorithms to see what works best with your data. Because the RAKE algorithm uses a delimiter-based approach to extract candidate keywords, the extracted keywords can be very long. Alternatively, you can try extracting keywords using the TextRank algorithm, which starts with individual tokens as candidate keywords and then merges them when appropriate. To extract keywords using TextRank, use the textrankKeywords function. To learn more, see Extract Keywords from Text Data Using TextRank.

    Algorithms

    Rapid Automatic Keyword Extraction

    For each document, the rakeKeywords function extracts keywords independently using the following steps based on [1]:

    1. Determine candidate keywords:

      • Extract sequences of tokens between the delimiters specified by the Delimiters and MergingDelimiters options. The function treats each sequence as a single candidate keyword.

    2. Calculate scores for the candidate keywords:

      • Create an undirected graph with nodes corresponding to the individual tokens in the candidate keywords.

      • Add edges between nodes where tokens co-occur in a candidate keyword, including self co-occurrences, weighted by the number of candidate keywords containing that co-occurrence.

      • Score each token using the formula deg(token) / freq(token), where deg(token) is the sum of the weights of the edges for the specified token, including its self co-occurrence edge, and freq(token) is the number of times that the specified token occurs in the document.

      • For each candidate keyword, assign a score given by the sum of scores of the contained tokens.

    3. Extract top keywords from candidates:

      • If there are multiple instances of the same pair of candidate keywords separated by the same single merging delimiter, then merge the candidate keywords and the delimiter into a single keyword and sum the corresponding scores.

      • Return the top k keywords, where k is given by the MaxNumKeywords option.
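
    The scoring steps above can be sketched in base MATLAB for the first document of the example. The candidate keywords are written out by hand, which is an assumption about how the default delimiters split the text, and variable names such as cooc and tokenScore are illustrative. This is a sketch of the described formula, not the internal implementation of rakeKeywords.

```matlab
% Candidate keywords for "MATLAB provides tools for scientists and
% engineers. MATLAB is used by scientists and engineers." (assumed split).
candidates = {
    ["MATLAB" "provides" "tools"]
    "scientists"
    "engineers"
    "MATLAB"
    "scientists"
    "engineers"};

tokens = unique([candidates{:}]);   % distinct tokens in the candidates
numTokens = numel(tokens);

% Weighted co-occurrence counts, including self co-occurrences.
% This update assumes no token repeats within a single candidate.
cooc = zeros(numTokens);
for i = 1:numel(candidates)
    [~,idx] = ismember(candidates{i},tokens);
    cooc(idx,idx) = cooc(idx,idx) + 1;
end

deg = sum(cooc,2);         % total edge weight per token
freq = diag(cooc);         % occurrences per token (under the assumption above)
tokenScore = deg ./ freq;  % deg(token) / freq(token)

% Score of a candidate keyword is the sum of its token scores.
keywordScore = zeros(numel(candidates),1);
for i = 1:numel(candidates)
    [~,idx] = ismember(candidates{i},tokens);
    keywordScore(i) = sum(tokenScore(idx));
end

% keywordScore(1) is 8, matching "MATLAB provides tools" in the example.
```

    In this sketch, "MATLAB" scores 4/2 = 2 and "scientists" and "engineers" each score 2/2 = 1. Because the pair "scientists" and "engineers" occurs twice with the same merging delimiter "and", step 3 merges them and sums the scores, which gives the score of 2 shown for "scientists and engineers" in the example table.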

    Language Details

    tokenizedDocument objects contain details about the tokens including language details. The language details of the input documents determine the behavior of rakeKeywords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the Language option of tokenizedDocument. To view the token details, use the tokenDetails function.
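
    For example, the following sketch sets the language explicitly and then inspects the token details before extracting keywords. The sentence comes from the examples above; the variable name tdetails is illustrative.

```matlab
str = "MATLAB provides tools for scientists and engineers.";

% Set the language explicitly instead of relying on automatic detection.
documents = tokenizedDocument(str,Language="en");

% Inspect the token details, which include the language of each token.
tdetails = tokenDetails(documents);
head(tdetails)

% The language details of documents determine the default stop words.
tbl = rakeKeywords(documents);
```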

    References

    [1] Rose, Stuart, Dave Engel, Nick Cramer, and Wendy Cowley. "Automatic keyword extraction from individual documents." Text mining: applications and theory 1 (2010): 1-20.

    Version History

    Introduced in R2020b