encode

Tokenize and encode text for BERT model

Since R2023b

    Description

    [tokenCodes,segments] = encode(tokenizer,str) tokenizes and encodes the text in str using the specified Bidirectional Encoder Representations from Transformers (BERT) tokenizer and returns the token codes and segments. This syntax automatically adds padding, start, unknown, and separator tokens to the input.

    [tokenCodes,segments] = encode(tokenizer,str1,str2) tokenizes and encodes the sentence pair str1,str2. This syntax automatically adds padding, start, unknown, and separator tokens to the input.

    ___ = encode(___,AddSpecialTokens=tf) specifies whether to add padding, start, unknown, and separator tokens to the input.

    Examples

    Load a pretrained BERT-Base neural network and corresponding tokenizer using the bert function.

    [net,tokenizer] = bert;

    View the tokenizer.

    tokenizer
    tokenizer = 
      bertTokenizer with properties:
    
            IgnoreCase: 1
          StripAccents: 1
          PaddingToken: "[PAD]"
           PaddingCode: 1
            StartToken: "[CLS]"
             StartCode: 102
          UnknownToken: "[UNK]"
           UnknownCode: 101
        SeparatorToken: "[SEP]"
         SeparatorCode: 103
           ContextSize: 512
    
    

    Tokenize and encode the text "Bidirectional Encoder Representations from Transformers" using the encode function.

    str = "Bidirectional Encoder Representations from Transformers";
    [tokenCodes,segments] = encode(tokenizer,str);

    View the token codes.

    tokenCodes
    tokenCodes = 1x1 cell array
        {[102 7227 7443 7543 2390 4373 16045 2100 15067 2014 19082 103]}
    
    

    View the segments.

    segments
    segments = 1x1 cell array
        {[1 1 1 1 1 1 1 1 1 1 1 1]}
    
    

    Input Arguments

    tokenizer: BERT tokenizer, specified as a bertTokenizer object.

    str: Input text, specified as a string array, character vector, or cell array of character vectors.

    Example: ["An example of a short sentence."; "A second short sentence."]

    Data Types: string | char | cell

    str1,str2: Input sentence pairs, specified as string arrays, character vectors, or cell arrays of character vectors of the same size.

    If you specify str1 and str2, then the function returns concatenated token codes. To identify which input corresponds to each token code, use the segments output argument.
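
    As a rough illustration of the sentence-pair syntax, the following sketch encodes one pair and uses segments to split the concatenated token codes by input. The variable names codesFromFirst and codesFromOther are hypothetical. The sketch assumes that the first input is labeled with segment index 1, which is consistent with the single-input case, where every segment index is 1.

```matlab
% Load the pretrained tokenizer; the network output is not needed here.
[~,tokenizer] = bert;

str1 = "An example of a short sentence.";
str2 = "A second short sentence.";
[tokenCodes,segments] = encode(tokenizer,str1,str2);

% One sentence pair gives one cell element in each output.
codes = tokenCodes{1};
seg = segments{1};

% Split the concatenated codes by originating input
% (assumption: segment index 1 marks tokens from str1).
codesFromFirst = codes(seg == 1);
codesFromOther = codes(seg ~= 1);
```

    Special tokens that the function adds are also assigned a segment index, so the two parts can include codes such as the start and separator codes.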

    Data Types: char | string | cell

    AddSpecialTokens: Flag to add padding, start, unknown, and separator tokens to input, specified as 1 (true) or 0 (false).
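
    A minimal sketch of the effect of this flag, reusing the text from the Examples section. The variable names withSpecial and withoutSpecial are hypothetical. With the flag set to false, the result is expected to omit the start and separator codes that enclose the default output, though this sketch only compares lengths and does not assert the exact codes.

```matlab
% Load the pretrained tokenizer; the network output is not needed here.
[~,tokenizer] = bert;
str = "Bidirectional Encoder Representations from Transformers";

% Default behavior: special tokens are added.
withSpecial = encode(tokenizer,str);

% Same text, encoded without special tokens.
withoutSpecial = encode(tokenizer,str,AddSpecialTokens=false);

% Compare the number of token codes in each encoding.
disp(numel(withSpecial{1}))
disp(numel(withoutSpecial{1}))
```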

    Output Arguments

    tokenCodes: Token codes, returned as a cell array of vectors of positive integers. The token codes index into the tokenizer vocabulary.

    Data Types: cell

    segments: Segment indices, returned as a cell array of vectors of positive integers.

    The segments indicate which input each token code comes from. For each element s of segments and the corresponding element t of tokenCodes, the value s(i) indicates which input t(i) corresponds to. If you specify a single string as input, then each element of segments is an array of ones.

    Data Types: cell

    Algorithms

    WordPiece Tokenization

    The WordPiece tokenization algorithm [2] splits words into subword units and maps common sequences of characters and subwords to a single integer. During tokenization, the algorithm replaces out-of-vocabulary (OOV) words with subword counterparts, which allows the model to handle unseen words more effectively. This process creates a set of subword tokens that can better represent common and rare words.

    These steps outline how to create a WordPiece tokenizer:

    1. Initialize vocabulary — Create an initial vocabulary of the unique characters in the data.

    2. Count token frequencies — Iterate through the training data and count the frequencies of each token in the vocabulary.

    3. Merge most frequent pairs — Identify the most frequent pair of tokens in the vocabulary and merge them into a single token. Update the vocabulary accordingly.

    4. Repeat counting and merging — Repeat the counting and merging steps until the vocabulary reaches a predefined size or until tokens can no longer merge.
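
    The four steps above can be sketched on a toy corpus as follows. This is an illustrative sketch of the frequency-based merging as listed, not the procedure used to build the pretrained BERT vocabulary. The corpus, the word counts, and the numMerges stopping criterion are all hypothetical.

```matlab
% Toy corpus (hypothetical): each word with its number of occurrences.
words = ["low" "lower" "lowest" "new"];
counts = [5 2 3 4];
numMerges = 3;   % stopping criterion chosen for illustration

% Step 1: split each word into single-character tokens and collect
% the unique characters as the initial vocabulary.
splits = cell(size(words));
for k = 1:numel(words)
    splits{k} = string(num2cell(char(words(k))));
end
vocab = unique([splits{:}]);

for m = 1:numMerges
    % Step 2: count how often each adjacent token pair occurs.
    pairCounts = containers.Map('KeyType','char','ValueType','double');
    for k = 1:numel(splits)
        s = splits{k};
        for i = 1:numel(s)-1
            key = char(s(i) + " " + s(i+1));
            if isKey(pairCounts,key)
                pairCounts(key) = pairCounts(key) + counts(k);
            else
                pairCounts(key) = counts(k);
            end
        end
    end
    if pairCounts.Count == 0
        break   % no pairs left to merge
    end

    % Step 3: merge the most frequent pair into a single token.
    pairKeys = keys(pairCounts);
    [~,best] = max(cell2mat(values(pairCounts)));
    parts = split(string(pairKeys{best})," ");
    merged = parts(1) + parts(2);
    vocab(end+1) = merged;

    % Apply the merge to every word before counting again (step 4).
    for k = 1:numel(splits)
        s = splits{k};
        i = 1;
        while i < numel(s)
            if s(i) == parts(1) && s(i+1) == parts(2)
                s = [s(1:i-1) merged s(i+2:end)];
            end
            i = i + 1;
        end
        splits{k} = s;
    end
end

disp(vocab)
```

    On this corpus, the three merges add the tokens "lo", "low", and "lowe" to the single-character vocabulary. Ties between equally frequent pairs are broken by the sorted key order of the map, which is an arbitrary choice for this sketch.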

    These steps outline how a WordPiece tokenizer tokenizes new text:

    1. Split text — Split text into individual words.

    2. Identify OOV words — Identify any OOV words that are not present in the pretrained vocabulary.

    3. Replace OOV words — Replace the OOV words with their subword counterparts from the vocabulary, for example, by iteratively checking which vocabulary tokens the remaining part of the OOV word starts with.
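
    The replacement step can be sketched as a greedy longest-match search over one word. The vocabulary here is a hypothetical toy example, and the "##" prefix for pieces that continue a word follows the convention of the original BERT vocabularies. Whether the encode function proceeds in exactly this way is an assumption.

```matlab
% Toy vocabulary (hypothetical). "##" marks a piece that continues a word.
vocab = ["un" "##aff" "##able" "[UNK]"];
word = "unaffable";

chars = char(word);
n = numel(chars);
pieces = strings(1,0);
startIdx = 1;

while startIdx <= n
    match = "";
    % Try the longest remaining substring first, then shrink it
    % until a candidate is found in the vocabulary.
    for endIdx = n:-1:startIdx
        candidate = string(chars(startIdx:endIdx));
        if startIdx > 1
            candidate = "##" + candidate;
        end
        if ismember(candidate,vocab)
            match = candidate;
            break
        end
    end
    if match == ""
        % No vocabulary piece fits, so map the whole word to the unknown token.
        pieces = "[UNK]";
        break
    end
    pieces(end+1) = match;
    startIdx = endIdx + 1;   % continue after the matched piece
end

disp(pieces)
```

    For this input, the word splits into the pieces "un", "##aff", and "##able". A word with no matching pieces, such as "xyz" with this vocabulary, maps to "[UNK]".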

    References

    [1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." Preprint, submitted May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.

    [2] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun et al. "Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation." Preprint, submitted October 8, 2016. https://doi.org/10.48550/arXiv.1609.08144.

    Version History

    Introduced in R2023b