decode

Decode BERT token codes

Since R2023b

    Description

    str = decode(tokenizer,tokenCodes) decodes the specified token codes using the BERT tokenizer object tokenizer.

    Examples

    Load a pretrained BERT-Base neural network and corresponding tokenizer using the bert function.

    [net,tokenizer] = bert;

    View the tokenizer.

    tokenizer
    tokenizer = 
      bertTokenizer with properties:
    
            IgnoreCase: 1
          StripAccents: 1
          PaddingToken: "[PAD]"
           PaddingCode: 1
            StartToken: "[CLS]"
             StartCode: 102
          UnknownToken: "[UNK]"
           UnknownCode: 101
        SeparatorToken: "[SEP]"
         SeparatorCode: 103
           ContextSize: 512
    
    

    Decode an array of token codes using the decode function.

    tokenCodes = [102 7227 7443 7543 2390 4373 16045 2100 15067 2014 19082 103];
    str = decode(tokenizer,tokenCodes)
    str = 
    "[CLS] bidirectional encoder representations from transformers [SEP]"
    
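    Decoded text includes special tokens such as "[CLS]" and "[SEP]". As a minimal sketch, assuming you want only the content tokens, you can filter the special codes out before decoding. The code reads the special codes from the tokenizer properties shown above; the variable names special and contentCodes are illustrative.

```matlab
% Load the pretrained tokenizer (requires the BERT model to be available).
[~,tokenizer] = bert;

tokenCodes = [102 7227 7443 7543 2390 4373 16045 2100 15067 2014 19082 103];

% Collect the codes of the special tokens from the tokenizer properties.
special = [tokenizer.PaddingCode tokenizer.StartCode tokenizer.SeparatorCode];

% Keep only the codes that are not special tokens, then decode them.
contentCodes = tokenCodes(~ismember(tokenCodes,special));
str = decode(tokenizer,contentCodes)
```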

    Input Arguments

    tokenizer: BERT tokenizer, specified as a bertTokenizer object.

    tokenCodes: Token codes, specified as a vector of positive integers.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    Output Arguments

    str: Decoded tokens, returned as a string array.

    Algorithms

    WordPiece Tokenization

    The WordPiece tokenization algorithm [2] splits words into subword units and maps common sequences of characters and subwords to a single integer. During tokenization, the algorithm replaces out-of-vocabulary (OOV) words with subword counterparts, which allows the model to handle unseen words more effectively. This process creates a set of subword tokens that can better represent common and rare words.

    These steps outline how to create a WordPiece tokenizer:

    1. Initialize vocabulary — Create an initial vocabulary of the unique characters in the data.

    2. Count token frequencies — Iterate through the training data and count the frequencies of each token in the vocabulary.

    3. Merge most frequent pairs — Identify the most frequent pair of tokens in the vocabulary and merge them into a single token. Update the vocabulary accordingly.

    4. Repeat counting and merging — Repeat the counting and merging steps until the vocabulary reaches a predefined size or until tokens can no longer merge.
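    The vocabulary-building steps above can be sketched in base MATLAB as follows. This is an illustration of the frequency-based merging described here, not the code used to train the BERT vocabulary; production trainers may score candidate pairs differently. The toy corpus, the variable names, and the stopping size maxVocabSize are all assumptions.

```matlab
% Toy corpus (hypothetical). Each word starts as a row of single-character tokens.
corpus = ["low" "lower" "lowest" "slow"];
words = cell(size(corpus));
for i = 1:numel(corpus)
    words{i} = string(num2cell(char(corpus(i))));   % for example, ["l" "o" "w"]
end

vocab = unique([words{:}]);   % Step 1: initial vocabulary of unique characters
maxVocabSize = 10;            % Hypothetical stopping size

while numel(vocab) < maxVocabSize
    % Step 2: collect every adjacent token pair as a key of the form "left right".
    keys = strings(1,0);
    for i = 1:numel(words)
        w = words{i};
        keys = [keys, w(1:end-1) + " " + w(2:end)]; %#ok<AGROW>
    end
    if isempty(keys)
        break                 % Every word is a single token, so nothing can merge
    end

    % Step 3: find the most frequent pair and merge it into one token.
    [uniqueKeys,~,idx] = unique(keys);
    counts = accumarray(idx(:),1);
    [~,best] = max(counts);
    pair = split(uniqueKeys(best)," ");
    merged = pair(1) + pair(2);

    for i = 1:numel(words)
        w = words{i};
        k = 1;
        while k < numel(w)
            if w(k) == pair(1) && w(k+1) == pair(2)
                w(k) = merged;    % Replace the pair with the merged token
                w(k+1) = [];
            end
            k = k + 1;
        end
        words{i} = w;
    end

    % Step 4: update the vocabulary and repeat.
    vocab = union(vocab,merged,"stable");
end
disp(vocab)
```

    With this corpus, the early merges build up the shared stem "low", which is why it ends up in the vocabulary as a single token.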

    These steps outline how a WordPiece tokenizer tokenizes new text:

    1. Split text — Split text into individual words.

    2. Identify OOV words — Identify any OOV words that are not present in the pretrained vocabulary.

    3. Replace OOV words — Replace the OOV words with their subword counterparts from the vocabulary. One approach is to iteratively check whether the remaining part of an OOV word starts with a vocabulary token.
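    The tokenization steps above can be sketched as a greedy longest-match search in base MATLAB. This is a sketch under stated assumptions, not the bertTokenizer implementation: the toy vocabulary is hypothetical, and the "##" prefix for pieces that continue a word follows the convention commonly used with BERT vocabularies.

```matlab
% Toy vocabulary (hypothetical). "##" marks a piece that continues a word.
vocab = ["[UNK]" "from" "trans" "##form" "##er" "##s"];

text = "from transformers xyz";
words = split(text)';          % Step 1: split text into words (row vector)

tokens = strings(1,0);
for word = words
    chars = char(word);
    pieces = strings(1,0);
    first = 1;                 % Start index of the unmatched remainder
    while first <= numel(chars)
        last = numel(chars);
        match = "";
        % Steps 2 and 3: find the longest vocabulary entry that the
        % remainder of the word starts with.
        while last >= first
            candidate = string(chars(first:last));
            if first > 1
                candidate = "##" + candidate;
            end
            if ismember(candidate,vocab)
                match = candidate;
                break
            end
            last = last - 1;
        end
        if match == ""
            pieces = "[UNK]";  % No piece matches, so the whole word is unknown
            break
        end
        pieces(end+1) = match; %#ok<SAGROW>
        first = last + 1;
    end
    tokens = [tokens pieces]; %#ok<AGROW>
end
disp(tokens)
```

    In this sketch, "transformers" is not in the vocabulary, so it is split into the pieces "trans", "##form", "##er", and "##s", while "xyz" has no matching pieces and maps to "[UNK]".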

    References

    [1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." Preprint, submitted May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.

    [2] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun et al. "Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation." Preprint, submitted October 8, 2016. https://doi.org/10.48550/arXiv.1609.08144.

    Version History

    Introduced in R2023b