encodeTokens

Convert tokens to token codes

Since R2023b

Syntax

[tokenCodes,segments] = encodeTokens(tokenizer,tokens)

[tokenCodes,segments] = encodeTokens(tokenizer,tokens1,tokens2)

[tokenCodes,segments,idx] = encodeTokens(___)

___ = encodeTokens(___,AddSpecialTokens=tf)

Description

[tokenCodes,segments] = encodeTokens(tokenizer,tokens) encodes tokens using the specified tokenizer and returns the token codes and segments. This syntax automatically adds special tokens to the input.

[tokenCodes,segments] = encodeTokens(tokenizer,tokens1,tokens2) encodes the sentence pair tokens1,tokens2. This syntax automatically adds special tokens to the input.

[tokenCodes,segments,idx] = encodeTokens(___) also returns the mapping between the input and the encoded output.

___ = encodeTokens(___,AddSpecialTokens=tf) specifies whether to add special tokens to the input.

Examples

Encode Tokens for BERT Model

Load a pretrained BERT-Base neural network and corresponding tokenizer using the bert function.

[net,tokenizer] = bert;

View the tokenizer.

tokenizer

tokenizer = 
  bertTokenizer with properties:

        IgnoreCase: 1
      StripAccents: 1
      PaddingToken: "[PAD]"
       PaddingCode: 1
        StartToken: "[CLS]"
         StartCode: 102
      UnknownToken: "[UNK]"
       UnknownCode: 101
    SeparatorToken: "[SEP]"
     SeparatorCode: 103
       ContextSize: 512
         NumTokens: 30522

Encode the tokens "Bidirectional", "Encoder", "Representations", "from", and "Transformers" using the encodeTokens function.

tokens = ["Bidirectional" "Encoder" "Representations" "from" "Transformers"];
[tokenCodes,segments] = encodeTokens(tokenizer,tokens);

View the token codes.

tokenCodes

tokenCodes = 1×1 cell array
    {[102 7227 7443 7543 2390 4373 16045 2100 15067 2014 19082 103]}

View the segments.

segments

segments = 1×1 cell array
    {[1 1 1 1 1 1 1 1 1 1 1 1]}
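
The sentence-pair syntax and the idx output can be exercised in the same way. The following is a minimal sketch that assumes Text Analytics Toolbox and the pretrained BERT-Base model are installed. The two token arrays are made-up inputs, and no output is shown because the exact codes depend on the tokenizer vocabulary.

```matlab
% Load only the tokenizer that accompanies the pretrained BERT-Base model.
[~,tokenizer] = bert;

% Two illustrative (made-up) token sequences that form a sentence pair.
tokens1 = ["Tokenizers" "split" "words"];
tokens2 = ["into" "subword" "units"];

% Encode the pair and also request the input-to-output mapping.
[tokenCodes,segments,idx] = encodeTokens(tokenizer,tokens1,tokens2);

% Each cell holds one encoded observation. The codes and segments in
% matching cells are expected to have the same length.
codes = tokenCodes{1};
segs = segments{1};
disp(numel(codes) == numel(segs))

% Inspect the mapping between the inputs and the encoded output.
disp(idx)
```

Comparing segs against codes element by element is a quick way to see which positions belong to each input.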

Input Arguments

`tokenizer` — Tokenizer
`bertTokenizer` object | `bpeTokenizer` object (since R2024a)

Tokenizer, specified as a bertTokenizer or bpeTokenizer object.

`tokens` — Input tokens
`tokenizedDocument` array | string array | cell array of character vectors

Input tokens, specified as a tokenizedDocument array, string array, or cell array of character vectors.

`tokens1,tokens2` — Input sentence pairs
`tokenizedDocument` arrays | string arrays | cell arrays of character vectors

Input sentence pairs, specified as tokenizedDocument arrays, string arrays, or cell arrays of character vectors.

`tf` — Flag to add padding, start, unknown, and separator tokens to input
`1` (`true`) (default) | `0` (`false`)

Flag to add padding, start, unknown, and separator tokens to input, specified as 1 (true) or 0 (false).
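
As a rough illustration of this option, the sketch below encodes the same tokens with and without special tokens and compares the lengths. It assumes Text Analytics Toolbox and the pretrained BERT-Base model are available. The variable names withSpecial and withoutSpecial are illustrative only.

```matlab
% Load the tokenizer that accompanies the pretrained BERT-Base model.
[~,tokenizer] = bert;

tokens = ["Bidirectional" "Encoder" "Representations" "from" "Transformers"];

% Default behavior: special tokens are added.
withSpecial = encodeTokens(tokenizer,tokens);

% Encode the same tokens without adding special tokens.
withoutSpecial = encodeTokens(tokenizer,tokens,AddSpecialTokens=false);

% Number of codes contributed by the special tokens.
numSpecial = numel(withSpecial{1}) - numel(withoutSpecial{1});
disp(numSpecial)
```

With a BERT tokenizer, the first result would be expected to begin with the StartCode value and end with the SeparatorCode value, as in the example output earlier on this page.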

Output Arguments

`tokenCodes` — Token codes
cell array of vectors of positive integers

Token codes, returned as a cell array of vectors of positive integers. The token codes index into the tokenizer vocabulary.

Data Types: cell

`segments` — Segment indices
cell array of vectors of positive integers

Segment indices, returned as a cell array of vectors of positive integers.

The segments indicate which input each token code comes from. For each element s of the cell array, the value s(i) indicates which input corresponds to the ith token code in the matching element of tokenCodes. If you specify a single input rather than a sentence pair, then each element of segments is an array of ones.

Data Types: cell
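
To make the segment indices concrete, the following self-contained sketch builds the layout by hand instead of calling encodeTokens. It assumes the sentence-pair layout described in the BERT paper [1] (start code, first input, separator, second input, separator) and assumes that the second input receives segment index 2. The per-token codes are placeholder values, not real vocabulary lookups.

```matlab
% Special codes taken from the bertTokenizer display in the example above.
startCode = 102;
separatorCode = 103;

% Placeholder token codes for two inputs (hypothetical values).
codesA = [2001 2002 2003];
codesB = [3001 3002];

% Assumed layout: [start A separator B separator].
tokenCodes = [startCode codesA separatorCode codesB separatorCode];

% Assumed segment indices: 1 for the start code, the first input, and its
% separator; 2 for the second input and the final separator.
segments = [ones(1,numel(codesA)+2) 2*ones(1,numel(codesB)+1)];

% Show the codes (first row) above their segment indices (second row).
disp([tokenCodes; segments])
```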

`idx` — Mapping between token codes and inputs
vector of positive integers

Mapping between the token codes and inputs, returned as a vector of positive integers.

The value idx(i) indicates which input token corresponds to tokenCodes(i). If tf is 1 (true) and tokenCodes(i) is the padding, start, unknown, or separator code of the tokenizer, then idx(i) is NaN.

Data Types: double

Algorithms

WordPiece Tokenization

The WordPiece tokenization algorithm [2] splits words into subword units and maps common sequences of characters and subwords to a single integer. During tokenization, the algorithm replaces out-of-vocabulary (OOV) words with subword counterparts, which allows models to handle unseen words more effectively. This process creates a set of subword tokens that can better represent common and rare words.

These steps outline how to create a WordPiece tokenizer:

1. Initialize vocabulary — Create an initial vocabulary of the unique characters in the data.
2. Count token frequencies — Iterate through the training data and count the frequencies of each token in the vocabulary.
3. Merge most frequent pairs — Identify the most frequent pair of tokens in the vocabulary and merge them into a single token. Update the vocabulary accordingly.
4. Repeat counting and merging — Repeat the counting and merging steps until the vocabulary reaches a predefined size or until tokens can no longer merge.

These steps outline how a WordPiece tokenizer tokenizes new text:

1. Split text — Split text into individual words.
2. Identify OOV words — Identify any words that are not present in the pretrained vocabulary.
3. Replace OOV words — Replace the OOV words with their subword counterparts from the vocabulary, for example, by iteratively checking whether each OOV token starts with a vocabulary token.
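
The replacement step can be sketched as a greedy longest-match-first search. The code below is an illustrative sketch only, not the implementation used by bertTokenizer. The toy vocabulary is made up, and the "##" prefix for continuation pieces is an assumption borrowed from the convention used in published BERT vocabularies.

```matlab
% Toy vocabulary (hypothetical). "##" marks a piece that continues a word.
vocab = ["trans" "##form" "##er" "##s" "token" "##ize" "[UNK]"];
word = "transformers";

pieces = strings(1,0);
chars = char(word);
startIdx = 1;
while startIdx <= numel(chars)
    match = "";
    % Try the longest remaining substring first, then shrink it.
    for endIdx = numel(chars):-1:startIdx
        candidate = string(chars(startIdx:endIdx));
        if startIdx > 1
            candidate = "##" + candidate;
        end
        if ismember(candidate,vocab)
            match = candidate;
            break
        end
    end
    if match == ""
        % No vocabulary entry fits: map the whole word to the unknown token.
        pieces = "[UNK]";
        break
    end
    pieces(end+1) = match;
    startIdx = endIdx + 1;
end

disp(pieces)
```

With this toy vocabulary, the word "transformers" splits into the pieces "trans", "##form", "##er", and "##s".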

Byte Pair Encoding

Byte pair encoding (BPE) is a tokenization algorithm that allows transformer networks to handle a wide range of vocabulary without assigning individual tokens for every possible word. During tokenization, the algorithm replaces out-of-vocabulary (OOV) words with subword counterparts, which allows models to handle unseen words more effectively. This process creates a set of subword tokens that can better represent common and rare words.

These steps outline the algorithm for training a BPE tokenizer:

1. Start with a corpus of text, for example, a corpus that includes phrases like "use byte pair encoding to tokenize text". Split the text data into words using a specified pretokenization algorithm.
2. Initialize a vocabulary of bytes. For example, start with a vocabulary of ["a" "b" "c" ... "z"]. For non-ASCII characters, like emojis that consist of multiple bytes, start with the byte values that comprise the character.
3. Encode each word in the text data as a sequence of bytes, and represent the words as sequences of integers that specify the indices of the tokens in the vocabulary. For example, represent the word "use" as [21 19 5]. When the encoding of a character is more than one byte, the resulting sequence of bytes can have more elements than the number of characters in the word.
4. Count the frequency of all adjacent pairs of bytes in the corpus. For example, among the words ["use" "byte" "pair" "encoding" "to" "tokenize" "text"], the token pairs ["t" "e"], ["e" "n"], and ["t" "o"] appear twice, and the remaining pairs appear once.
5. Identify the most frequent pair and add the corresponding merged token to the vocabulary. In the words represented as sequences of vocabulary indices, replace the corresponding pairs with the index of the new merged token in the vocabulary. Then, add this token pair to the merge list. For example, append the token pair ["t" "e"] to the merge list. Then, add the corresponding merged token "te" to the vocabulary so that it has the index 27. Finally, in the text data represented as vocabulary indices, replace the pairs of vocabulary indices [20 5] (which corresponds to ["t" "e"]) with the corresponding new vocabulary index:
   - The representation [2 25 20 5] for the word "byte" becomes [2 25 27].
   - The representation [20 5 24 20] for the word "text" becomes [27 24 20].
6. Repeat the frequency count and merge operations until you reach a specified number of iterations or vocabulary size. For example, repeating these steps several times leads to merging the pair ["b" "y"] to make the token "by", and then subsequently the pair ["by" "te"] to make the token "byte".
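
One iteration of the counting and merging steps can be reproduced with a short script. This is a simplified sketch under stated assumptions, not the training code behind bpeTokenizer: it treats each lowercase ASCII letter as a single byte, uses the seven example words as the whole corpus, and breaks ties between equally frequent pairs by taking the pair that appears first.

```matlab
% Example corpus, already split into words.
words = ["use" "byte" "pair" "encoding" "to" "tokenize" "text"];

% Initial vocabulary "a" to "z", so that "a" has index 1.
vocab = string(('a':'z').').';

% Encode each word as a sequence of vocabulary indices.
encoded = arrayfun(@(w) double(char(w)) - double('a') + 1, ...
    words,UniformOutput=false);

% Collect every adjacent pair of indices in the corpus.
pairs = zeros(0,2);
for k = 1:numel(encoded)
    seq = encoded{k};
    pairs = [pairs; seq(1:end-1).', seq(2:end).'];
end

% Count the pairs and pick the most frequent one (first one wins ties).
[uniquePairs,~,groupIdx] = unique(pairs,"rows","stable");
counts = accumarray(groupIdx,1);
[maxCount,best] = max(counts);
bestPair = uniquePairs(best,:);

% Add the merged token to the vocabulary and record the merge.
vocab(end+1) = vocab(bestPair(1)) + vocab(bestPair(2));
newIdx = numel(vocab);
mergeList = vocab(bestPair);

% Replace each occurrence of the pair with the index of the new token.
for k = 1:numel(encoded)
    seq = encoded{k};
    merged = [];
    i = 1;
    while i <= numel(seq)
        if i < numel(seq) && seq(i) == bestPair(1) && seq(i+1) == bestPair(2)
            merged(end+1) = newIdx;
            i = i + 2;
        else
            merged(end+1) = seq(i);
            i = i + 1;
        end
    end
    encoded{k} = merged;
end

disp(mergeList)
disp(encoded{2})
disp(encoded{7})
```

Under these assumptions, the script selects the pair ["t" "e"], assigns the merged token "te" the index 27, and rewrites "byte" and "text" as [2 25 27] and [27 24 20], which matches the worked example in the steps.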

These steps outline how a BPE tokenizer tokenizes new text:

1. Pretokenization — Split text into individual words.
2. Byte-encoding — Encode each word into a sequence of bytes.
3. Merge — Starting at the top of the merge list and progressing through it, iteratively apply each merge to pairs of tokens when possible.
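
The merge step for a single word can be sketched as follows. The merge list here is hypothetical and simply mirrors the training example above. The sketch also assumes ASCII input, so that each character corresponds to one byte. It is not the bpeTokenizer implementation.

```matlab
% Hypothetical merge list, in the order the merges were learned.
mergeList = ["t" "e"; "b" "y"; "by" "te"];

% Start from the individual characters of the word (one byte each for ASCII).
word = "bytes";
tokens = string(char(word).').';

% Apply each merge in order wherever the pair occurs.
for m = 1:size(mergeList,1)
    left = mergeList(m,1);
    right = mergeList(m,2);
    merged = strings(1,0);
    i = 1;
    while i <= numel(tokens)
        if i < numel(tokens) && tokens(i) == left && tokens(i+1) == right
            merged(end+1) = left + right;
            i = i + 2;
        else
            merged(end+1) = tokens(i);
            i = i + 1;
        end
    end
    tokens = merged;
end

disp(tokens)
```

For the word "bytes", the three merges produce the tokens "byte" and "s". The trailing "s" stays separate because no merge in the list involves it.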

References

[1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." Preprint, submitted May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.

[2] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun et al. "Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation." Preprint, submitted October 8, 2016. https://doi.org/10.48550/arXiv.1609.08144.

Version History

Introduced in R2023b
