subwordTokenize

Tokenize text into subwords using BERT tokenizer

Since R2023b

    Description

    subwords = subwordTokenize(tokenizer,str) tokenizes the text in str into subwords using the specified Bidirectional Encoder Representations from Transformers (BERT) tokenizer. This syntax automatically adds special tokens to the input.

    subwords = subwordTokenize(tokenizer,str1,str2) tokenizes the sentence pairs in str1 and str2 into subwords. This syntax automatically adds special tokens to the input.

    subwords = subwordTokenize(___,AddSpecialTokens=tf) specifies whether to add special tokens to the input.

    Examples


    Load a pretrained BERT-Base neural network and corresponding tokenizer using the bert function.

    [net,tokenizer] = bert;

    View the tokenizer.

    tokenizer
    tokenizer = 
      bertTokenizer with properties:
    
            IgnoreCase: 1
          StripAccents: 1
          PaddingToken: "[PAD]"
           PaddingCode: 1
            StartToken: "[CLS]"
             StartCode: 102
          UnknownToken: "[UNK]"
           UnknownCode: 101
        SeparatorToken: "[SEP]"
         SeparatorCode: 103
           ContextSize: 512
    
    

    Tokenize the text "Bidirectional Encoder Representations from Transformers" into subwords using the subwordTokenize function.

    str = "Bidirectional Encoder Representations from Transformers";
    subwords = subwordTokenize(tokenizer,str)
    subwords = 1x1 cell array
        {["[CLS]"    "bid"    "##ire"    "##ction"    "##al"    "en"    "##code"    "##r"    "representations"    "from"    "transformers"    "[SEP]"]}
    
    

    Input Arguments


    tokenizer — BERT tokenizer, specified as a bertTokenizer object.

    str — Input text, specified as a string array, character vector, or cell array of character vectors.

    Example: ["An example of a short sentence."; "A second short sentence."]

    Data Types: string | char | cell

    str1, str2 — Input sentence pairs, specified as string arrays, character vectors, or cell arrays of character vectors of the same size.

    If you specify str1 and str2, then the function returns concatenated tokenized subwords.

    Data Types: char | string | cell

    tf — Flag to add padding, start, unknown, and separator tokens to the input, specified as 1 (true) or 0 (false).

    Output Arguments


    subwords — Tokenized subwords, returned as a cell array of string arrays, where each element contains the subwords of the corresponding input text. (See the example above, which returns a 1x1 cell array.)

    Data Types: cell

    Algorithms


    WordPiece Tokenization

    The WordPiece tokenization algorithm [2] splits words into subword units and maps common sequences of characters and subwords to a single integer. During tokenization, the algorithm replaces out-of-vocabulary (OOV) words with subword counterparts, which allows the model to handle unseen words more effectively. This process creates a set of subword tokens that can better represent common and rare words.

    These steps outline how to create a WordPiece tokenizer:

    1. Initialize vocabulary — Create an initial vocabulary of the unique characters in the data.

    2. Count token frequencies — Iterate through the training data and count the frequencies of each token in the vocabulary.

    3. Merge most frequent pairs — Identify the most frequent pair of tokens in the vocabulary and merge them into a single token. Update the vocabulary accordingly.

    4. Repeat counting and merging — Repeat the counting and merging steps until the vocabulary reaches a predefined size or until tokens can no longer merge.
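
    The vocabulary-building steps above can be sketched as follows. This is a simplified, frequency-based merging scheme under the steps as stated; production WordPiece selects merges with a likelihood criterion and marks continuation pieces with "##", and buildWordPieceVocab is a hypothetical helper name, not a toolbox function.

    function vocab = buildWordPieceVocab(words,maxVocabSize)
        % Step 1: split each word into characters (initial vocabulary).
        seqs = cellfun(@(w) cellstr(w(:)),cellstr(words),UniformOutput=false);
        vocab = unique(vertcat(seqs{:}));

        while numel(vocab) < maxVocabSize
            % Step 2: count the frequency of each adjacent token pair.
            pairs = strings(0,1);
            for i = 1:numel(seqs)
                s = seqs{i};
                for j = 1:numel(s)-1
                    pairs(end+1,1) = s{j} + "|" + s{j+1}; %#ok<AGROW>
                end
            end
            if isempty(pairs)
                break   % nothing left to merge
            end
            [uniquePairs,~,idx] = unique(pairs);
            counts = accumarray(idx,1);

            % Step 3: merge the most frequent pair into a single token.
            [~,k] = max(counts);
            best = split(uniquePairs(k),"|");
            merged = best(1) + best(2);
            vocab = unique([vocab; cellstr(merged)]);
            for i = 1:numel(seqs)
                s = seqs{i};
                j = 1;
                while j < numel(s)
                    if s{j} == best(1) && s{j+1} == best(2)
                        s{j} = char(merged);
                        s(j+1) = [];
                    else
                        j = j + 1;
                    end
                end
                seqs{i} = s;
            end
            % Step 4: the loop repeats counting and merging until the
            % vocabulary reaches maxVocabSize or no pairs remain.
        end
    end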

    These steps outline how a WordPiece tokenizer tokenizes new text:

    1. Split text — Split text into individual words.

    2. Identify OOV words — Identify any OOV words that are not present in the pretrained vocabulary.

    3. Replace OOV words — Replace each OOV word with subword counterparts from the vocabulary, for example by repeatedly matching the longest prefix of the remaining characters that appears in the vocabulary.
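
    The longest-match-first replacement in step 3 can be sketched as follows. greedyWordPiece is a hypothetical helper for illustration, and vocab is assumed to be a string array in which continuation pieces carry the "##" prefix.

    function subwords = greedyWordPiece(word,vocab)
        subwords = strings(1,0);
        pos = 1;
        while pos <= strlength(word)
            matched = false;
            % Try the longest remaining substring first, then shrink.
            for len = strlength(word)-pos+1 : -1 : 1
                piece = extractBetween(word,pos,pos+len-1);
                if pos > 1
                    piece = "##" + piece;   % continuation piece
                end
                if any(vocab == piece)
                    subwords(end+1) = piece; %#ok<AGROW>
                    pos = pos + len;
                    matched = true;
                    break
                end
            end
            if ~matched
                subwords = "[UNK]";   % no vocabulary prefix matches
                return
            end
        end
    end

    For example, with vocab = ["trans" "##form" "##ers"], calling greedyWordPiece("transformers",vocab) returns ["trans" "##form" "##ers"].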

    References

    [1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." Preprint, submitted May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.

    [2] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun et al. "Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation." Preprint, submitted October 8, 2016. https://doi.org/10.48550/arXiv.1609.08144.

    Version History

    Introduced in R2023b