join

Combine multiple bag-of-words or bag-of-n-grams models

Syntax

newBag = join(bag)
newBag = join(bag,dim)

Description

example

newBag = join(bag) combines the elements in the array bag by merging the frequency counts. The function combines the elements along the first dimension not equal to 1.

newBag = join(bag,dim) combines the elements in the array bag along the dimension dim.

Examples

collapse all

Create an array of two bags-of-words models from tokenized documents.

str = [ ...
    "an example of a short sentence"
    "a second short sentence"];
documents = tokenizedDocument(str);
bag(1) = bagOfWords(documents(1));
bag(2) = bagOfWords(documents(2))
bag = 
  1x2 bagOfWords array with properties:

    Counts
    Vocabulary
    NumWords
    NumDocuments

Combine the bag-of-words models using join.

bag = join(bag)
bag = 
  bagOfWords with properties:

          Counts: [2x7 double]
      Vocabulary: [1x7 string]
        NumWords: 7
    NumDocuments: 2

If your text data is contained in multiple files in a folder, then you can import the text data and create a bag-of-words model in parallel using parfor. If you have Parallel Computing Toolbox™ installed, then the parfor loop runs in parallel, otherwise, it runs in serial. Use join to combine an array of bag-of-words models into one model.

Create a bag-of-words model from a collection of files. The examples sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Get a list of the files and their locations using dir.

fileLocation = fullfile(matlabroot,'examples','textanalytics','exampleSonnet*.txt');
fileInfo = dir(fileLocation)
fileInfo = 5x1 struct array with fields:
    name
    folder
    date
    bytes
    isdir
    datenum

Initialize an empty bag-of-words model and then loop over the files and create an array of bag-of-words models.

bag = bagOfWords;

numFiles = numel(fileInfo);
parfor i = 1:numFiles
    f = fileInfo(i);
    filename = fullfile(f.folder,f.name);
    
    textData = extractFileText(filename);
    document = tokenizedDocument(textData);
    bag(i) = bagOfWords(document);
end
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 12).

Combine the bag-of-words models using join.

bag = join(bag)
bag = 
  bagOfWords with properties:

          Counts: [5x3275 double]
      Vocabulary: [1x3275 string]
        NumWords: 3275
    NumDocuments: 5

Input Arguments

collapse all

Array of bag-of-words or bag-of-n-grams models, specified as a bagOfWords array or a bagOfNgrams array. If bag is a bagOfNgrams array, then each element to be joined must have the same value for the NgramLengths property.

Dimension along which to join models, specified as a positive integer. If dim is not specified, then the default is the first dimension with a size that does not equal 1.

Output Arguments

collapse all

Output model, returned as a bagOfWords object or a bagOfNgrams object. The type of newBag is the same as the type of bag. newBag has the same data type as the input model and has a size of 1 along the dimension being joined.

Introduced in R2018a