mode with categorical variables and parfor is slow

Hello everybody,
I don't understand why the below (sketched) code is slow.
Consider the following vector, whose elements may be repeated, and the vector of unique values associated with it:
potential_rep_idx = categorical(randi(N,N,1)); % N-by-1 vector (length N assumed here; randi(N,1) alone returns a single scalar)
unique_idx = unique(potential_rep_idx);
The purpose of the code is to build a table called "table_of_stuff" from the above vector and a table "table_other_stuff", which has several columns of various types (double, datetime, cell, string), as follows:
table_of_stuff = [array2table(potential_rep_idx), table_other_stuff]
and identify, for each element of unique_idx, all rows of table_of_stuff in which the element appears. Then, from all these rows, make one single row in which each element is the mode of the values in that column.
In other words:
% table_of_stuff: a long table with columns of various types (double, datetime, cell, string)
table_of_stuff = categorical(table_of_stuff); % (sketch) every column converted to categorical
parfor i = 1:numel(unique_idx)
    find_idx = find(potential_rep_idx == unique_idx(i));
    mode_table(i,:) = array2table(mode(table_of_stuff{find_idx,:}, 1));
end

Answers (1)

Raghav on 5 May 2023
Hi,
As I understand the question, your parfor loop is running slowly.
There are a few reasons why the code you provided may be slow:
  1. Using find on a logical comparison: In the line find_idx = find(potential_rep_idx == unique_idx(i)), the comparison scans the whole potential_rep_idx vector and creates a temporary logical vector in every iteration, and find then converts that vector to numeric indices. The logical vector can be used for indexing directly, so the find call is extra work, and the repeated full scans are slow and memory-intensive for large arrays.
  2. Calling mode inside a loop: mode is called once per unique value, which can be inefficient for large datasets. Vectorized or grouped operations are generally preferable to loops where possible.
  3. Creating a new table in each iteration of the loop: array2table builds a new table in every iteration. This can be slow and memory-intensive for large datasets.
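To make these three points concrete, here is a minimal sketch of a loop that keeps the structure of the original code but indexes with the logical mask directly, writes into a preallocated matrix, and builds the table only once. The data, the variable names key, data, and modes, and the column names A and B are hypothetical stand-ins, and the sketch assumes numeric columns only; it is not meant as the definitive fix for the original table.

```matlab
% Hypothetical stand-in data (not from the original post)
key  = categorical([3; 1; 3; 2; 1; 3]);          % plays the role of potential_rep_idx
data = [10 1; 20 5; 10 2; 30 7; 20 5; 40 2];     % two numeric columns, one row per key entry

unique_idx = unique(key);
K = numel(unique_idx);
modes = zeros(K, size(data, 2));                 % preallocate the result once

for i = 1:K
    mask = (key == unique_idx(i));               % logical mask; no find call needed
    modes(i, :) = mode(data(mask, :), 1);        % dim 1, so a single-row group still gives a row
end

% Build the table a single time, after the loop
mode_table = array2table(modes, 'VariableNames', {'A', 'B'});
```

Passing the dimension to mode matters when a group has only one row: without it, mode would treat that row as a vector and return a single value instead of one value per column.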
To improve the performance of the code, you can consider the following:
  1. Avoid calling find in every iteration: Instead of comparing the whole vector against one unique value per iteration, compute a group index for all rows once, before the loop. The second output of ismember(potential_rep_idx, unique_idx) gives, for each row, its position in unique_idx; the third output of unique returns the same index.
  2. Use grouped operations instead of a loop: findgroups assigns a group number to each row based on the values in potential_rep_idx, and splitapply then applies the mode function to each group and combines the results. This can be much more efficient than using a loop.
  3. Avoid creating a new table in each iteration of the loop: Instead, preallocate a matrix or cell array, fill it inside the loop, and convert it to a table once after the loop has finished.
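The grouped approach from the list above might look like the following sketch. The table T, its column names key, A, and B, and the sample values are all hypothetical, and only numeric data columns are used; columns of other types (datetime, cell, string) would likely need to be handled column by column.

```matlab
% Hypothetical stand-in table (not from the original post)
key = categorical([3; 1; 3; 2; 1; 3]);           % plays the role of potential_rep_idx
A   = [10; 20; 10; 30; 20; 40];
B   = [ 1;  5;  2;  7;  5;  2];
T   = table(key, A, B);

% One pass over the key column: G(r) is the group number of row r,
% unique_idx lists the key value that each group number stands for
[G, unique_idx] = findgroups(T.key);

% Apply mode to each group of rows; dim 1 keeps one value per column
% even when a group contains a single row
modes = splitapply(@(x) mode(x, 1), T{:, {'A', 'B'}}, G);

% Assemble the result table once
mode_table = [table(unique_idx), array2table(modes, 'VariableNames', {'A', 'B'})];
```

Here the grouping is computed once rather than once per unique value, and no table is created until all modes are available.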
Hope it helps,
Raghav Bansal
