Why is my GPU code faster with the profiler on when running on RTX GPUs?
I need to process large multidimensional arrays with a series of 1D convolutions. Because the kernel is very small, I found it faster to implement the convolution by hand in a for loop than to use conv. However, my code runs significantly faster when the profiler is on, but only on certain GPUs. It is consistently 1.5x to 2x faster on an Nvidia RTX 3080 or an Nvidia RTX 2070; on an Nvidia A4500 or Nvidia A5000, there is no significant difference. This matters because processing a single dataset can take hours.
This behavior is consistent across multiple computers, all running Linux (Ubuntu 22.04), and was tested with R2021a and R2022a and with Nvidia driver versions 515 and 520. My question is: how can I get the "fast" performance without having to embed profile on and profile off in the relevant parts of my code? I have actually done this, and processing an entire dataset is noticeably faster as a result, but it is hacky and will interfere with the expected use of the profiler in the rest of the code.
A minimal working example (MWE) follows. I place the run with the profiler on (the faster one) first, so that the second run cannot be the faster one merely because of the JIT or caching. I clear the large variables between runs to rule out memory allocation effects. I also use the results to calculate arrayMean, so that the JIT cannot optimize away (i.e., skip) operations whose results are unused. Interestingly, none of these three concerns matters in practice: the code runs consistently faster with the profiler on.
% Define common parms
clear
convSize = 3;
largeArraySizes = [40, 40, 40, 5000] + [1, 1, 1, 0] * (2 * convSize + 1);
% Run with profiler on. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single', 'gpuArray');
profile('on')
tic;
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
% Convolve manually in a for loop
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :, :, :) / (2 * convSize + 1);
end
largeArrayConv = gather(largeArrayConv);
timeProfOn = toc;
profile('off')
arrayMean = mean(largeArrayConv, 'all');
clear largeArray convKernel largeArrayConv arrayMean
fprintf('Proc time profiler ON: %g seconds.\n', timeProfOn)
% Run with profiler off. First, preallocate and create variables.
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single', 'gpuArray');
profile('off')
tic;
largeArrayConv = zeros(size(largeArray, 1), size(largeArray, 2), ...
    size(largeArray, 3) - 2 * convSize, size(largeArray, 4), 'like', largeArray);
% Convolve manually in a for loop
for thisShift = -convSize:convSize
    % Shifted index
    idx = convSize + (1:size(largeArray, 3) - 2 * convSize) + thisShift;
    % Sum over convolved index
    largeArrayConv = largeArrayConv + ...
        convKernel(convSize + 1 + thisShift) .* largeArray(:, :, idx, :, :, :) / (2 * convSize + 1);
end
largeArrayConv = gather(largeArrayConv);
timeProfOff = toc;
arrayMean = mean(largeArrayConv, 'all');
clear largeArray convKernel largeArrayConv arrayMean
fprintf('Proc time profiler OFF: %g seconds.\n', timeProfOff)
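As a cross-check on the tic/toc numbers above, the same loop could also be timed with gputimeit, which calls the function several times and synchronizes the GPU around each call. This is only a sketch under some assumptions: it needs Parallel Computing Toolbox, manualConv3 is a hypothetical helper name that does not appear in the code above, and gputimeit measures only the GPU computation, so the gather step included in the tic/toc timings is left out. It does not explain the profiler effect; it only gives a second measurement of the loop itself.

```matlab
% Sketch only: time the manual convolution with gputimeit instead of tic/toc.
convSize = 3;
largeArraySizes = [40, 40, 40, 5000] + [1, 1, 1, 0] * (2 * convSize + 1);
largeArray = ones(largeArraySizes, 'single', 'gpuArray');
convKernel = ones(2 * convSize + 1, 1, 'single', 'gpuArray');

% gputimeit handles GPU synchronization and repeats the call internally.
timeLoop = gputimeit(@() manualConv3(largeArray, convKernel, convSize));
fprintf('Manual convolution (gputimeit): %g seconds.\n', timeLoop)

function out = manualConv3(in, kernel, convSize)
% Hypothetical helper: same manual 1D convolution along dimension 3 as in
% the MWE, factored into a function so it can be passed to gputimeit.
nOut = size(in, 3) - 2 * convSize;
out = zeros(size(in, 1), size(in, 2), nOut, size(in, 4), 'like', in);
for thisShift = -convSize:convSize
    % Shifted index along dimension 3
    idx = convSize + (1:nOut) + thisShift;
    % Accumulate the weighted, shifted slab
    out = out + kernel(convSize + 1 + thisShift) .* in(:, :, idx, :) / (2 * convSize + 1);
end
end
```

Running this script once with profile('on') issued beforehand and once with profile('off') would show whether the gap also appears when the gather step is excluded from the measurement.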