Why is this loop faster than a vectorised version? Could the vectorised version be made faster than the loop?
Show older comments
I'm trying to improve performance in a code that uses a loop. I've written a vectorised version matching the functionality, while avoiding costly transposes. However, I've found that the loop version invariably runs ~25% more quickly. Is there any way to further improve the performance of the vectorised version so that it surpasses the loop?
Of course, this is a tiny sub-function of a much larger, more complex program, but it is called tens of thousands of times in a single run, and is a bottleneck in the run time.
I do have the parallel computing toolbox, so could look into using parfor loops, but these don't always save time, and I was surprised that the vectorised version doesn't perform better!
% Input vectors
x1 = rand(1, 960);
x2 = rand(1, 960);
%% Looped version
tic;
Y1 = 1.75:0.25:39;
Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
EoutLoop = zeros(length(x2), length(Y1));
for i=1:length(Y1)
p1 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1).*ones(1, length(x2));
p2 = 4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1) - 0.35.*(4.*Y2(i)./24.673.*(0.004368.*Y2(i) + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*(x2 - 51);
p3 = p1.*(x1 >= Y2(i)) + p2 .* (x1 < Y2(i));
g1 = (x1 - Y2(i))./Y2(i);
g1 = abs(g1);
EiLoop = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).* 10.^(x2./10);
EoutLoop((1:length(x2)), i) = EiLoop(1:length(x2));
end
if (size(EoutLoop,1) > 1)
EoutLoop = sum(EoutLoop);
end
EoutLoop = 10 .* log(EoutLoop) ./ log(10);
% end timer
toc;
%% Vectorised version
% transpose input vector for vectorised version
x2 = x2.';
x1 = x1.';
% start timer
tic;
Y1 = 1.75:0.25:39;
Y2 = (10 .^ (Y1 / 21.366) - 1 ) / 0.004368;
EoutVec = zeros(length(x2), length(Y1));
p1 = 4.*Y2./24.673.*(0.004368.*Y2 + 1).*ones(length(x2), length(Y2));
p2 = 4.*Y2./24.673.*(0.004368.*Y2 + 1) - 0.35.*(4.*Y2./24.673.*(0.004368.*Y2 + 1)./4.*1000./24.673.*(0.004368.*1000 + 1)).*repmat((x2 - 51), 1, length(Y2));
p3 = p1.*(x1 >= repmat(Y2, 1, size(x1, 2))) + p2 .* (x1 < repmat(Y2, 1, size(x1, 2)));
g1 = ((x1 - repmat(Y2, 1, size(x1, 2)))./repmat(Y2, 1, size(x1, 2)));
g1 = abs(g1);
EVec = (1 + p3.*min(g1,4)).*exp(-1.*p3.*min(g1,4)).*repmat(10.^(x2./10), 1, length(Y2));
EoutVec((1:length(x2)), :) = EVec(1:length(x2), :);
if (size(EoutVec,1) > 1)
EoutVec = sum(EoutVec);
end
EoutVec = 10.*log(EoutVec)./log(10);
% end timer
toc;
3 Comments
Edric Ellis
on 22 Nov 2023
On my machine (a fairly beefy ThreadRipper running Linux, with R2023b), I find the for loop version to be slower than the vectorised version by a factor of about 10... Sometimes vectorised versions can be slower because you need to make more large temporary variables.
Michael
on 22 Nov 2023
Alexander
on 22 Nov 2023
I agree. On my old Win7 machine (R2021b) the result is
Loop: Elapsed time is 0.316945 seconds.
Vectorised: Elapsed time is 0.062135 seconds.
Accepted Answer
More Answers (0)
Categories
Find more on Parallel for-Loops (parfor) in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!