# explanation of mysterious timing results?

1 view (last 30 days)
Jim on 16 Aug 2013
In some legacy code, there's a pair of for-loops, one nested in the other. I created two alternate versions, one that tries to optimize by reducing the number of inner loop operations, and one that's fully vectorized. I wrote a timing routine that iterates over all three versions 10,000 times:
nIter = 10000;
tm = zeros(1,3);
for itr = 1:nIter
% setup variables for all three versions
NM=18; NM1=NM+1; NM3=NM+3;NM4=NM+4;
T1=1; T2=2; T3=3;
i=NM3 : -1 : 1;
iv(i) = i .* (i-1) / 2;
[a b] = deal(zeros(1, NM3*NM4/2));
% V1: original code
tic;
for M = 0 : NM
M1 = M+1;
RM = M;
for N = M1 : NM
RN = N;
LL = iv(N+1) + M + 1;
a(LL) = -sqrt(((T2*RN+T1) * (T2*RN-T1)) / ...
((RN-RM) * (RN+RM)));
b(LL) = sqrt(((T2*RN+T1) * (RN+RM-T1) * (RN-RM-T1)) / ...
((T2*RN-T3) * (RN+RM) * (RN-RM) ) );
end
end
tm(1) = tm(1) + toc;
% V2: reduced inner loop ops
tic;
for M = 0 : NM
M1 = M+1;
for N = M1 : NM
rs = N+M;
rd = N-M;
rp = rs*rd;
rp1 = (rs-1) * (rd-1);
t2n = 2*N;
t2n1 = t2n+1;
LL = iv(N+1) + M1;
a(LL) = -sqrt(t2n1 * (t2n-1) / rp);
b(LL) = sqrt(t2n1 * rp1 / ((t2n-3) * rp));
end
end
tm(2) = tm(2) + toc;
% V3: vectorized code
tic;
sRow = 0 : NM;
s = sRow(ones(NM1,1),:);
t = s'; % same as [s,t] = meshgrid(0:NM)
ltMask = s < t; % mask for array lower triangle below main diag
rs = s + t;
rd = t - s;
rp = rs .* rd;
rp1 = (rs-1) .* (rd-1);
t2n = 2*tCol;
t2n1 = t2n+1;
i = iv(tCol+1) + s(ltMask)' + 1;
a(i) = -sqrt(t2n1 .* (t2n-1) ./ rp);
b(i) = sqrt(t2n1 .* rp1 ./ ((t2n-3) .* rp));
tm(3) = tm(3) + toc;
end % timing loop
descr = {'V1:orig';'V2:reduced ops';'V3:vectorized'};
for i = 1:3
disp([descr{i} ': ' num2str(tm(i)) ' sec for ' num2str(nIter) ...
' iterations']);
end
disp('Normalized times:');
[s loc] = sort(tm);
disp([repmat(char(9),3,1) char(descr(loc)) repmat(' ',3,1) ...
num2str((s./s(1))', '%0.2f')]);
The vectorized code is currently fastest, by about 2.5 times vs the original. What's very surprising is that the reduced operations version (V2) is reported to be way slower than the other two (by a factor of ~140 vs the vectorized). When I run the Profiler (after disabling all but one CPU), the main culprits are shown to be the two sqrt expressions, with the rp and rp1 computations also eating a good share of time.
The changes I made to get V2 from V1 were: # elimination of two unnecessary variables (RM and RN) # replacement of T1, T2, and T3 with their constant values # addition of some new temporary scalars (rs, rd, rp, rp1, t2n, & t2n1) to allow reduction of the floating-point operation count in the inner loop. (The number of multiplications dropped from 10 to 5, and additions/subractions from 12 to 7).
I'm aware there's some penalty for creating the new scalars, but a factor of 140 seems extreme. Can anyone suggest why these changes have such a high time cost?
##### 2 CommentsShowHide 1 older comment
Jim on 19 Aug 2013
Sorry about the typing errors, but they should be corrected now if you want to give it another go.