How to speed up matrix operations with lots of scalar constants involved ?
I have tried to transfer all arrays to the GPU, but it is much slower.
% Preallocate the working arrays directly on the GPU
dCnE=zeros(nr(jz),npmax(jz),'gpuArray');
dCnB=zeros(nr(jz),npmax(jz),'gpuArray');
dCnZ=zeros(nr(jz),npmax(jz)+1,'gpuArray');
Cn=zeros(nr(jz),ns,'gpuArray');
Cn2=zeros(nr(jz),ns,'gpuArray');
Cn0=Cnw(1:nr(jz),3:5:nsw-2); %Cnw is a gpu array.
CnE=zeros(nr(jz),npmax(jz),ns,'gpuArray');
CnB=zeros(nr(jz),npmax(jz),ns,'gpuArray');
CnZ=zeros(nr(jz),ns,npmax(jz)+1,'gpuArray');
CnZ(:,:,1)=Cnw(1:nr(jz),3:5:nsw-2);
CnZ=permute(CnZ,[1,3,2]);
I=I0*(exp(-((gpuArray.linspace(0,dr*(nr(jz)-1),nr(jz))/rz).^2)))'*exp(-((gpuArray.linspace(-t0,t0,nt+1)/tau).^2));
% I think dI, dCn, dCn2 and dCn0 are completely overwritten in one statement, so they don't need to be preallocated.
for jl=1:ns
for jt2=1:(nt+1)
dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt;
dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
-Cn2(:,jl)*(1-exp(-dt/tau2));
dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
if (npmax(jz)>nphth) %The most time-consuming part
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+...
vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
CnZ(:,:,jl)=CnZ(:,:,jl)+dCnZ;
CnB(:,:,jl)=CnB(:,:,jl)+dCnB;
CnE(:,:,jl)=CnE(:,:,jl)+dCnE;
end
I(:,jt2)=I(:,jt2)+dI;
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
Cn(:,jl)=Cn(:,jl)+dCn;
Cn0(:,jl)=Cn0(:,jl)+dCn0;
Cn2(:,jl)=Cn2(:,jl)+dCn2;
non0=Cn(:,jl);
non0(non0<0)=0;
Cn(:,jl)=non0;
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
end
end
Compared to my other accelerated code, I think the difference is that many more scalar constants are involved in this case, so I transferred them to the GPU too. But it is still slower than the version without GPU computing. Maybe the reason is that nr(jz) only varies from 23 to 171, which is too small. Is my thinking correct? Is there any way to accelerate this script?
Accepted Answer
Jan on 20 May 2018 (edited 20 May 2018)
If you have a modern MATLAB version (R2016b or newer), use auto-expansion (implicit expansion) instead of an explicit repmat:
% Ipc = repmat(I(:,jt2),[1,npmax(jz)-1]);
Ipc = I(:,jt2);
... CnE(:,1:npmax(jz)-1,jl)) .* Ipc * dt / hb
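For a self-contained illustration of the same idea (the sizes and variable names below are invented for the example, not taken from the original code):
A = rand(100, 50);              % stand-in for a slice like CnZ(:,2:npmax(jz),jl)
v = rand(100, 1);               % stand-in for the column I(:,jt2)
B1 = A .* repmat(v, 1, 50);     % explicit inflation with repmat
B2 = A .* v;                    % implicit expansion (R2016b and newer), same result
isequal(B1, B2)                 % returns true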
I guess that dt and hb are scalars (please do not let the readers guess!). Then remember that MATLAB evaluates the code from left to right:
Y = X * dt / hb
is equivalent to:
Temp = X * dt;
Y = Temp / hb;
These are 2 matrix operations. But you can combine the scalar operations:
Y = X * (dt / hb)
This is 1 matrix operation and a cheap scalar division.
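To see the difference, here is a minimal timing sketch (the matrix size and the scalar values are placeholders chosen just for this example):
X  = rand(2000);                    % some large matrix
dt = 1e-3;  hb = 0.7;               % hypothetical scalar constants
t1 = timeit(@() X * dt / hb);       % two full passes over X
t2 = timeit(@() X * (dt / hb));     % dt/hb is folded into one scalar first
fprintf('left-to-right: %.3g s, combined: %.3g s\n', t1, t2);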
As Matt J has explained already, operating on subarrays as in dCnE(:,2:npmax(jz)) is expensive. Better crop the data beforehand, such that you can omit the ugly indexing.
I cannot run it on my own, because you did not provide the input values, but compare the two versions visually:
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+ ...
vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
or:
c1 = 1 - exp(-dt / tau2); % Calculate constants outside the loops!
Ipc = I(:,jt2);
dCnE = (sigma0 * CnZ(:, :, jl) - sigma * CnE(:, :, jl) + ...
vs2 * sigma * CnE(:, :, jl)) .* (Ipc * (dt / hb)) + ...
CnB(:,:,jl) * c1 + ...
(vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
CnE(:, :, jl) * (dt / tau1);
This can be simplified a little bit for the CnE term:
dCnE = (sigma0 * CnZ(:, :, jl) + ...
       ((vs2 - 1) * sigma) * CnE(:, :, jl)) .* (Ipc * (dt / hb)) + ...
       CnB(:,:,jl) * c1 + ...
       (vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
       CnE(:, :, jl) * (dt / tau1);
The rewritten version also includes another idea: you inflate I(:,jt2) at first by Ipc = repmat(...) and then waste time cropping out the subvector Ipc(:,1) later on. Better create the subvector once only: Ipc = I(:,jt2).
Methods:
- Combine the scalar operations to reduce the number of matrix operations.
- Move constants out of the loop.
- Crop the data outside the loops instead of applying an a:b indexing in each iteration.
- Use auto-expanding instead of inflating by repmat. bsxfun is sometimes faster or slower than auto-expanding - try it.
- Avoid repeated calculations. In your case e.g. Ipc(:,1)*dt/hb is computed several times. Computing it once avoids this:
C1 = I(:, jt2) * (dt/hb);
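A sketch of how the hoisted constants and the once-computed factors fit together (the numeric values below are placeholders so the snippet runs on its own; only the structure matters):
% placeholder values for the sketch:
dt = 1e-3; hb = 0.7; tau1 = 2e-3; tau2 = 5e-4; beta0 = 0.1;
I = rand(100, 11); jt2 = 3;
% hoist the constants out of the loops:
c1 = 1 - exp(-dt / tau2);    % multiplies the CnB terms
k1 = dt / hb;
k2 = beta0 * dt / (2 * hb);
k3 = dt / tau1;              % multiplies the CnE/Cn terms
% inside the jt2 loop, build each repeated factor once and reuse it:
Ipc   = I(:, jt2);
IpcK  = Ipc * k1;            % reused wherever ".* Ipc * dt/hb" appeared
Ipc2K = (Ipc .^ 2) * k2;     % reused wherever ".* (Ipc.^2) * dt/(2*hb)" appeared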
Finally: Replace
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
by:
Cn0 = max(Cn0, 0);
and move it outside the loops, because you do not use the cleaned values inside the loop.
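If it should turn out that the clamped values are needed inside the loop after all, the three-line non0 pattern can still be collapsed into a single call on the indexed column (a sketch with a made-up array):
Cn = randn(100, 8);  jl = 1;
Cn(:, jl) = max(Cn(:, jl), 0);   % replaces the non0 temporary in one step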
More Answers (1)
Matt J on 20 May 2018 (edited 20 May 2018)
Statements like 2:npmax(jz), which construct index arrays on the right-hand side, have a cost, and you incur that cost repeatedly by re-executing them unnecessarily.
Instead of this,
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+...
Do things like
indices=2:npmax(jz); %pre-compute
dCnE(:,indices)=(sigma0*CnZ(:,indices,jl)-sigma*CnE(:,indices,jl)+...
Extracting sub-arrays like CnZ(:,2:npmax(jz),jl) is similarly costly. Use temporary variables to avoid this as well, e.g.
tmp = CnZ(:,2:npmax(jz),jl);
dCnZ(:,2:npmax(jz)) = -sigma0*tmp.*Ipc*dt/hb - (beta0*tmp.*Ipc.^2)*dt/(2*hb) + CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
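A self-contained sketch of both tips together (the sizes, scalar values, and the names idx/tmp are invented for the example):
CnZ = rand(100, 40, 8);  Ipc = rand(100, 39);   % made-up stand-ins for the real arrays
jl = 1;  npm = 40;  sigma0 = 0.2;  beta0 = 0.1; % placeholder scalars
idx = 2:npm;                                    % build the index vector once
tmp = CnZ(:, idx, jl);                          % extract the subarray once
out = -sigma0 * tmp .* Ipc - beta0 * (tmp .* Ipc.^2);  % reuse idx and tmp in every term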
Finally, repmat should be unnecessary if you are using MATLAB R2016b or newer.
4 Comments
Matt J on 21 May 2018
I don't see any temporary arrays in what you posted. It looks identical to your original code.