# How to speed up matrix operations with lots of scalar constants involved ?

1 view (last 30 days)
ben huang on 20 May 2018
Edited: ben huang on 21 May 2018
I have tried to transfer all arrays to gpu, but it's much slower.
dCnE=zeros(nr(jz),npmax(jz),'gpuArray');
dCnB=zeros(nr(jz),npmax(jz),'gpuArray');
dCnZ=zeros(nr(jz),npmax(jz)+1,'gpuArray');
Cn=zeros(nr(jz),ns,'gpuArray');
Cn2=zeros(nr(jz),ns,'gpuArray');
Cn0=Cnw(1:nr(jz),3:5:nsw-2); %Cnw is a gpu array.
CnE=zeros(nr(jz),npmax(jz),ns,'gpuArray');
CnB=zeros(nr(jz),npmax(jz),ns,'gpuArray');
CnZ=zeros(nr(jz),ns,npmax(jz)+1,'gpuArray');
CnZ(:,:,1)=Cnw(1:nr(jz),3:5:nsw-2);
CnZ=permute(CnZ,[1,3,2]);
I=I0*(exp(-((gpuArray.linspace(0,dr*(nr(jz)-1),nr(jz))/rz).^2)))'*exp(-((gpuArray.linspace(-t0,t0,nt+1)/tau).^2));
% i think dI,dCn,dCn2,dCn0 will be totally overwrited in one command, so they don't need to be preallocated.
for jl=1:ns
for jt2=1:(nt+1)
dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt;
dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
-Cn2(:,jl)*(1-exp(-dt/tau2));
dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
if (npmax(jz)>nphth) %The most time-consuming part
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+...
vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
CnZ(:,:,jl)=CnZ(:,:,jl)+dCnZ;
CnB(:,:,jl)=CnB(:,:,jl)+dCnB;
CnE(:,:,jl)=CnE(:,:,jl)+dCnE;
end
I(:,jt2)=I(:,jt2)+dI;
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
Cn(:,jl)=Cn(:,jl)+dCn;
Cn0(:,jl)=Cn0(:,jl)+dCn0;
Cn2(:,jl)=Cn2(:,jl)+dCn2;
non0=Cn(:,jl);
non0(non0<0)=0;
Cn(:,jl)=non0;
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
end
end
Comparing to my other accelerated code, I think the reason is that there are much more scalar constants involved in this case, so I transfer them to gpu too. But it's still slower than the one without gpu computing. Maybe the reason is that nr(jz) varies from 23~171 which is too small. Is my thought correct? Is there any way of accelerating this script ?

Jan on 20 May 2018
Edited: Jan on 20 May 2018
If you have a modern Matlab version >= 2016b, use the auto expanding instead of an explicit repmat:
% Ipc = repmat(I(:,jt2),[1,npmax(jz)-1]);
Ipc = I(:,jt2);
... CnE(:,1:npmax(jz)-1,jl)) .* Ipc * dt / hb
I guess, that dt and hb are scalars (please do not let the readers guess!). Then remember, that Matlab evaluate the code from left to right:
Y = X * dt / hb
is equivalent to:
Temp = X * dt;
Y = Temp / hb;
These are 2 matrix operations. But you can combine the scalar operations:
Y = X * (dt / hb)
This is 1 matrix operation and a cheap scalar division.
As Matt J has explained already, operating on subarrays as in |dCnE(:,2:npmax(jz)) is expensive. Better crop the data, such that you can omit the ugly indexing.
I cannot run it by my own, because you did not provide the input values, but compare it optically:
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+ ...
vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
or:
c1 = 1 - exp(-dt / tau2); % Calculate constants outside the loops!
Ipc = I(:,jt2);
dCnE = (sigma0 * CnZ(:, :, jl) - sigma * CnE(:, :, jl) + ...
vs2 * sigma * CnE(:, :, jl)) .* (Ipc * (dt / hb)) + ...
CnB(:,:,jl) * c1 + ...
(vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
CnE(:, :, jl) / (tau1 * dt);
This can be simplified a little bit for the CnE term:
dCnE = (sigma0 * CnZ(:, :, jl) + ...
CnE(:, :, jl)) .* (Ipc * ((vs2 * sigma - sigma) * dt / hb)) + ...
CnB(:,:,jl) * c1 + ...
(vs2 * beta0 * dt / (2*hb)) * CnZ(:, :, jl) .* (Ipc .^ 2) - ...
CnE(:, :, jl) / (tau1 * dt);
Methods:
1. Combine the scalar operations to reduce the number of matrix operations.
2. Move constants out of the loop.
3. Crop the data outside the loops instead of applying an a:b indexing in each iteration.
4. Use auto-expanding instead of inflating by repmat. bsxfun is sometimes faster or slower than auto-expanding - try it.
5. Avoid repeated calculations. In your case e.g. Ipc(:,1)*dt/hb is computed several times. Using this avoids this:
C1 = I(:, jt2) * (dt/hb);
The latter includes another idea: You inflate I(:,jt2) at first by Ipc = repmat(...), and waste time with cropping out a subvector Ipc(:,1) later on. Better create the subvector once only: Ipc = I(:,jt2).
Finally: Replace
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
by:
Cn0 = max(Cn0, 0);
and move it outside the loops, because you do not use the cleaned values inside the loop.

#### 1 Comment

ben huang on 21 May 2018
Sorry for unclear explanation, sigma0, sigma, tau1, tau2, dt, hb, vs2, beta0 are all scalars. I have tested all your suggestions without gpu. The results are as follows (In each method, except the parts mentioned, all other parts are the same as the original one)
method 1,2,5
original (6.8328 sec)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
modified (5.8009 sec)
outside the loop
Cexp=1-exp(-dt/tau2);
vsdh=vs2*sigma*dt/hb;
bd2h=beta0*dt/2/hb;
ddt=dt/tau1;
mvsdh=(1-vs2)*sigma*dt/hb;
mvd2hb=(1-vs2)*dt/(2*hb)*beta0;
msdh=-sigma0*dt/hb;
ddh=dt/hb;
in the loop
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vsdh*CnE(:,1:npmax(jz)-1,jl)).*Ipc+CnB(:,2:npmax(jz),jl)*Cexp...
+vs2*(bd2h*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))-ddt*CnE(:,2:npmax(jz),jl);
dCnB(:,2:npmax(jz))=mvsdh*CnE(:,1:npmax(jz)-1,jl).*Ipc+mvd2hb*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2)...
-CnB(:,2:npmax(jz),jl)*Cexp;
dCnZ(:,2:npmax(jz))=msdh*CnZ(:,2:npmax(jz),jl).*Ipc-bd2h*CnZ(:,2:npmax(jz),jl).*Ipc.^2+CnE(:,1:npmax(jz)-1,jl)*ddt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*ddh+...
-CnE(:,1,jl)*ddt;
dCnZ(:,1)=msdh*CnZ(:,1,jl).*Ipc(:,1)-bd2h*CnZ(:,1,jl).*Ipc(:,1).^2;
good method! thanks a lot!
Method 3:
To corp the arrays out of the inner loop and let the temporary arrays take the place of CnE, CnB, CnZ in inner loop, I have to add some commands to connect them. Except for the replacement of CnE, CnB, CnZ, I used comments to mark other changed parts in the modified version.
original (6.7290 sec)
for jl=1:ns
for jt2=1:(nt+1)
dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt; %(change of S1 state population density)
dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
-Cn2(:,jl)*(1-exp(-dt/tau2));
dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
if (npmax(jz)>nphth)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
CnZ(:,:,jl)=CnZ(:,:,jl)+dCnZ;
CnB(:,:,jl)=CnB(:,:,jl)+dCnB;
CnE(:,:,jl)=CnE(:,:,jl)+dCnE;
end
I(:,jt2)=I(:,jt2)+dI;
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
Cn(:,jl)=Cn(:,jl)+dCn;
Cn0(:,jl)=Cn0(:,jl)+dCn0;
Cn2(:,jl)=Cn2(:,jl)+dCn2;
non0=Cn(:,jl);
non0(non0<0)=0;
Cn(:,jl)=non0;
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
end
end
Modified (8.7912 sec)
for jl=1:ns
if (npmax(jz)>nphth)
CnZt2=CnZ(:,2:npmax(jz),jl); % connecting commands
CnEt2=CnE(:,2:npmax(jz),jl);
CnBt2=CnB(:,2:npmax(jz),jl);
CnZt1=CnZ(:,1:npmax(jz)-1,jl);
CnEt1=CnE(:,1:npmax(jz)-1,jl);
end
for jt2=1:(nt+1)
dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt;
dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
-Cn2(:,jl)*(1-exp(-dt/tau2));
dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
if (npmax(jz)>nphth)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZt2-sigma*CnEt2+vs2*sigma*CnEt1).*Ipc*dt/hb+CnBt2*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZt1.*(Ipc.^2))*dt/(2*hb)-CnEt2/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnEt1.*Ipc*dt/hb+(1-vs2)*(beta0*CnZt1.*(Ipc.^2))*dt/(2*hb)...
-CnBt2*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZt2.*Ipc*dt/hb-(beta0*CnZt2.*Ipc.^2)*dt/(2*hb)+CnEt1/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
CnZt2=CnZt2+dCnZ(:,2:npmax(jz)); % the number of arrays whose changes need to be record is increase from 3 to 5
CnZt1=CnZt1+dCnZ(:,1:npmax(jz)-1);
CnBt2=CnBt2+dCnB(:,2:npmax(jz));
CnEt2=CnEt2+dCnE(:,2:npmax(jz));
CnEt1=CnEt1+dCnE(:,1:npmax(jz)-1);
end
I(:,jt2)=I(:,jt2)+dI;
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
Cn(:,jl)=Cn(:,jl)+dCn;
Cn0(:,jl)=Cn0(:,jl)+dCn0;
Cn2(:,jl)=Cn2(:,jl)+dCn2;
non0=Cn(:,jl);
non0(non0<0)=0;
Cn(:,jl)=non0;
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
end
if (npmax(jz)>nphth) % connecting commands
CnZ(:,2:npmax(jz),jl)=CnZt2;
CnE(:,2:npmax(jz),jl)=CnEt2;
CnB(:,2:npmax(jz),jl)=CnBt2;
CnZ(:,1,jl)=CnZt1(:,1);
CnE(:,1,jl)=CnEt1(:,1);
end
end
Method 4: original (6.7951 sec)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
auto-expanding modified (7.8393 sec)
Ipc=I(:,jt2);
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl)).*Ipc*dt/hb+Cn B(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnE(:,1:npmax(jz)-1,jl).*Ipc*dt/hb+(1-vs2)*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZ(:,2:npmax(jz),jl).*Ipc*dt/hb-(beta0*CnZ(:,2:npmax(jz),jl).*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc.^2))*dt/(2*hb);
bsxfun modified (34.3400 sec)
outside the loop
fun1=@(g,h) g.*h.^2;
inside the loop
Ipc=I(:,jt2);
dCnE(:,2:npmax(jz))=bsxfun(@times,sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+vs2*sigma*CnE(:,1:npmax(jz)-1,jl),Ipc)*dt/hb+CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZ(:,1:npmax(jz)-1,jl).*(Ipc.^2))*dt/(2*hb)-CnE(:,2:npmax(jz),jl)/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*bsxfun(@times,CnE(:,1:npmax(jz)-1,jl),Ipc)*dt/hb+(1-vs2)*(beta0*bsxfun(fun1,CnZ(:,1:npmax(jz)-1,jl),Ipc))*dt/(2*hb)...
-CnB(:,2:npmax(jz),jl)*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*bsxfun(@times,CnZ(:,2:npmax(jz),jl),Ipc)*dt/hb-(beta0*bsxfun(fun1,CnZ(:,2:npmax(jz),jl),Ipc))*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc.^2))*dt/(2*hb);
Maybe the reason is that in modified version, every time Ipc is used (it's used several times in one loop), it's auto-expanded, so the total consumed time is more than the original one, which expand Ipc only one time in one loop. Maybe the bsxfun case has the same reason.
non0-moved Method:
After testing several choices, only the following modification was found a little bit faster. To move the cleaning command of Cn,Cn0 out of the inner loop, I have to exchange the counter of outer loop and inner loop.
original (6.7660 sec)
for jl=1:ns
for jt2=1:(nt+1)
...
...
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
non0=Cn(:,jl);
non0(non0<0)=0;
Cn(:,jl)=non0;
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
end
end
Modified (6.7522 sec)
for jt2=1:(nt+1)
for jl=1:ns
...
...
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
end
Cn0(Cn0<0)=0;
Cn(Cn<0)=0;
end
I appreciate all of your suggestions! But why I can't use gpu to accelerate it? Is my thought in the main post correct ?

Matt J on 20 May 2018
Edited: Matt J on 20 May 2018
Statements like "2:npmax(jz)" which construct index arrays on the right hand side cost, and you are incurring that cost repeatedly by re-executing them unnecessarily.
dCnE(:,2:npmax(jz))=(sigma0*CnZ(:,2:npmax(jz),jl)-sigma*CnE(:,2:npmax(jz),jl)+...
Do things like
indices=2:npmax(jz); %pre-compute
dCnE(:,indices)=(sigma0*CnZ(:,indices,jl)-sigma*CnE(:,indices,jl)+...
Extracting sub-arrays like CnZ(:,2:npmax(jz),jl) is similarly costly. Use temporary variables to avoid this as well, e.g.
temp=CnZ(:,2:npmax(jz),jl);
dCnZ(:,2:npmax(jz))=-sigma0*tmp.*Ipc*dt/hb-(beta0*tmp.*Ipc.^2)*dt/(2*hb)+CnE(:,1:npmax(jz)-1,jl)/tau1*dt;
Finally, repmat should be unnecessary if you are using a Matlab version more recent than R2016b.

Show 1 older comment
ben huang on 21 May 2018
I have tried using temporary arrays without gpu as follows, but the execution time of this block is 9.6524 sec, which is 2.9064 sec more than the original one(without gpu too).
toc
for jl=1:ns
for jt2=1:(nt+1)
dI=(-sigma0*Cn0(:,jl)-sigma*Cn(:,jl)).*I(:,jt2)*dl-(beta0*Cn0(:,jl).*I(:,jt2).^2)*dl;
dCn=(sigma0*Cn0(:,jl)-(1-vs2)*sigma*Cn(:,jl)).*I(:,jt2)*dt/hb+Cn2(:,jl)*(1-exp(-dt/tau2))...
+vs2*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)-Cn(:,jl)/tau1*dt; %(change of S1 state population density)
dCn2=(1-vs2)*sigma*Cn(:,jl).*I(:,jt2)*dt/hb+(1-vs2)*(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)...
-Cn2(:,jl)*(1-exp(-dt/tau2));
dCn0=-sigma0*Cn0(:,jl).*I(:,jt2)*dt/hb-(beta0*Cn0(:,jl).*(I(:,jt2).^2))*dt/(2*hb)+Cn(:,jl)/tau1*dt;
if (npmax(jz)>nphth)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
CnZt2=CnZ(:,2:npmax(jz),jl); % temporary array
CnEt2=CnE(:,2:npmax(jz),jl);
CnBt2=CnB(:,2:npmax(jz),jl);
CnZt1=CnZ(:,1:npmax(jz)-1,jl);
CnEt1=CnE(:,1:npmax(jz)-1,jl);
dCnE(:,2:npmax(jz))=(sigma0*CnZt2-sigma*CnEt2+vs2*sigma*CnEt1).*Ipc*dt/hb+CnBt2*(1-exp(-dt/tau2))...
+vs2*(beta0*CnZt1.*(Ipc.^2))*dt/(2*hb)-CnEt2/tau1*dt;
dCnB(:,2:npmax(jz))=(1-vs2)*sigma*CnEt1.*Ipc*dt/hb+(1-vs2)*(beta0*CnZt1.*(Ipc.^2))*dt/(2*hb)...
-CnBt2*(1-exp(-dt/tau2));
dCnZ(:,2:npmax(jz))=-sigma0*CnZt2.*Ipc*dt/hb-(beta0*CnZt2.*Ipc.^2)*dt/(2*hb)+CnEt1/tau1*dt;
dCnE(:,1)=(sigma0*CnZ(:,1,jl)-sigma*CnE(:,1,jl)).*Ipc(:,1)*dt/hb+...
-CnE(:,1,jl)/tau1*dt;
dCnZ(:,1)=-sigma0*CnZ(:,1,jl).*Ipc(:,1)*dt/hb-(beta0*CnZ(:,1,jl).*(Ipc(:,1).^2))*dt/(2*hb);
CnZ(:,:,jl)=CnZ(:,:,jl)+dCnZ;
CnB(:,:,jl)=CnB(:,:,jl)+dCnB;
CnE(:,:,jl)=CnE(:,:,jl)+dCnE;
end
I(:,jt2)=I(:,jt2)+dI;
non0=I(:,jt2);
non0(non0<0)=0;
I(:,jt2)=non0;
Cn(:,jl)=Cn(:,jl)+dCn;
Cn0(:,jl)=Cn0(:,jl)+dCn0;
Cn2(:,jl)=Cn2(:,jl)+dCn2;
non0=Cn(:,jl);
non0(non0<0)=0;
Cn(:,jl)=non0;
non0=Cn0(:,jl);
non0(non0<0)=0;
Cn0(:,jl)=non0;
end
end
toc
I think maybe the reason is that the cost of producing a temporary array is more than the saved time of using them, because these temporary arrays is used only two or three times.
Matt J on 21 May 2018
I don't see any temporary arrays in what you posted. It looks identical to your original code.
ben huang on 21 May 2018
Sorry for not specifying the changed part! Please refer to the following code, which is the changed part only.
original (6.7460 sec)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
dCnE(: , 2 : npmax(jz)) = ( sigma0 * CnZ(: , 2 : npmax(jz) , jl) - sigma * CnE(: , 2 : npmax(jz) , jl) +...
vs2 * sigma * CnE(: , 1 : npmax(jz) - 1 , jl)) .* Ipc * dt / hb + CnB(: , 2 : npmax(jz) , jl) * ( 1 - exp( -dt / tau2 ))...
+ vs2 * ( beta0 * CnZ(: , 1 : npmax(jz) - 1 , jl) .* ( Ipc .^ 2 )) * dt / ( 2 * hb )- CnE(: , 2 : npmax(jz) , jl) / tau1 * dt ;
dCnB(: , 2 : npmax(jz)) = ( 1 - vs2 ) * sigma * CnE(: , 1 : npmax(jz) - 1 , jl) .* Ipc * dt / hb + ( 1 - vs2 ) * ( beta0 * CnZ(: , 1 : npmax(jz) - 1 , jl) .*( Ipc .^ 2 )) * dt / ( 2 * hb )...
- CnB(: , 2 : npmax(jz) , jl) *( 1 - exp ( -dt / tau2 ));
dCnZ(: , 2 : npmax(jz)) = -sigma0 * CnZ(: , 2 : npmax(jz) , jl) .* Ipc * dt / hb - ( beta0 * CnZ(: , 2 : npmax(jz) , jl) .* Ipc .^ 2 ) * dt / ( 2 * hb ) + CnE(: , 1 : npmax(jz) - 1 , jl) / tau1 * dt ;
Modified (9.6524 sec)
Ipc=repmat(I(:,jt2),[1,npmax(jz)-1]);
CnZt2 = CnZ(: , 2 : npmax(jz) , jl); % temporary array
CnEt2 = CnE(: , 2 : npmax(jz) , jl);
CnBt2 = CnB(: , 2 : npmax(jz) , jl);
CnZt1 = CnZ(: , 1 : npmax(jz) - 1 , jl);
CnEt1 = CnE(: , 1 : npmax(jz) - 1 , jl);
dCnE(: , 2 : npmax(jz)) = ( sigma0 * CnZt2 - sigma * CnEt2 + vs2 * sigma * CnEt1 ) .* Ipc * dt / hb + CnBt2 *( 1 - exp( -dt / tau2 ))...
+vs2 * ( beta0 * CnZt1 .* ( Ipc .^ 2 )) * dt / ( 2 * hb )- CnEt2 / tau1 * dt ;
dCnB(: , 2 : npmax(jz)) = ( 1 - vs2 ) * sigma * CnEt1 .* Ipc * dt / hb + ( 1 - vs2 ) * ( beta0 * CnZt1 .* ( Ipc .^ 2 )) * dt / ( 2 * hb )...
- CnBt2 * ( 1 - exp ( -dt / tau2 )) ;
dCnZ(: , 2 : npmax(jz)) = -sigma0 * CnZt2 .* Ipc * dt / hb - ( beta0 * CnZt2 .* Ipc .^ 2 ) * dt / ( 2 * hb ) + CnEt1 / tau1 * dt ;