Speeding up code which integrates a 2D gpuArray matrix using Simpson's Rule

I have a (real) 2D gpuArray, which I am using as part of a larger code, and now am trying to also integrate the array using the Composite Simpson Rule inside my main loop (several 10000 iterations at least). A MWE looks like the following:
%%%%%%%%%%%%%%%%%% MAIN CODE %%%%%%%%%%%%%%%%%%
Ny = 501; % Dimensions of matrix M
Nx = 501; %
dx = 0.1; % Grid spacings
dy = 0.2; %
M = rand(Ny, Nx, 'gpuArray'); % Initialise a matrix
for k = 1:10000
% M = function1(M) % Apply some other functions to M
% ... etc ...
I = simpsons_integration_2D(M, dx, dy, Nx, Ny); % Now integrate M
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%% INTEGRATOR %%%%%%%%%%%%%%%%%%
function I = simpsons_integration_2D(F, dx, dy, Nx, Ny)
% Integrate the 2D function F with Nx columns and Ny rows, and grid spacings
% dx and dy using Simpson's rule.
% Integrate along x direction (vertically) --> IX is a vector afterwards
sX = sum( F(:,1:2:Nx-2) + 4*F(:,2:2:(Nx-1)) + F(:,3:2:Nx) , 2);
IX = dx/3 * sX;
% Integrate along y direction --> I is a scalar afterwards
sY = sum( IX(1:2:Ny-2) + 4*IX(2:2:(Ny-1)) + IX(3:2:Ny) , 1);
I = dy/3 * sY;
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The operation of performing the integration is around 850 µs, which is currently a significant part of my code. This was measured using
f = @() simpsons_integration_2D(M, dx, dy, Nx, Ny);
t = gputimeit(f)
Is there a way to reduce the execution time for integrating the gpuArray matrix?
(The graphics card is the Nvidia Quadro P4000)

 Accepted Answer

These are faster at least on CPU - I do not have graphics hardware in my laptop which supports GPU arrays:
function I = simpsons_integration_2D_v2(F, dx, dy, Nx, Ny)
% Avoid summation of large intermediate arrays:
sX = F(:,1) + 2 * sum(F(:, 3:2:(Nx-2)), 2) + ...
4 * sum(F(:, 2:2:(Nx-1)), 2) + F(:,Nx);
sY = sX(1) + 2 * sum(sX(3:2:(Ny-2))) + ...
4 * sum(sX(2:2:(Ny-1))) + sX(Ny);
I = dx * dy / 9 * sY;
end
function I = simpsons_integration_2D_v3(F, dx, dy, Nx, Ny)
% Perform summation as dot product:
v = [1, repmat([4, 2], 1, (Nx - 3) / 2), 4, 1];
sX = F * v';
if Nx ~= Ny
v = [1, repmat([4, 2], 1, (Ny - 3) / 2), 4, 1];
end
sY = v * sX;
I = dx * dy / 9 * sY;
end
Compared timeit output on my mobil i5:
0.00111591015 original
0.00066621015 without large intermediate array
0.00012221015 as matrix multiplication
So this is faster on my old 2 core i5 than on your GPU. Maybe this gives a speedup on your GPU also. I assume, in version v3 the vector v must be created on the GPU also.

3 Comments

This is great, thank you. Your v2 code performs the best on my machine with the GPU, as it also did on your CPU.
Since Nx and Ny are constant in my main loop, I have moved the creation of your vectors outside the loop (and converted them to gpuArrays as well like you suggested). The test now looks like the following, and I am taking a mean of x100 gputimeit() results:
SimpVec1 = [1, repmat([4, 2], 1, (Nx - 3) / 2), 4, 1]; % Precalculate the Simpson vectors
SimpVec2 = [1, repmat([4, 2], 1, (Ny - 3) / 2), 4, 1];
SimpVec1 = gpuArray(SimpVec1); % Move vectors to GPU
SimpVec2 = gpuArray(SimpVec2);
f_orig = @() simpsons_integration_2D(F, dx, dy, Nx, Ny);
f_v2 = @() simpsons_integration_2D_v2(F, dx, dy, Nx, Ny);
f_v3 = @() simpsons_integration_2D_v3(F, dx, dy, SimpVec1, SimpVec2);
t_orig = zeros(1,100);
t_v2 = zeros(1,100);
t_v3 = zeros(1,100);
for k= 1:100 % Do 100 measurements using gputimeit() for each version
t_orig(k) = gputimeit( f_orig );
t_v2(k) = gputimeit( f_v2 );
t_v3(k) = gputimeit( f_v3 );
end
figure;hold all;
plot(t_orig*1e3);plot(t_v2*1e3);plot(t_v3*1e3);
fprintf('Average time per execution for original = %0.2f [ms] \n', mean(t_orig)*1e3);
fprintf('Average time per execution for Jan''s v2 = %0.2f [ms] \n', mean(t_v2)*1e3);
fprintf('Average time per execution for Jan''s v3 = %0.2f [ms] \n', mean(t_v3)*1e3);
The results are
Average time per execution for original = 0.79 [ms]
Average time per execution for Jan''s v2 = 1.02 [ms]
Average time per execution for Jan''s v3 = 0.23 [ms]
This 230µs represents a speedup of about x3.5 from my original. Clearly I am not even reaching the ~120µs that you obtained on your CPU, but since my array M is already on the GPU (because the other functions in my main loop are optimized when run on the GPU), I think this calculation needs to stay on the GPU. Otherwise I would have to gather() each time in the loop before doing the integration. Do you think this is the best I can do?
Thanks for your help.
On the CPU it is slightly faster to perform it in one step:
s = vY * F * vX.';
I = s * dx * dy / 9;
But this might be an the effect of the JIT acceleration.
Another approach, which is 50% slower than the v3 version on the CPU:
vX = [1, repmat([4, 2], 1, (Nx - 3) / 2), 4, 1].';
vY = [1, repmat([4, 2], 1, (Ny - 3) / 2), 4, 1];
vXY = vY.' * vX.';
function I = simpsons_integration_2D_v5(F, dx, dy, vXY)
s = F(:).' * vXY(:);
I = s * dx * dy / 9;
end
Hi @Jan. Thank you. Along with the help of rahnema1 over on SE (https://stackoverflow.com/q/66140278/12921500), we arrived at a quick method by expanding the computation into a dot product. Since none of Nx, Ny, dx or dy are changing in the main loop, much of this can be precalculated into a single vector Simp_coeff as follows:
function Simp_Coeffs = generate_coeffs_for_SimpsonIntegration(dx, dy, Nx, Ny)
% Generate coefficients for integrating a 2D function with Nx columns and Ny rows,
% and grid spacings dx and dy, using Simpson's rule.
% If Nx and Ny are odd the integral can be calculated directly using Simpson's
% composite rule. If Nx/Ny are even, the composite Simpson's rule is used for the
% first (n-3) = even strips, and the Simpson's 3/8 rule is used for the final three strips.
% Credit for this goes here: https://scicomp.stackexchange.com/a/25669/38005
% and here: https://stackoverflow.com/q/66140278/12921500
if rem(Nx,2) % If number of columns is odd
SimpVec1 = (dx/3)*[1, repmat([4, 2], 1, (Nx-3)/2), 4, 1];
else % If number of columns is even
SimpVec1 = dx*[1/3, repmat([4, 2]/3, 1, (Nx-6)/2), 4/3, [17/3 9 9 3]/8 ];
end
if rem(Ny,2) % If number of rows is odd
SimpVec2 = (dy/3)*[1, repmat([4, 2], 1, (Ny-3)/2), 4, 1];
else % If number of rows is even
SimpVec2 = dy*[1/3, repmat([4, 2]/3, 1, (Ny-6)/2), 4/3, [17/3 9 9 3]/8 ];
end
% Convert from matrix to vector, ready to perform dot product later
Simp_Coeffs = reshape(SimpVec1 .* SimpVec2.', 1, []);
end
where I have now accounted for the possibility of an even number of points being supplied (Simpson's composite rule only works for odd number of points usually).
The integrator function now looks like:
function I = simpsons_integration_2D_dotProduct(F, Simp_coeffs)
% Perform integration using Simpson's composite rule as dot product
I = Simp_coeffs * F(:);
end
and the main code looks like:
Simp_Coeffs = generate_coeffs_for_SimpsonIntegration(dx, dy, Nx, Ny);
Simp_Coeffs = gpuArray(Simp_Coeffs); % Move coefficients to GPU
%%% Main Loop %%%
for k = 1:10000
% M = function1(M) % Apply some other functions to M
% ... etc ...
I = simpsons_integration_2D_dotProduct(F, Simp_coeffs); % Now integrate M
end
%%%%%%%%%%%%%%%%%
This reduces the time taken to 77 µs down from 230 µs on my GPU (again using gputimeit, averaged over 1000 runs), which is more than a factor of x11 speedup over the 850 µs of my code in the OP.
Thank you very much, I have learned a lot with this exercise.

Sign in to comment.

More Answers (0)

Categories

Find more on Parallel Computing Toolbox in Help Center and File Exchange

Products

Release

R2020b

Asked:

on 10 Feb 2021

Edited:

on 11 Feb 2021

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!