Speeding up code which integrates a 2D gpuArray matrix using Simpson's Rule

Question

1 vote

I have a (real) 2D gpuArray, which I am using as part of a larger code, and now am trying to also integrate the array using the Composite Simpson Rule inside my main loop (several 10000 iterations at least). A MWE looks like the following:

%%%%%%%%%%%%%%%%%% MAIN CODE %%%%%%%%%%%%%%%%%%
Ny = 501;    % Dimensions of matrix M
Nx = 501;    %
dx = 0.1;    % Grid spacings
dy = 0.2;    %
M = rand(Ny, Nx, 'gpuArray'); % Initialise a  matrix
for k = 1:10000
    % M = function1(M)  % Apply some other functions to M
    % ... etc ...
    I = simpsons_integration_2D(M, dx, dy, Nx, Ny); % Now integrate M
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%% INTEGRATOR %%%%%%%%%%%%%%%%%%
function I = simpsons_integration_2D(F, dx, dy, Nx, Ny)
% Integrate the 2D function F with Nx columns and Ny rows, and grid spacings
% dx and dy using Simpson's rule.
% Integrate along x direction (vertically) --> IX is a vector afterwards
sX = sum( F(:,1:2:Nx-2) + 4*F(:,2:2:(Nx-1)) + F(:,3:2:Nx) , 2);
IX = dx/3 * sX;
% Integrate along y direction --> I is a scalar afterwards
sY = sum( IX(1:2:Ny-2) + 4*IX(2:2:(Ny-1)) + IX(3:2:Ny) , 1);
I = dy/3 * sY;
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The operation of performing the integration is around 850 µs, which is currently a significant part of my code. This was measured using

f = @() simpsons_integration_2D(M, dx, dy, Nx, Ny);
t = gputimeit(f)

Is there a way to reduce the execution time for integrating the gpuArray matrix?

(The graphics card is the Nvidia Quadro P4000)

1 Comment
Show -1 older comments Hide -1 older comments

Jan on 10 Feb 2021

A clear question with MWE - nice! Thanks. +1

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Jan on 10 Feb 2021

Edited: Jan on 10 Feb 2021

Open in MATLAB Online

1 vote

These are faster at least on CPU - I do not have graphics hardware in my laptop which supports GPU arrays:

function I = simpsons_integration_2D_v2(F, dx, dy, Nx, Ny)
% Avoid summation of large intermediate arrays:
sX = F(:,1) + 2 * sum(F(:, 3:2:(Nx-2)), 2) + ...
              4 * sum(F(:, 2:2:(Nx-1)), 2) + F(:,Nx);
sY = sX(1)  + 2 * sum(sX(3:2:(Ny-2)))      + ...
              4 * sum(sX(2:2:(Ny-1)))      + sX(Ny);
I = dx * dy / 9 * sY;
end
function I = simpsons_integration_2D_v3(F, dx, dy, Nx, Ny)
% Perform summation as dot product:
v  = [1, repmat([4, 2], 1, (Nx - 3) / 2), 4, 1];
sX = F * v';
if Nx ~= Ny
   v = [1, repmat([4, 2], 1, (Ny - 3) / 2), 4, 1];
end
sY = v * sX;
I  = dx * dy / 9 * sY;
end

Compared timeit output on my mobil i5:

00111591015   original
00066621015   without large intermediate array
00012221015   as matrix multiplication

So this is faster on my old 2 core i5 than on your GPU. Maybe this gives a speedup on your GPU also. I assume, in version v3 the vector v must be created on the GPU also.

3 Comments
Show 1 older comment Hide 1 older comment

Thomas Barrett on 11 Feb 2021

Open in MATLAB Online

This is great, thank you. Your v2 code performs the best on my machine with the GPU, as it also did on your CPU.

Since Nx and Ny are constant in my main loop, I have moved the creation of your vectors outside the loop (and converted them to gpuArrays as well like you suggested). The test now looks like the following, and I am taking a mean of x100 gputimeit() results:

SimpVec1 = [1, repmat([4, 2], 1, (Nx - 3) / 2), 4, 1]; % Precalculate the Simpson vectors
SimpVec2 = [1, repmat([4, 2], 1, (Ny - 3) / 2), 4, 1];
SimpVec1 = gpuArray(SimpVec1); % Move vectors to GPU
SimpVec2 = gpuArray(SimpVec2);
f_orig = @() simpsons_integration_2D(F, dx, dy, Nx, Ny);
f_v2 = @() simpsons_integration_2D_v2(F, dx, dy, Nx, Ny);
f_v3 = @() simpsons_integration_2D_v3(F, dx, dy, SimpVec1, SimpVec2);
t_orig = zeros(1,100);
t_v2 = zeros(1,100);
t_v3 = zeros(1,100);
for k= 1:100 % Do 100 measurements using gputimeit() for each version
    t_orig(k) = gputimeit( f_orig );
    t_v2(k)   = gputimeit( f_v2 );
    t_v3(k)   = gputimeit( f_v3 );
end
figure;hold all;
plot(t_orig*1e3);plot(t_v2*1e3);plot(t_v3*1e3);
fprintf('Average time per execution for original  = %0.2f [ms] \n', mean(t_orig)*1e3);
fprintf('Average time per execution for Jan''s v2 = %0.2f [ms] \n', mean(t_v2)*1e3);
fprintf('Average time per execution for Jan''s v3 = %0.2f [ms] \n', mean(t_v3)*1e3);

The results are

Average time per execution for original = 0.79 [ms]

Average time per execution for Jan''s v2 = 1.02 [ms]

Average time per execution for Jan''s v3 = 0.23 [ms]

This 230µs represents a speedup of about x3.5 from my original. Clearly I am not even reaching the ~120µs that you obtained on your CPU, but since my array M is already on the GPU (because the other functions in my main loop are optimized when run on the GPU), I think this calculation needs to stay on the GPU. Otherwise I would have to gather() each time in the loop before doing the integration. Do you think this is the best I can do?

Thanks for your help.

Jan on 11 Feb 2021

Edited: Jan on 11 Feb 2021

Open in MATLAB Online

On the CPU it is slightly faster to perform it in one step:

s = vY * F * vX.';
I = s * dx * dy / 9;

But this might be an the effect of the JIT acceleration.

Another approach, which is 50% slower than the v3 version on the CPU:

vX  = [1, repmat([4, 2], 1, (Nx - 3) / 2), 4, 1].';
vY  = [1, repmat([4, 2], 1, (Ny - 3) / 2), 4, 1];
vXY = vY.' * vX.';
function I = simpsons_integration_2D_v5(F, dx, dy, vXY)
s = F(:).' * vXY(:);
I = s * dx * dy / 9;
end

Thomas Barrett on 11 Feb 2021

Edited: Thomas Barrett on 11 Feb 2021

Open in MATLAB Online

Hi @Jan. Thank you. Along with the help of rahnema1 over on SE (https://stackoverflow.com/q/66140278/12921500), we arrived at a quick method by expanding the computation into a dot product. Since none of Nx, Ny, dx or dy are changing in the main loop, much of this can be precalculated into a single vector Simp_coeff as follows:

function Simp_Coeffs = generate_coeffs_for_SimpsonIntegration(dx, dy, Nx, Ny)
% Generate coefficients for integrating a 2D function with Nx columns and Ny rows,
% and grid spacings dx and dy, using Simpson's rule.
% If Nx and Ny are odd the integral can be calculated directly using Simpson's 
% composite rule. If Nx/Ny are even, the composite Simpson's rule is used for the 
% first (n-3) = even strips, and the Simpson's 3/8 rule is used for the final three strips.
% Credit for this goes here: https://scicomp.stackexchange.com/a/25669/38005
% and here: https://stackoverflow.com/q/66140278/12921500
if rem(Nx,2) % If number of columns is odd
    SimpVec1 = (dx/3)*[1, repmat([4, 2], 1, (Nx-3)/2), 4, 1]; 
else         % If number of columns is even
    SimpVec1 = dx*[1/3, repmat([4, 2]/3, 1, (Nx-6)/2), 4/3, [17/3 9 9 3]/8 ];
end
if rem(Ny,2) % If number of rows is odd
    SimpVec2 = (dy/3)*[1, repmat([4, 2], 1, (Ny-3)/2), 4, 1];
else         % If number of rows is even
    SimpVec2 = dy*[1/3, repmat([4, 2]/3, 1, (Ny-6)/2), 4/3, [17/3 9 9 3]/8 ];
end
% Convert from matrix to vector, ready to perform dot product later
Simp_Coeffs = reshape(SimpVec1 .* SimpVec2.', 1, []);
end

where I have now accounted for the possibility of an even number of points being supplied (Simpson's composite rule only works for odd number of points usually).

The integrator function now looks like:

function I = simpsons_integration_2D_dotProduct(F, Simp_coeffs)
% Perform integration using Simpson's composite rule as dot product
I = Simp_coeffs * F(:);
end

and the main code looks like:

Simp_Coeffs = generate_coeffs_for_SimpsonIntegration(dx, dy, Nx, Ny); 
Simp_Coeffs = gpuArray(Simp_Coeffs); % Move coefficients to GPU
%%% Main Loop %%%
for k = 1:10000
    % M = function1(M)  % Apply some other functions to M
    % ... etc ...
    I = simpsons_integration_2D_dotProduct(F, Simp_coeffs); % Now integrate M
end
%%%%%%%%%%%%%%%%%

This reduces the time taken to 77 µs down from 230 µs on my GPU (again using gputimeit, averaged over 1000 runs), which is more than a factor of x11 speedup over the 850 µs of my code in the OP.

Thank you very much, I have learned a lot with this exercise.

Sign in to comment.

Speeding up code which integrates a 2D gpuArray matrix using Simpson's Rule

1 Comment
Show -1 older comments Hide -1 older comments

Accepted Answer

3 Comments
Show 1 older comment Hide 1 older comment

More Answers (0)

Categories

Products

Release

Tags

Community Treasure Hunt

Speeding up code which integrates a 2D gpuArray matrix using Simpson's Rule

1 Comment Show -1 older comments Hide -1 older comments

Accepted Answer

3 Comments Show 1 older comment Hide 1 older comment

More Answers (0)

Categories

Products

Release

Tags

See Also

Community Treasure Hunt

1 Comment
Show -1 older comments Hide -1 older comments

3 Comments
Show 1 older comment Hide 1 older comment