GPU speed up for pcg() is disappointing
Show older comments
I am using the pcg() to solve x =A\b. b is about 2e6 long. I am using R2021b on Ubuntu 20.04 with a Ryzen 9 5950x CPU and an nVidia A4000 GPU.
Running this code...
tol = 1e-4;
%solve on the CPU
tic
L = ichol(A, struct('michol','on'));
x = pcg(A, b, tol, 5000, L, L');
fprintf('solve time: %0.4g s\n', toc);
%solve on the GPU
tic
x = pcg(gpuArray(A), b, tol, 50000);
fprintf('solve time: %0.4g s\n', toc);
...gives
pcg converged at iteration 347 to a solution with relative residual 9.6e-05.
solve time: 16.24 s
pcg converged at iteration 7281 to a solution with relative residual 9.9e-05.
solve time: 13.91 s
So, I am seeing a small GPU speed up (nice, but not very exciting). The fact that the GPU takes 20x the iterations of the CPU makes me think there could be a >20x speed up possible, which would be much more exciting.
The L arguments make a big difference to the CPU speed, but I can't get the GPU version to take it. Doing
x = pcg(gpuArray(A), b, tol, 5000, L, L');
throws "Error using gpuArray/pcg (line 58) When the first input argument is a sparse matrix, the second preconditioner cannot be a matrix. Use functions for both preconditioners, or multiply the precondition matrices". Doing
x = pcg(gpuArray(A), b, tol, 50000, L*L');
seems to hang MATLAB (after waiting 5 minutes I gave up and had to terminate by restarting MATLAB; ctrl-C did nothing).
Can anyone tell me what is going on here? Is it simply that the GPU version is using a different algorithm and so it makes no sense to compare number of iterations (and so the 15% speed up I see is all I should hope for)? Or is it that I need a different preconditioning approach?
I see there is a possible bug related to this: https://uk.mathworks.com/support/bugreports/details/2534618.
Accepted Answer
More Answers (2)
Yair Altman
on 8 Sep 2022
The gpuArray version of pcg has not been updated since 2018, so it is somewhat lagging compared to the CPU version. Preconditioner input for sparse input is only supported for a single preconditioner matrix, not two as in the CPU case. Refer to the help of the GPU version (type help('gpuArray/pcg') or edit(...) for details).
Perhaps you could try to use the GPU version with a dense (non-sparse) input - it's wasteful in memory but perhaps possibly faster than the sparse-CPU version:
x = pcg(gpuArray(double(A)), b, tol, 5000, L, L');
7 Comments
Dan R
on 8 Sep 2022
Bruno Luong
on 8 Sep 2022
Can you try this (I don't have Nvidia gc):
pcg(gpuArray(A),b,[],[],@(x)L\x,@(x)L'\x)
Dan R
on 9 Sep 2022
Bruno Luong
on 9 Sep 2022
"user supplied function ==> @(x)gather(L\x) failed with the following error:
Sparse MLDIVIDE only supports sparse square matrices divided by full column vectors."
L is square, so might be x is sparse when the function is called?
One more try
pcg(gpuArray(A),b,[],[],@(x)L\full(x),@(x)L'\full(x))
It looks like pcg is badly designed for gpuArray.
Dan R
on 9 Sep 2022
Bruno Luong
on 9 Sep 2022
Edited: Bruno Luong
on 9 Sep 2022
You shouldn't dream much on 20x acceleration. The preconditioning is efficient only if it improves the condition number and the resolution of M1*M2*x is cheap compared to A*x.
incomplete cholesky ichol and lu ilu are relatively expensive, since it is almost like solving A*x.
That's why people looks for permutations (cheap) that render matrix diagonal dominant, and approximate the matrix by narrow band matrix for M (cheap to solve).
Christine Tobler
on 12 Sep 2022
It looks like you can simply replace your current call to pcg with
x = pcg(A, b, tol, 5000, @(y) L\y, @(y)L'\y);
as the error message just says that it only supports function handle input here (though it's not clear to me why that restriction is there).
Perhaps it makes sense to also cast the matrix L to a gpuArray?
Categories
Find more on Sparse Matrices in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!
