GPU speed up for pcg() is disappointing

I am using pcg() to solve x = A\b, where b has about 2e6 elements. I am running R2021b on Ubuntu 20.04 with a Ryzen 9 5950X CPU and an NVIDIA A4000 GPU.
Running this code...
tol = 1e-4;
%solve on the CPU
tic
L = ichol(A, struct('michol','on'));
x = pcg(A, b, tol, 5000, L, L');
fprintf('solve time: %0.4g s\n', toc);
%solve on the GPU
tic
x = pcg(gpuArray(A), b, tol, 50000);
fprintf('solve time: %0.4g s\n', toc);
...gives
pcg converged at iteration 347 to a solution with relative residual 9.6e-05.
solve time: 16.24 s
pcg converged at iteration 7281 to a solution with relative residual 9.9e-05.
solve time: 13.91 s
So, I am seeing a small GPU speed-up (nice, but not very exciting). The fact that the GPU takes 20x the iterations of the CPU makes me think a >20x speed-up could be possible, which would be much more exciting.
The L arguments make a big difference to the CPU speed, but I can't get the GPU version to accept them. Doing
x = pcg(gpuArray(A), b, tol, 5000, L, L');
throws "Error using gpuArray/pcg (line 58) When the first input argument is a sparse matrix, the second preconditioner cannot be a matrix. Use functions for both preconditioners, or multiply the precondition matrices". Doing
x = pcg(gpuArray(A), b, tol, 50000, L*L');
seems to hang MATLAB (after waiting 5 minutes I gave up and had to terminate by restarting MATLAB; ctrl-C did nothing).
Can anyone tell me what is going on here? Is it simply that the GPU version uses a different algorithm, so it makes no sense to compare iteration counts (and the 15% speed-up I see is all I should hope for)? Or is it that I need a different preconditioning approach?
I see there is a possible bug related to this: https://uk.mathworks.com/support/bugreports/details/2534618.

 Accepted Answer

I'm guessing L*L' is extremely dense, which would explain why the solver stalls. On the GPU the preconditioning is (currently) performed using ILU, which, like most sparse operations, needs to be given a sufficiently sparse matrix. Try just passing A as the preconditioner and you may get a better result. Also, try the other solvers (cgs, gmres, lsqr, bicg, etc.).
The solvers on the GPU work differently because a sparse direct solve does not parallelize well - which is also why sparse backslash (\) is generally slow on the GPU, mostly because of the amount of memory needed. That doesn't explain why the solvers do not accept two triangular sparse matrices as preconditioner input - that is something that should be rectified. But the ILU should have much the same effect as ichol does.
I thought I'd be telling you that the GPU is slow because your card isn't very fast at double precision (only 599 GFLOPS). But you're actually doing 20 times the iterations in less time, so it seems you're right: if you hit the right combination of solver and preconditioner, there's a good chance you'll get to the result much faster.
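The suggestions above could be tried along these lines. This is only a sketch using A, b, and tol from the question; the Jacobi (diagonal) preconditioner and the gmres restart/maxit values are assumptions, not something prescribed in the answer:

```matlab
% Sketch: try other Krylov solvers on the GPU with a cheap Jacobi
% (diagonal) preconditioner. Assumes A, b, tol already exist.
Ag = gpuArray(A);
n  = size(A, 1);
M  = spdiags(diag(A), 0, n, n);             % diag(diag(A)), kept sparse

tic; x1 = pcg(Ag, b, tol, 50000, M);        tPcg   = toc;
tic; x2 = bicgstab(Ag, b, tol, 50000, M);   tBicg  = toc;
tic; x3 = gmres(Ag, b, 20, tol, 2500, M);   tGmres = toc;  % restart = 20
fprintf('pcg %.2f s, bicgstab %.2f s, gmres(20) %.2f s\n', ...
    tPcg, tBicg, tGmres);
```

Which solver wins is problem-dependent, so timing all of them on the actual matrix is the only reliable test.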

20 Comments

Thanks for the explanation and I can see the rationale... but it seems L*L' is not much denser than A (see pics below).
Interestingly, doing the following on the CPU:
x = pcg(A, b, tol, 5000, A);
I get
pcg converged at iteration 1 to a solution with relative residual 1.4e-12.
To me, this looks like pcg() is actually choosing to do x = A\b (it takes exactly the same amount of time as A\b). I guess this is some sort of heuristic internal to pcg()?
IIRC I did try this same line on the GPU before and it stalled (I'm not at work, so I can't try the GPU until tomorrow). If pcg() also tries to do A\b directly on the GPU, then this would explain the stall, given that backslash does not parallelize well.
I'll explore the other solvers tomorrow, but I was under the impression that pcg() was the one to go for for symmetric positive definite problems? (That is certainly what I took from the MATLAB docs, anyway.)
This is a standard finite difference problem, so it must be a typical use case for pcg() and therefore a good target for a fix in a future MATLAB release :-). I'm very happy to share A and b if this would help!
"To me, this looks like pcg() is actually choosing to do x=A\b (it takes exactly the same amount of time as A\b). I guess this is some sort of heuristic internal to pcg()?"
No, preconditioning by M is like transforming
A*x = b
into
M*x = y
[A*inv(M)] * y = b
It doesn't form this explicitly, but uses conjugate gradient iteration to minimize
1/2 * || [A*inv(M)] * y - b ||^2
Each iteration requires computing
[A*inv(M)] * pk
where pk is the conjugate direction, which is carried out more or less as
A*(M \ pk)
If M is A, then A*inv(M) is the identity matrix (cond = 1) and pcg converges in one iteration, but it has shifted the calculation to (M \ pk), which is (A \ pk), and pk is b!
So preconditioning with A itself is a nice-try idea but totally ineffective (sorry @Joss Knight). It just shifts the whole calculation somewhere else.
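The shift described above is easy to see on a small CPU example. This is a sketch using a toy Poisson matrix from gallery, not the question's A:

```matlab
% Toy demonstration: preconditioning with A itself "converges" in one
% iteration, but only because each iteration now solves with A internally.
A = gallery('poisson', 50);        % small SPD finite-difference matrix
b = rand(size(A, 1), 1);
[x, flag, relres, iter] = pcg(A, b, 1e-10, 100, A);
fprintf('converged in %d iteration(s); the work moved into M \\ pk\n', iter);
```

The iteration count collapses, but the total work does not, exactly as described.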
That's a satisfying explanation for what I saw. Thanks!
If you pass a preconditioner to GPU pcg then it is decomposed using ILU and the triangular factors are used to perform the preconditioning. A full direct solve is not performed (sorry @Bruno Luong), although that is what happens on the CPU. Anyway, give it a try.
Maybe computing the ILU is problematic for your matrix, although I can't for the life of me think why, with that sparsity pattern. If you could provide your matrix and RHS as a MAT-file, I can look into it.
The plot thickens. I will try with an A preconditioner on the GPU tomorrow but I am (almost) sure that it stalls for >5min.
I can't attach A and b as it's too big... try this link: https://oxfile.ox.ac.uk/oxfile/work/extBox?id=1183718492CCFDCA664.
Joss Knight on 12 Sep 2022
Edited: Joss Knight on 12 Sep 2022
Thank you! I can reproduce your issue and it fits a known issue: https://www.mathworks.com/support/bugreports/details/2534618
This is caused by a change in the implementation of sparse triangular solve in cusparse and requires an updated implementation in some third party software. But we're on it.
In the meantime, if you have access to R2020b or earlier, you'll find the problem alleviated.
However, that said, when I did this in 20b, the rest of the solve was slower, which meant that the preconditioned solve was still slower than the unpreconditioned solve is now. Darn!
I tried to find another solver that worked better than pcg, but had no luck.
FWIW a preconditioner of diag(diag(A)) reduced the number of iterations, but was still a bit slower.
What about the two-preconditioner case that @Dan R asked about?
L = ichol(A, struct('michol','on'));
% L = gpuArray(L);
x = pcg(gpuArray(A), b, tol, 5000, L, L');
Why is this syntax not supported? Are you going to remove the restriction?
Dan R on 12 Sep 2022
Edited: Dan R on 12 Sep 2022
Thanks for investigating, Joss. At least we reached a firm conclusion and I look forward to the next release :-).
Thanks also, Bruno, for your input. I second his request to allow passing the 'split' preconditioner, as this seems highly effective in the CPU implementation (but of course this may not translate directly to the GPU version).
In this case, supporting two input preconditioners wouldn't have helped because the performance issue is in the triangular solve. However, it can be a huge advantage so yes, we want to do it.
Why isn't it supported? Basically because it would have to be restricted to a lower triangular matrix for M1 and an upper triangular matrix for M2. But I think we'll come to that!
Dan R on 12 Sep 2022
Edited: Dan R on 12 Sep 2022
Ok, one final question: Is there a way to make pcg() use single precision? This comes from noting the massive increase in FLOPS on most GPUs (including my A4000) when going from double to single. My tol parameter is only 1e-4, so I could take quite a hit in numerical precision as long as the algorithm remains stable.
Assuming a future version of pcg() hits a ~2.5 s solve time in double precision, dividing by the single-to-double performance ratio gives sub-100 ms solve times, which would be spectacular. I'm sure this is optimistic, but a man can dream (I guess it may be memory-bandwidth limited and so give only a 2x speed-up...?)
Unfortunately, base MATLAB does not support single-precision sparse, and therefore neither does the GPU. The main reasons are that 1) sparse solves are often numerically unstable in single precision, and 2) the backend libraries we use traditionally did not support single. However, 2 is no longer true, so we're hoping to get to this. I've added your request to our database.
Well, I'm definitely looking forward to that release! Sparse A\b is at the bottom of many engineering problems, so if such speed ups are even remotely possible, it would be a worthwhile exercise.
Any idea how long we might be waiting? Will this thread get an update at that point, or do I need to keep checking the changelog?
@Dan R you might take a look at https://people.engr.tamu.edu/davis/suitesparse.html by Tim Davis.
He used to contribute to the MATLAB community and claims to be the origin of backslash (at least for sparse). He seems to be no longer involved with MATLAB (I might be wrong on that), but it looks like he still maintains a library of sparse matrix routines. A knowledgeable person, and I was happy to get help from him when he was still participating in the MATLAB newsgroup.
Hi Dan. I'll try to remember to update this thread when it's done.
I'm hoping the speed improvement will be available next year, support for two triangular preconditioners later than that. Single precision sparse...unknown. As you can imagine, because of our quality requirements, even if I make a change now you would not see it for many months.
Thank you Joss. I'll keep checking the release notes.
Bruno, thanks for the pointer to SuiteSparse. I will take a look.
Hello @Joss Knight.
I wonder if a solution to this issue was found.
Thank you
The bug report shows that this was fixed in R2024a. It took so long because it involved waiting for a fix from NVIDIA who were also changing their APIs at the time. Anyway, I reran the original reproduction steps and can confirm that it's fixed in R2024a and the GPU performance is considerably better than CPU even on a device designed mainly for single precision.
>> tic; pcg(A,b,tol,5000,L,L'); toc
pcg converged at iteration 347 to a solution with relative residual 9.8e-05.
Elapsed time is 30.916549 seconds.
>> tic; pcg(gpuArray(A),b,tol,5000,L*L'); toc
pcg converged at iteration 343 to a solution with relative residual 0.0001.
Elapsed time is 4.651888 seconds.
In R2024b, coming out in a few weeks, we have added support for two triangular preconditioners to the GPU sparse solvers. Using this you get an even better performance as you would expect.
>> tic; pcg(gpuArray(A),b,tol,50000,L,L'); toc
pcg converged at iteration 343 to a solution with relative residual 9.7e-05.
Elapsed time is 3.093898 seconds.
Hello Joss.
Great! thank you. I appreciate the update.
I have just tried R2024b and got a 15 s solve time (NVIDIA A4000 GPU) vs. a 107 s solve time (Ryzen 5950X CPU) for an example problem. A ~7.1x speed-up from changing one line of code is nice!
Now if only we could have a single precision option...
Thanks for the update, Joss (and to others that helped).


More Answers (2)

The gpuArray version of pcg has not been updated since 2018, so it lags somewhat behind the CPU version. For sparse input, only a single preconditioner matrix is supported, not two as in the CPU case. Refer to the help for the GPU version (type help('gpuArray/pcg') or edit(...)) for details.
Perhaps you could try the GPU version with a dense (non-sparse) input - it's wasteful in memory, but possibly faster than the sparse CPU version:
x = pcg(gpuArray(double(A)), b, tol, 5000, L, L');

7 Comments

Thanks for the hint about the help. The relevant lines are:
For sparse matrix input A, the following apply:
* Only one sparse matrix preconditioner M1 is supported.
* To use two preconditioners, M1 and M2, they must both be functions.
Do you have any hints on how I should prepare a single preconditioning matrix for the GPU version? My attempt to use L*L' does not seem to work (see my first post).
I'm not sure if the second point implies that I could write a functional form for M1 and M2 and this would work for sparse matrices, or if this only applies to full matrices.
Unfortunately, using double() would mean I need a graphics card with 8*(2e6)^2 bytes = 32 TB of memory, so I can't go that route!
Can you try this (I don't have Nvidia gc):
pcg(gpuArray(A),b,[],[],@(x)L\x,@(x)L'\x)
Thanks for the suggestion. For
pcg(gpuArray(A),b,[],[],@(x)L\x,@(x)L'\x)
I get
Error using iterapp (line 66)
user supplied function ==> @(x)gather(L\x) failed with the following error:
Sparse MLDIVIDE only supports sparse square matrices divided by full column vectors.
Error in parallel.internal.flowthrough.pcg (line 198)
y = iterapp('mldivide',m1fun,m1type,m1fcnstr,r,varargin{:});
Error in gpuArray/pcg (line 65)
[varargout{1:nargout}] = parallel.internal.flowthrough.pcg(varargin{:});
For
pcg(gpuArray(A),b,[],[],@(x)L\double(x),@(x)L'\double(x))
and
pcg(gpuArray(A),b,[],[],@(x)gather(L)\double(x),@(x)gather(L')\double(x))
I get
Error using iterapp (line 66)
user supplied function ==> @(x)L\double(x) failed with the following error:
Sparse gpuArrays are not supported for this function.
Error in parallel.internal.flowthrough.pcg (line 198)
y = iterapp('mldivide',m1fun,m1type,m1fcnstr,r,varargin{:});
Error in gpuArray/pcg (line 65)
[varargout{1:nargout}] = parallel.internal.flowthrough.pcg(varargin{:});
So to me it looks like MATLAB won't let us pass a gpuArray to the user function, so this approach can never work.
"user supplied function ==> @(x)gather(L\x) failed with the following error:
Sparse MLDIVIDE only supports sparse square matrices divided by full column vectors."
L is square, so perhaps x is sparse when the function is called?
One more try
pcg(gpuArray(A),b,[],[],@(x)L\full(x),@(x)L'\full(x))
It looks like pcg is badly designed for gpuArray.
That works in that it doesn't throw an error. But it takes >5min to run (I eventually forced it to stop). I can't see any GPU activity on the nVidia monitor, and I see 100% CPU on one core. So it looks like this has pushed most (or all) of the work back on to the CPU :-(.
I agree, I think pcg() for sparse inputs on the GPU needs an update to match the good performance of the CPU version. A 20x speed-up (which is what the CPU vs. GPU iteration counts suggest) would be very impressive. Perhaps MathWorks can add this to their feature request list?
Bruno Luong on 9 Sep 2022
Edited: Bruno Luong on 9 Sep 2022
You shouldn't dream of a 20x acceleration. Preconditioning is efficient only if it improves the condition number and solving with M1*M2 is cheap compared to computing A*x.
Incomplete Cholesky (ichol) and incomplete LU (ilu) preconditioners are relatively expensive to apply, since applying them is almost like solving with A.
That's why people look for cheap permutations that render the matrix diagonally dominant, and then approximate the matrix by a narrow-band matrix for M (which is cheap to solve).
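The narrow-band idea could be sketched like this. The half-bandwidth k is a hypothetical tuning parameter, and the extracted band is not guaranteed to remain positive definite, so ichol may reject it for some matrices:

```matlab
% Sketch: precondition with a cheap narrow-band approximation of A
% instead of A itself. k is a hypothetical half-bandwidth to tune.
k     = 2;
Mband = triu(tril(A, k), -k);      % keep only a band of half-width k
Lb    = ichol(Mband, struct('michol', 'on'));   % may fail if band not SPD
x     = pcg(A, b, tol, 5000, Lb, Lb');
```

Whether this beats ichol on the full A depends on how much of A's weight lies inside the band.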
But ichol() seems fast for me:
tic; L = ichol(A, struct('michol','on')); toc; %this runs on the CPU and takes ~0.07s
tic; x = pcg(A, b, tol, 5000, L, L'); toc; %this runs on the CPU and takes ~16s
In summary:
  1. CPU version of pcg() with L: 347 iterations, 16 sec.
  2. CPU version of pcg() without L: 7285 iterations, 106 sec.
  3. GPU version of pcg() with L: does not work.
  4. GPU version of pcg() without L: 7281 iterations, 14 sec.
2 and 4 look very similar in terms of behaviour. I am hoping that 1 and 3 would also have similar behaviour with a corresponding speed up of 106/16 = 6.6 times (so not 20x as I said above), i.e. I expect/hope 3 would take 14/6.6 = 2.1 s if it could be made to work.
I confess I am not an expert on the numerical methods being used here, so I may be missing something...


It looks like you can simply replace your current call to pcg with
x = pcg(A, b, tol, 5000, @(y) L\y, @(y)L'\y);
as the error message just says that only function-handle input is supported here (though it's not clear to me why that restriction exists).
Perhaps it makes sense to also cast the matrix L to a gpuArray?

1 Comment

I tried this after advice from Bruno Luong above - no luck I'm afraid.


Release: R2021b

Asked: 7 Sep 2022
Commented: 19 Sep 2024