- Yes! Isn't that great?
- Yes, because there are two problems with your code: (a) you're using a lot of for loops instead of vector operations and (b) you're measuring GPU performance incorrectly. To fix (a), you should read this doc page that explains how to vectorize your code to get the best performance. To fix (b), you should take a look at my answer to a previous question and use the functions timeit and gputimeit.
GPU utilization and parallel computation With Matlab for heavy computation
I have a decent machine with a Core i7 (8 cores), 32 GB of RAM and an Nvidia GeForce GTX 1080i, running MATLAB R2018b. At the moment I am a bit confused about how to use these resources in the best way to run my Monte Carlo simulation code. I have two questions:
1- How can I run all the heavy computation on the GPU, alongside the parallel computing capability of MATLAB, rather than on the CPU, so that I can decide which is best to use? I have read different help topics, and my conclusion is that the data I work with should be in the form of a gpuArray. Am I right, or am I missing something? Let us assume that I have the following simple code to run on the GPU:
First_Vector=zeros(2,3);
% First_Vector=zeros(2,3,'gpuArray'); 1
[N,M]=size(First_Vector);
%[N,M]=size(First_Vector,'gpuArray'); 2
Second_Matrix=ones(N,M,2);
%Second_Matrix=ones(N,M,2,'gpuArray'); 3
Test1= [20 20 20;30 30 30];
%Test1=gpuArray(Test1); 4
Test2= [50 50 50;60 60 60];
%Test2=gpuArray(Test2); 5
K=100;
% the main code
for i=1:N
    for j=1:M
        [element]=Function1(Test1(i,j),K);
        Test1(i,j)=element;
    end
end
Second_Matrix(:,:,1)=Test1;
[Test1]=Function2(Test1,Test2);
% End of the main code
%% Function 1
function[outcome]=Function1(A,K)
outcome=A+K;
end
%%Function 2
function[T1]=Function2(T1,T2)
T1=T1+T2;
end
Are the commented lines (1-5) enough to run the 'main code' on the GPU?
2- I tested the following simple code on the GPU and on the CPU, and CPU performance was far better than GPU performance. Is that normal?
Thanks in advance.
G = ones(10,10,'gpuArray');
tic
for k=1:100
    for i=1:1000
        for j=1:10
            G(j,:)=G(j,:)+2;
        end
    end
end
toc
G = ones(10,10);
tic
for k=1:100
    for i=1:1000
        for j=1:10
            G(j,:)=G(j,:)+2;
        end
    end
end
toc
% Elapsed time is 0.628241 seconds.
Accepted Answer
Andrea Picciau
on 3 Jul 2019
Edited: Andrea Picciau
on 3 Jul 2019
I'll try to answer your questions in order...
3 Comments
Andrea Picciau
on 9 Jul 2019
A quick comment about what I meant by vectorising your code: I was looking at this bit here
for k = 1:100
    for i = 1:1000
        for j = 1:10
            G(j,:) = G(j,:) + 2;
        end
    end
end
which could really be written as
G = G + 200000;
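If you want to convince yourself that the two forms agree, a small CPU-only check along these lines should do it. This is just a sketch; the names loopResult and vectorisedResult are made up for illustration and do not come from the question.

```matlab
% Sanity check: the triple loop adds 2 to every element 100*1000 times,
% which is the same as adding 200000 once.
G = ones(10, 10);

loopResult = G;
for k = 1:100
    for i = 1:1000
        for j = 1:10
            loopResult(j,:) = loopResult(j,:) + 2;
        end
    end
end

vectorisedResult = G + 200000;

% Both results hold small integers stored as doubles, so the
% comparison is exact and isequal is safe to use here.
disp(isequal(loopResult, vectorisedResult))
```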
I imagined you were just trying to benchmark the same operation executed in a for loop, so I wrote a quick script for that. I benchmarked three versions of the same algorithm:
- the fully vectorised version,
- a for loop with some vectorisation,
- a for loop without vectorisation.
My GPU is a Tesla K40c and my processor is an Intel Xeon E5-1650.
Let me show you my script:
numRows = 1000;
cpuData = ones(numRows, numRows);
gpuData = gpuArray(cpuData);
timeit(@() iVectorised(cpuData), 1) % 0.0030 seconds
gputimeit(@() iVectorised(gpuData), 1) % 9.3611e-05 seconds
timeit(@() iForLoop(cpuData), 1) % 0.0145 seconds
gputimeit(@() iForLoop(gpuData), 1) % 0.0011 seconds
timeit(@() iForLoopWithIndexing(cpuData, numRows), 1) % 0.2310 seconds
gputimeit(@() iForLoopWithIndexing(gpuData, numRows), 1) % 12.6261 seconds
%% HELPER FUNCTIONS
function dataOut = iVectorised(dataIn)
    % Completely vectorised
    dataOut = dataIn + 200;
end
function dataOut = iForLoop(dataIn)
    % Partially vectorised, external for loop remains
    for k = 1:100
        dataIn = dataIn + 2;
    end
    dataOut = dataIn;
end
function dataOut = iForLoopWithIndexing(dataIn, numRows)
    % Completely non-vectorised, uses indexing
    for k = 1:100
        for i = 1:numRows
            dataIn(i,:) = dataIn(i,:) + 2;
        end
    end
    dataOut = dataIn;
end
What you're observing is the last case (for loop without vectorisation). The reason it takes so long on the GPU is that indexing gpuArrays is very expensive. For example, you are:
- moving the index i to the GPU,
- creating a temporary gpuArray,
- writing dataIn(i,:) to this temporary GPU array. To do this, you have to index dataIn by row rather than by column (indexing by column is usually faster, because MATLAB stores arrays in column-major order),
- scheduling dataIn(i,:) + 2 on the GPU,
- assigning the output of this operation back to the right elements of dataIn.
To do most of these things, you need to communicate back and forth between the CPU and the GPU, which is going to affect your performance (note: this is true for any GPU code, not just if you're using MATLAB). The vectorised version is highly optimised to avoid this ping-pong between your GPU and your CPU.
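One way to keep that traffic down is to move the data to the GPU once, do all the arithmetic there with whole-array operations, and copy the result back once at the end. The sketch below assumes Parallel Computing Toolbox and a supported GPU; the useGpu flag and the CPU fallback are only there so the script still runs on a machine without one.

```matlab
% Sketch: one transfer in, whole-array work on the device, one transfer out.
useGpu = false;
try
    useGpu = gpuDeviceCount > 0;
catch
    % gpuDeviceCount is unavailable without Parallel Computing Toolbox,
    % so stay on the CPU in that case.
end

data = ones(1000, 1000);
if useGpu
    data = gpuArray(data);      % single host-to-device transfer
end

for k = 1:100
    data = data + 2;            % whole-array operation, no row indexing
end

if useGpu
    data = gather(data);        % single device-to-host transfer
end

disp(data(1, 1))
```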
You also might want to consider larger problems. For example, the data in my script is 1000x1000, which is a reasonable size to start thinking about GPU acceleration.
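To see where the GPU starts to pay off on your own machine, you could time the same vectorised operation at a few sizes. This is a rough sketch, not a rigorous benchmark: the sizes are arbitrary choices, and the GPU column stays NaN if no GPU (or no Parallel Computing Toolbox) is available.

```matlab
% Sketch: time "data + 200" on the CPU and, if possible, on the GPU.
sizes = [10 100 1000];          % arbitrary sizes chosen for illustration
cpuTimes = zeros(size(sizes));
gpuTimes = nan(size(sizes));    % stays NaN when no GPU can be used

useGpu = false;
try
    useGpu = gpuDeviceCount > 0;
catch
    % No Parallel Computing Toolbox: only the CPU timings are collected.
end

for n = 1:numel(sizes)
    cpuData = ones(sizes(n), sizes(n));
    cpuTimes(n) = timeit(@() cpuData + 200);
    if useGpu
        gpuData = gpuArray(cpuData);
        gpuTimes(n) = gputimeit(@() gpuData + 200);
    end
end

disp(table(sizes(:), cpuTimes(:), gpuTimes(:), ...
    'VariableNames', {'numRows', 'cpuSeconds', 'gpuSeconds'}))
```

For very small arrays timeit may warn that the function is too fast to measure accurately, which is itself a hint that the problem is too small to benefit from a GPU.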
Putting it all together, I would apply these two golden rules to your Monte Carlo code:
- Reason in matrix and vector operations, not for loops. Vectorise, vectorise, vectorise.
- Think about your overheads. Is the extra communication time worth spending? Should you use a parallel pool or a GPU?
Optimising parallel applications can be a difficult problem, but the rewards can be very high!
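As an illustration of the first rule, here is a sketch of a Monte Carlo estimate written without any for loop. I don't know what your simulation computes, so estimating pi and the value of numSamples are purely assumptions for the example.

```matlab
% Sketch: Monte Carlo estimate of pi using only vectorised operations.
numSamples = 1e6;                     % illustrative sample count
rng(0);                               % reproducible CPU random numbers

x = rand(numSamples, 1);
y = rand(numSamples, 1);
% To try the GPU instead (needs Parallel Computing Toolbox), create the
% samples on the device directly:
%   x = rand(numSamples, 1, 'gpuArray');
%   y = rand(numSamples, 1, 'gpuArray');

insideCircle = (x.^2 + y.^2) <= 1;    % one comparison over all samples
piEstimate = 4 * mean(insideCircle);  % fraction of hits, scaled to pi

disp(piEstimate)
```

Because every step operates on the whole sample vector, the same lines work unchanged on gpuArray inputs, which makes it easy to time both versions and decide which one is worth using.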