Can parfor run a series of GPU programs simultaneously?

parfor i = 1:9
% for i = 1:9
    sim_e(i).ellipticity = i/10;
    prop_output_elliptical(i) = GMMNLSE_propagate(fiber, initial_pulse, sim_e(i), gain_rate_eqn);
end
I am running a parfor loop like the one above, where GMMNLSE_propagate is a function that runs on the GPU.
The code crashes with the error report below. Is it OK to run a series of GPU programs using parfor? Thank you for your help.
Starting parallel pool (parpool) using the 'Processes' profile ...
Preserving jobs with IDs: 1 because they contain crash dump files.
You can use 'delete(myCluster.Jobs)' to remove all jobs created with profile Processes. To create 'myCluster' use 'myCluster = parcluster('Processes')'.
Connected to parallel pool with 14 workers.
--------------------------------------------------------------------------------
Access violation detected at 2023-11-02 14:58:13 +0800
--------------------------------------------------------------------------------
Configuration:
Crash Decoding : Disabled - No sandbox or build area path
Crash Mode : continue (default)
Default Encoding : UTF-8
Deployed : false
Graphics Driver : Uninitialized hardware
Graphics card 1 : NVIDIA ( 0x10de ) NVIDIA GeForce RTX 3060 Laptop GPU Version 31.0.15.3713 (2023-8-14)
Interpreter 0 : Executing request: 64657461696C2F44656661756C744D564D2E637070
Java Version : Java 1.8.0_202-b08 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
MATLAB Architecture : win64
MATLAB Entitlement ID : 4205217
MATLAB Root : D:\MATLAB
MATLAB Version : 9.14.0.2337262 (R2023a) Update 5
OpenGL : hardware
Operating System : Microsoft Windows 11 家庭中文版
Process ID : 18824
Processor ID : x86 Family 6 Model 154 Stepping 3, GenuineIntel
Session Key : f41c4c38-99fb-43f5-b9cb-be38129f8c29
Window System : Version 10.0 (Build 22621)
Fault Count: 1
Abnormal termination:
Access violation
Current Thread: 'MCR 0 interpreter thread' id 6000
Register State (from fault):
RAX = 0000000000000000 RBX = 0000022d9b00bc10
RCX = 0000022d9c846d30 RDX = 0000022d97b8c4d0
RSP = 0000005687fef490 RBP = 0000022d9c846d30
RSI = 0000022d0ed29c80 RDI = 0000000000000000
R8 = 0000022d3d219480 R9 = 0000000000000001
R10 = 0000000000000003 R11 = 0000005687fef430
R12 = 0000005687fef8c0 R13 = 0000005687fef928
R14 = 0000005687fef6c0 R15 = 0000022d97b8c4d0
RIP = 00007ffb643e1c51 EFL = 00010246
CS = 0033 FS = 0053 GS = 002b
Stack Trace (from fault):
[ 0] 0x00007ffb643e1c51 C:\Windows\system32\DriverStore\FileRepository\nvlti.inf_amd64_106be4074dc4b9cb\nvcuda64.dll+02104401
[ 1] 0x00007ffb643b58ef C:\Windows\system32\DriverStore\FileRepository\nvlti.inf_amd64_106be4074dc4b9cb\nvcuda64.dll+01923311
[ 2] 0x00007ffb643b5a36 C:\Windows\system32\DriverStore\FileRepository\nvlti.inf_amd64_106be4074dc4b9cb\nvcuda64.dll+01923638
[ 3] 0x00007ffb64288681 C:\Windows\system32\DriverStore\FileRepository\nvlti.inf_amd64_106be4074dc4b9cb\nvcuda64.dll+00689793
[ 4] 0x00007ffb642887d0 C:\Windows\system32\DriverStore\FileRepository\nvlti.inf_amd64_106be4074dc4b9cb\nvcuda64.dll+00690128
[ 5] 0x00007ffb64289359 C:\Windows\system32\DriverStore\FileRepository\nvlti.inf_amd64_106be4074dc4b9cb\nvcuda64.dll+00693081
[ 6] 0x00007ffa88341627 D:\MATLAB\bin\win64\cudart64_110.dll+00136743
--------------------------------------------------------------------------------
Access violation detected at 2023-11-02 14:58:13 +0800
--------------------------------------------------------------------------------
Configuration:
Crash Decoding : Disabled - No sandbox or build area path
Crash Mode : continue (default)
Default Encoding : UTF-8
Deployed : false
Graphics Driver : Uninitialized hardware
Graphics card 1 : NVIDIA ( 0x10de ) NVIDIA GeForce RTX 3060 Laptop GPU Version 31.0.15.3713 (2023-8-14)
Interpreter 0 : Executing request: 64657461696C2F44656661756C744D564D2E637070
Java Version : Java 1.8.0_202-b08 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
MATLAB Architecture : win64
MATLAB Entitlement ID : 4205217
MATLAB Root : D:\MATLAB
MATLAB Version : 9.14.0.2337262 (R2023a) Update 5
OpenGL : hardware
Operating System : Microsoft Windows 11 家庭中文版
Process ID : 12036
Processor ID : x86 Family 6 Model 154 Stepping 3, GenuineIntel
Session Key : 1bc022d5-fec1-4d20-85e6-804c291b35f9
Window System : Version 10.0 (Build 22621)
Fault Count: 1
Abnormal termination:
Access violation
Current Thread: 'MCR 0 interpreter thread' id 1504
Register State (from fault):
RAX = 0000000000000000 RBX = 000001f997c03600
RCX = 000001f997a46fa0 RDX = 000001f9917c6dd0
RSP = 0000000b249ef680 RBP = 000001f997a46fa0
RSI = 000001f90ed6cd10 RDI = 0000000000000000
R8 = 000001f93d278940 R9 = 0000000000000001
R10 = 0000000000000003 R11 = 0000000b249ef620
R12 = 0000000b249efab0 R13 = 0000000b249efb18
R14 = 0000000b249ef8b0 R15 = 000001f9917c6dd0
RIP = 00007ffb643e1c51 EFL = 00010246
CS = 0033 FS = 0053 GS = 002b
Stack Trace (from fault):
[ 0] 0x00007ffb643e1c51 C:\Windows\system32\DriverStore\FileRepository\nvlti.inf_amd64_106be4074dc4b9cb\nvcuda64.dll+02104401
[ 1] 0x00007ffb643b58ef C:\Windows\system32\DriverStore\FileRepository\nvlti.inf_amd64_106be4074dc4b9cb\nvcuda64.dll+01923311
[ 2] 0x00007ffb643b5a36 C:\Windows\system32\DriverStore\FileRepository\nvlti.inf_amd64_106be4074dc4b9cb\nvcuda64.dll+01923638
[ 3] 0x00007ffb64288681 C:\Windows\system32\DriverStore\FileRepository\nvlti.inf_amd64_106be4074dc4b9cb\nvcuda64.dll+00689793
[ 4] 0x00007ffb642887d0 C:\Windows\system32\DriverStore\FileRepository\nvlti.inf_amd64_106be4074dc4b9cb\nvcuda64.dll+00690128
[ 5] 0x00007ffb64289359 C:\Windows\system32\DriverStore\FileRepository\nvlti.inf_amd64_106be4074dc4b9cb\nvcuda64.dll+00693081
[ 6] 0x00007ffa88341627 D:\MATLAB\bin\win64\cudart64_110.dll+00136743
Error using setup_stepping_kernel
Please set the GPU you're going to use by setting "sim.gpuDevice.Index".
Error in GMMNLSE_propagate_with_adaptive (line 213)
sim.cuda_MPA_psi_update] = setup_stepping_kernel(sim,Nt,num_modes);
Error in GMMNLSE_propagate (line 74)
foutput = GMMNLSE_propgation_func(fiber, initial_condition, sim, gain_rate_eqn);
Error in nonlinear_coupling (line 86)
parfor i = 1:9
Warning: 2 worker(s) crashed while executing code in the current parallel pool. MATLAB may attempt to run the code
again on the remaining workers of the pool, unless an spmd block has run. View the crash dump files to determine
what caused the workers to crash.

Accepted Answer

Walter Roberson on 2 Nov 2023
GMMNLSE_propgation_func must currently contain an invocation of a Simulink model. You need to configure it to run with sim() if it does not already do so, and set the sim object parameter gpuDevice.Index to the index of the GPU device. Do not start more such workers than you have GPUs; only one worker at a time can use any given GPU.
Unfortunately, you cannot simply use the parfor index as the GPU index, and tracking which GPUs are available can be a bit of a nuisance.
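The advice above about matching workers to GPUs could look roughly like the following. This is only a sketch, assuming Parallel Computing Toolbox and at least one supported GPU. It relies on the understanding that MATLAB, by default, gives each worker its own GPU when the pool has no more workers than GPUs, so the loop body only reads back the device the worker already holds. The small gpuArray computation is a stand-in workload, not the solver from the question; the commented lines indicate where the GMMNLSE_propagate call would go.

```matlab
% Sketch: size the pool to the number of usable GPUs (one worker per GPU).
numGpus = gpuDeviceCount("available");      % GPUs MATLAB can select on this machine
assert(numGpus > 0, 'No usable GPU found.');

pool = gcp('nocreate');                     % reuse a pool only if it has the right size
if isempty(pool) || pool.NumWorkers ~= numGpus
    delete(pool);                           % no-op when there is no pool
    pool = parpool('Processes', numGpus);
end

ellipticity = (1:9) / 10;                   % same parameter sweep as in the question
usedGpu     = zeros(1, numel(ellipticity)); % which GPU served each iteration
result      = zeros(1, numel(ellipticity));

parfor i = 1:numel(ellipticity)
    dev = gpuDevice;                        % device already selected on this worker
    usedGpu(i) = dev.Index;

    % Stand-in workload (assumption). In the thread this would be roughly:
    %   s = sim_e(i);
    %   s.ellipticity = ellipticity(i);
    %   s.gpuDevice.Index = dev.Index;
    %   prop_output_elliptical(i) = GMMNLSE_propagate(fiber, initial_pulse, s, gain_rate_eqn);
    x = gpuArray(ellipticity(i)) * ones(1000, 1, 'gpuArray');
    result(i) = gather(sum(x));
end

disp(usedGpu);                              % shows how iterations were spread over GPUs
```

On a single-GPU laptop this yields a pool of one worker, so the iterations run one after another; the benefit over a plain for loop is then mostly that the same script scales to a multi-GPU machine without changes.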
  7 Comments
Walter Roberson on 2 Nov 2023
sim.gpuDevice.Index is not something that I can find any reference to at the moment.
伟 on 3 Nov 2023
Hi, Walter Roberson. sim_e is a struct with fields such as the following.
sim_e(i).gpuDevice.Index selects which GPU to use. Since my computer has only one GPU, its value is always 1. sim_e(i).ellipticity is the only field that changes between loop iterations.
Below is where the error occurs, in the subfunction "setup_stepping_kernel". I guess the GPU is already in use by another parfor worker, so it cannot be selected.
% Use the specified GPU
% This needs to run at the beginning; otherwise, the already-stored values
% in GPU will be unavailable in a new GPU if the GPU device is switched.
try
    gpuDevice_Device = gpuDevice(sim.gpuDevice.Index); % use the specified GPU device
catch
    error('Please set the GPU you''re going to use by setting "sim.gpuDevice.Index".');
end
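Because the catch block above replaces whatever gpuDevice actually threw with a fixed message, the real cause of the failure on the worker is hidden. A minimal diagnostic sketch, assuming the same index value of 1 used in this thread, is to attempt the selection separately and print the original exception. The variable names gpuIndex and report are illustrative and not part of the GMMNLSE code.

```matlab
% Sketch: try to select the GPU and report the underlying error, if any.
gpuIndex = 1;   % assumption: same value as sim.gpuDevice.Index in the thread

try
    dev = gpuDevice(gpuIndex);              % select the device (this resets it if reselected)
    report = sprintf('Selected GPU %d (%s), %.2f GB free', ...
        dev.Index, dev.Name, dev.AvailableMemory / 2^30);
catch cause
    % Keep the original identifier and message instead of a fixed text.
    report = sprintf('gpuDevice(%d) failed: [%s] %s', ...
        gpuIndex, cause.identifier, cause.message);
end

disp(report);
```

Running this inside the parfor body (or with parfevalOnAll) would show, per worker, whether the selection fails because of an invalid index, exhausted device memory, or a device left in a bad state by an earlier crash.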


More Answers (1)

Joss Knight on 3 Jan 2024
It looks like you simply have a bug in your CUDAKernel implementation, probably an access to unallocated memory. This puts the device in a bad state for subsequent GPU execution. Try debugging it with the NVIDIA Compute Sanitizer tool.

Release

R2023a
