Cannot train FasterRCNNObjectDetector on single GPU

Question

Angelo Dumitriu on 14 Oct 2018

https://au.mathworks.com/matlabcentral/answers/423938-cannot-train-fasterrcnnobjectdetector-on-single-gpu

Although the training options set 'ExecutionEnvironment' to 'gpu', training appears to run on the CPU rather than the GPU. My drivers are up to date and I am using MATLAB R2018a.

When I specify 'cpu', the Command Window log states "Training on single CPU", but when I set it to 'gpu', no corresponding message is shown. This is the log with 'ExecutionEnvironment' set to 'gpu':

Training a Faster R-CNN Object Detector for the following object classes:
* ROI
Step 1 of 4: Training a Region Proposal Network (RPN).
|========================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |   Accuracy   |     RMSE     |      Rate       |
|========================================================================================|
|       1 |           1 |       00:00:08 |       28.00% |         0.81 |          0.0005 |
|       1 |          50 |       00:01:04 |       64.00% |         0.88 |          0.0005 |
|       2 |         100 |       00:01:58 |       72.00% |         0.90 |          0.0005 |
|       3 |         150 |       00:02:52 |       80.00% |         0.85 |          0.0005 |
|       4 |         200 |       00:03:46 |       80.00% |         0.89 |          0.0001 |
|       5 |         250 |       00:04:40 |       76.00% |         0.87 |          0.0001 |

The code:

count = gpuDeviceCount;
gpu1 = gpuDevice(1);
inputLayer = imageInputLayer([32 32 3]);
filterSize = [3 3];
numFilters = 32;
middleLayers = [
      convolution2dLayer(filterSize, numFilters, 'Padding', 1)   
      reluLayer()
      convolution2dLayer(filterSize, numFilters, 'Padding', 1)  
      reluLayer()
      maxPooling2dLayer(3, 'Stride',2)    
      ];
finalLayers = [
      fullyConnectedLayer(64)
      reluLayer()
      fullyConnectedLayer(width(trainingDatasetLight))
      softmaxLayer()
      classificationLayer()
  ];
layers = [
    inputLayer
    middleLayers
    finalLayers
    ];
optionsStage1 = trainingOptions('sgdm', ...
    'Momentum',0.7, ...
    'MaxEpochs', 15, ...
    'LearnRateSchedule','piecewise',...
    'LearnRateDropFactor',0.2,...
    'LearnRateDropPeriod',3,...
    'MiniBatchSize', 25, ...
    'InitialLearnRate', 5e-4, ...
    'CheckpointPath', tempdir, ...
    'ExecutionEnvironment', 'gpu');
optionsStage2 = trainingOptions('sgdm', ...
    'MaxEpochs', 10, ...
    'MiniBatchSize', 50, ...
    'LearnRateDropFactor',0.2, ...
    'LearnRateDropPeriod',2, ...
    'InitialLearnRate', 1e-3, ...
    'CheckpointPath', tempdir, ...
    'ExecutionEnvironment', 'gpu');
optionsStage3 = trainingOptions('sgdm', ...
    'MaxEpochs', 10, ...
    'MiniBatchSize', 50, ...
    'LearnRateDropFactor',0.2, ...
    'LearnRateDropPeriod',2, ...
    'InitialLearnRate', 1e-3, ...
    'CheckpointPath', tempdir, ...
    'ExecutionEnvironment', 'gpu');
optionsStage4 = trainingOptions('sgdm', ...
    'MaxEpochs', 10, ...
    'MiniBatchSize', 50, ...
    'LearnRateDropFactor',0.2, ...
    'LearnRateDropPeriod',2, ...
    'InitialLearnRate', 1e-3, ...
    'CheckpointPath', tempdir, ...
    'ExecutionEnvironment', 'gpu');
options = [
    optionsStage1
    optionsStage2
    optionsStage3
    optionsStage4
    ];
rng(0);
detector = trainFasterRCNNObjectDetector(trainingDatasetLight, layers, options, ... 
    'NegativeOverlapRange', [0 0.3], ... 
    'PositiveOverlapRange', [0.6 1], ... 
    'BoxPyramidScale', 1.2);

The GPU, as reported by gpuDevice(1):

gpu1 = 
    CUDADevice with properties:
                        Name: 'GeForce 840M'
                       Index: 1
           ComputeCapability: '5.0'
              SupportsDouble: 1
               DriverVersion: 10
              ToolkitVersion: 9
          MaxThreadsPerBlock: 1024
            MaxShmemPerBlock: 49152
          MaxThreadBlockSize: [1024 1024 64]
                 MaxGridSize: [2.1475e+09 65535 65535]
                   SIMDWidth: 32
                 TotalMemory: 4.2950e+09
             AvailableMemory: 3.4248e+09
         MultiprocessorCount: 3
                ClockRateKHz: 1124000
                 ComputeMode: 'Default'
        GPUOverlapsTransfers: 1
      KernelExecutionTimeout: 1
            CanMapHostMemory: 1
             DeviceSupported: 1
              DeviceSelected: 1
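
One way to check whether MATLAB can actually execute work on this device, independently of what the training log prints, is a small test like the one below. It is only a diagnostic sketch: the matrix size and variable names are arbitrary choices for illustration, and it shows that the GPU is usable from MATLAB, not that trainFasterRCNNObjectDetector itself is using it.

```matlab
% Diagnostic sketch: confirm that MATLAB can run a computation on GPU 1.
gpu = gpuDevice(1);                    % select device 1, as in the code above
memBefore = gpu.AvailableMemory;       % free device memory in bytes

A = rand(2000, 'single', 'gpuArray');  % allocate the matrix directly on the GPU
B = A * A;                             % the multiplication runs on the device
wait(gpu);                             % block until the GPU has finished

% B should still live on the GPU, and free device memory should have dropped.
fprintf('Result is a gpuArray: %d\n', isa(B, 'gpuArray'));
fprintf('GPU memory used by this test: %.1f MB\n', ...
    (memBefore - gpu.AvailableMemory) / 2^20);
```

Querying gpu.AvailableMemory in the same way while training is running (for example from a second MATLAB session), or watching GPU utilization in a system monitoring tool, may give a further hint about whether the detector training is touching the device.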