predict

Class: dlhdl.Workflow
Package: dlhdl

Predict responses by using deployed network

Since R2020b

Description

Y = predict(workflowObject,images) predicts responses for the image data, images, by using the deep learning network specified in the dlhdl.Workflow object, workflowObject.

Y = predict(workflowObject,X1,...,XN) predicts the responses for the data in the numeric or cell arrays X1, …, XN for the multi-input network specified in the Network argument of workflowObject. The input XN corresponds to workflowObject.Network.InputNames(N).

[Y1,...,YM] = predict(___) predicts responses for the M outputs of a multi-output network using any of the previous input arguments. The output YM corresponds to the output of the network specified in workflowObject.Network.OutputNames(M).

[Y,performance] = predict(___,Name,Value) predicts the responses with one or more arguments specified by optional name-value pair arguments.

Input Arguments

workflowObject: Workflow, specified as a dlhdl.Workflow object.

images: Input image, specified as a numeric array, cell array, or formatted dlarray object. Numeric arrays can be 3-D or 4-D. For 4-D arrays, the fourth dimension is the number of input images. If one member of a cell array has four dimensions, then all other members of the cell array must also have four dimensions, and the size of the fourth dimension must be the same for all members.

If the network specified in the dlhdl.Workflow object is a dlnetwork object, then the input image must be a formatted dlarray object. For more information about dlarray formats, see the fmt input argument of dlarray.

Data Types: single | int8
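
As a small illustration of the 4-D layout described for images, the following sketch stacks three placeholder images along the fourth dimension. The 32-by-32-by-3 image size and all variable names are assumptions made for this illustration; they are not taken from the example networks on this page.

```matlab
% Three placeholder 3-D images (height-by-width-by-channels).
% The 32-by-32-by-3 size is an assumption for illustration only.
img1 = zeros(32, 32, 3);
img2 = ones(32, 32, 3);
img3 = 0.5 * ones(32, 32, 3);

% Concatenate along the fourth dimension to form one 4-D batch and
% convert to single, one of the data types listed for images.
batch = single(cat(4, img1, img2, img3));

% The fourth dimension is the number of input images.
numImages = size(batch, 4);
fprintf('Batch size: %s, number of images: %d\n', mat2str(size(batch)), numImages);
```

A batch built this way has the same shape as the testImgBatch variable used in the first example on this page.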

X1,...,XN: Numeric or cell arrays for networks with multiple inputs, specified as a numeric array or cell array.

For multiple inputs to image prediction networks, the format of the predictors must match the formats described in the images argument descriptions.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Profile: Flag to return profiling results for the deep learning network deployed to the target board, specified as "off" or "on".

Example: Profile = "on"

Output Arguments

Y: Predicted responses, returned as a numeric array. The format of Y depends on the type of task.

Task: Format

  • 2-D image regression: h-by-w-by-c-by-N numeric array, where h, w, and c are the height, width, and number of channels of the images, respectively, and N is the number of images

  • 3-D image regression: h-by-w-by-d-by-c-by-N numeric array, where h, w, d, and c are the height, width, depth, and number of channels of the images, respectively, and N is the number of images

  • Sequence-to-one regression: N-by-R matrix, where N is the number of sequences and R is the number of responses

  • Sequence-to-sequence regression: N-by-R matrix, where N is the number of sequences and R is the number of responses

  • Feature regression: N-by-R matrix, where N is the number of observations and R is the number of responses

For sequence-to-sequence regression problems with one observation, images can be a matrix. In this case, Y is a matrix of responses.

If the output layer of the network is a classification layer, then Y is the predicted classification scores. This table describes the format of the scores for classification tasks.

Task: Format

  • Image classification, sequence-to-label classification, and feature classification: N-by-K matrix, where N is the number of observations and K is the number of classes

Y1,...,YM: Predicted scores or responses of networks with multiple outputs, returned as numeric arrays.

Each output Yj corresponds to the network output workflowObject.Network.OutputNames(j) and has the format described in the Y output argument.

performance: Deployed network performance data, returned as an N-by-5 table, where N is the number of layers in the network. This method returns performance only when the Profile name-value argument is set to "on". To learn about the data in the performance table, see Profile Inference Run.

Examples

This example shows how to deploy a custom trained series network to detect pedestrians and bicyclists based on their micro-Doppler signatures. This network is taken from the Pedestrian and Bicyclist Classification Using Deep Learning example from Radar Toolbox. For more details on network training and input data, see Pedestrian and Bicyclist Classification Using Deep Learning (Radar Toolbox).

Prerequisites

  • Xilinx™ Vivado™ Design Suite 2022.1

  • Zynq® UltraScale+™ MPSoC ZCU102 Evaluation Kit

  • HDL Verifier™ Support Package for Xilinx FPGA Boards

  • MATLAB® Coder™ Interface for Deep Learning Libraries

  • Deep Learning Toolbox™

  • Deep Learning HDL Toolbox™

The data files used in this example are:

  • The MAT-file trainedNetBicPed.mat contains a model trained on the training data set trainDataNoCar and its label set trainLabelNoCar.

  • The MAT-file testDataBicPed.mat contains the test data set testDataNoCar and its label set testLabelNoCar.

Load Data and Network

Load a pretrained network. Load test data and its labels.

load('trainedNetBicPed.mat','trainedNetNoCar')
load('testDataBicPed.mat')

View the layers of the pretrained series network.

analyzeNetwork(trainedNetNoCar);

Set up HDL Toolpath

Set up the path to your installed Xilinx™ Vivado™ Design Suite 2022.1 executable if it is not already set up. For example, to set the toolpath, enter:

% hdlsetuptoolpath('ToolName', 'Xilinx Vivado','ToolPath', 'C:\Vivado\2022.1\bin');

Create Target Object

Create a target object for your target device with a vendor name and an interface to connect your target device to the host computer. Interface options are JTAG (default) and Ethernet. Vendor options are Intel or Xilinx. Use the installed Xilinx Vivado Design Suite over an Ethernet connection to program the device.

hT = dlhdl.Target('Xilinx', 'Interface', 'Ethernet');

Create Workflow Object

Create an object of the dlhdl.Workflow class. When you create the object, specify the network and the bitstream name. Specify the saved pretrained series network, trainedNetNoCar, as the network. Make sure the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Zynq UltraScale+ MPSoC ZCU102 board. The bitstream uses the single data type.

hW = dlhdl.Workflow('Network', trainedNetNoCar, 'Bitstream', 'zcu102_single', 'Target', hT);

Compile trainedNetNoCar Series Network

To compile the trainedNetNoCar series network, run the compile function of the dlhdl.Workflow object.

dn = hW.compile;
### Optimizing series network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "28.0 MB"       
    "OutputResultOffset"        "0x01c00000"     "4.0 MB"        
    "SystemBufferOffset"        "0x02000000"     "28.0 MB"       
    "InstructionDataOffset"     "0x03c00000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x04000000"     "4.0 MB"        
    "FCWeightDataOffset"        "0x04400000"     "4.0 MB"        
    "EndOffset"                 "0x04800000"     "Total: 72.0 MB"

Program the Bitstream onto FPGA and Download Network Weights

To deploy the network on the Zynq® UltraScale+™ MPSoC ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. The function also downloads the network weights and biases. The deploy function checks for the Xilinx Vivado tool and the supported tool version. It then starts programming the FPGA device by using the bitstream, and displays progress messages and the time it takes to deploy the network.

hW.deploy;
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Deep learning network programming has been skipped as the same network is already loaded on the target FPGA.

Run Predictions on Micro-Doppler Signatures

Classify one input from the sample test data set by using the predict function of the dlhdl.Workflow object and display the label. The inputs to the network correspond to the sonograms of the micro-Doppler signatures for a pedestrian or a bicyclist or a combination of both.

testImg = single(testDataNoCar(:, :, :, 1));
testLabel = testLabelNoCar(1);
classnames = trainedNetNoCar.Layers(end).Classes;

% Get predictions from network on single test input
score = hW.predict(testImg, 'Profile', 'On')
### Finished writing input activations.
### Running single input activations.


              Deep Learning Processor Profiler Performance Results

                   LastLayerLatency(cycles)   LastLayerLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    9430692                  0.04287                       1            9430707             23.3
    conv_module            9411355                  0.04278 
        conv_1             4178753                  0.01899 
        maxpool_1          1394883                  0.00634 
        conv_2             1975197                  0.00898 
        maxpool_2           706156                  0.00321 
        conv_3              813598                  0.00370 
        maxpool_3           121790                  0.00055 
        conv_4              148165                  0.00067 
        maxpool_4            22255                  0.00010 
        conv_5               41999                  0.00019 
        avgpool2d             8674                  0.00004 
    fc_module                19337                  0.00009 
        fc                   19337                  0.00009 
 * The clock frequency of the DL processor is: 220MHz
score = 1×5 single row vector

    0.9956    0.0000    0.0000    0.0044    0.0000

[~, idx1] = max(score);
predTestLabel = classnames(idx1)
predTestLabel = categorical
     ped 
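
The profiler figures above can be cross-checked with a little arithmetic. This sketch assumes that the seconds column is the cycle count divided by the reported 220 MHz clock, and that Frames/s is the frame count times the clock frequency divided by Total Latency. Both relationships are inferred from the printed numbers, not taken from the toolbox documentation, and the variable names are hypothetical.

```matlab
% Values copied from the single-frame profiler output above.
clockHz       = 220e6;     % reported DL processor clock frequency
networkCycles = 9430692;   % LastLayerLatency(cycles), Network row
totalCycles   = 9430707;   % Total Latency, Network row
numFrames     = 1;         % FramesNum

% Assumed relationship: latency in seconds = cycles / clock frequency.
latencySeconds = networkCycles / clockHz;

% Assumed relationship: throughput = frames * clock frequency / total cycles.
framesPerSecond = numFrames * clockHz / totalCycles;

fprintf('Latency: %.5f s, throughput: %.1f frames/s\n', ...
    latencySeconds, framesPerSecond);
```

Under these assumptions the computed values agree with the 0.04287 s and 23.3 frames/s shown in the Network row.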

Load five random images from the sample test data set and execute the predict function of the dlhdl.Workflow object to display the labels alongside the signatures. The predictions run as a single batch because the inputs are concatenated along the fourth dimension.

numTestFrames = size(testDataNoCar, 4);
numView = 5;
listIndex = randperm(numTestFrames, numView);
testImgBatch = single(testDataNoCar(:, :, :, listIndex));
testLabelBatch = testLabelNoCar(listIndex);

% Get predictions from network using DL HDL Toolbox on FPGA
[scores, speed] = hW.predict(testImgBatch, 'Profile', 'On');
### Finished writing input activations.
### Running single input activations.


              Deep Learning Processor Profiler Performance Results

                   LastLayerLatency(cycles)   LastLayerLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    9446929                  0.04294                       5           47138869             23.3
    conv_module            9427488                  0.04285 
        conv_1             4195175                  0.01907 
        maxpool_1          1394705                  0.00634 
        conv_2             1975204                  0.00898 
        maxpool_2           706332                  0.00321 
        conv_3              813499                  0.00370 
        maxpool_3           121869                  0.00055 
        conv_4              148063                  0.00067 
        maxpool_4            22019                  0.00010 
        conv_5               42053                  0.00019 
        avgpool2d             8684                  0.00004 
    fc_module                19441                  0.00009 
        fc                   19441                  0.00009 
 * The clock frequency of the DL processor is: 220MHz
[~, idx2] = max(scores, [], 2);
predTestLabelBatch = classnames(idx2);

% Display the micro-doppler signatures along with the ground truth and
% predictions.
for k = 1:numView
    index = listIndex(k);
    imagesc(testDataNoCar(:, :, :, index));
    axis xy
    xlabel('Time (s)')
    ylabel('Frequency (Hz)')
    title('Ground Truth: '+string(testLabelNoCar(index))+', Prediction FPGA: '+string(predTestLabelBatch(k)))
    drawnow;
    pause(3);
end

The image shows the micro-Doppler signatures of two bicyclists (bic+bic) which is the ground truth. The ground truth is the classification of the image against which the network prediction is compared. The network prediction retrieved from the FPGA correctly predicts that the image has two bicyclists.
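
To compare a whole batch of predictions with the ground truth numerically rather than by inspecting each plot, the score matrix can be reduced to labels and an accuracy value. The sketch below uses made-up scores and a hypothetical class ordering; in the example, the real values come from hW.predict and trainedNetNoCar.Layers(end).Classes.

```matlab
% Hypothetical class list; the actual order comes from the trained network.
classNames = categorical(["ped" "bic" "ped+ped" "bic+bic" "ped+bic"]);

% Made-up N-by-K score matrix (N = 3 observations, K = 5 classes).
scores = single([0.90 0.05 0.02 0.02 0.01
                 0.10 0.10 0.10 0.60 0.10
                 0.20 0.50 0.10 0.10 0.10]);

% Made-up ground-truth labels, drawn from the same class list.
trueLabels = classNames([1 4 5]);

% Pick the highest-scoring class in each row, as the example does.
[~, idx] = max(scores, [], 2);
predLabels = classNames(idx);

% Fraction of observations where prediction and ground truth agree.
accuracy = mean(predLabels == trueLabels);
fprintf('Batch accuracy: %.2f\n', accuracy);
```

With these made-up inputs, two of the three predictions match, so the accuracy is 2/3.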

This example shows how to use Deep Learning HDL Toolbox™ to deploy a quantized deep convolutional neural network (CNN) to an FPGA. In the example, you use the pretrained ResNet-18 CNN to perform transfer learning and quantization. You then deploy the quantized network and use MATLAB® to retrieve the prediction results.

ResNet-18 has been trained on over a million images and can classify images into 1000 object categories, such as keyboard, coffee mug, pencil, and many animals. The network has learned rich feature representations for a wide range of images. The network takes an image as input and outputs a label for the object in the image together with the probabilities for each of the object categories.

For this example, you need:

  • Deep Learning Toolbox™

  • Deep Learning HDL Toolbox™

  • Deep Learning Toolbox Model for ResNet-18 Network

  • Deep Learning HDL Toolbox™ Support Package for Xilinx® FPGA and SoC Devices

  • Image Processing Toolbox™

  • Deep Learning Toolbox Model Quantization Library

  • MATLAB® Coder™ Interface for Deep Learning

To perform classification on a new set of images, you fine-tune a pretrained ResNet-18 CNN by transfer learning. In transfer learning, you can take a pretrained network and use it as a starting point to learn a new task. Fine-tuning a network with transfer learning is usually much faster and easier than training a network with randomly initialized weights. You can quickly transfer learned features to a new task using a smaller number of training images.

Load Pretrained Network

Load the pretrained ResNet-18 network.

net = resnet18;

View the layers of the pretrained network.

deepNetworkDesigner(net);

The first layer, the image input layer, requires input images of size 224-by-224-by-3, where three is the number of color channels.

inputSize = net.Layers(1).InputSize;

Load Data

This example uses the MathWorks MerchData data set. This is a small data set containing 75 images of MathWorks merchandise, belonging to five different classes (cap, cube, playing cards, screwdriver, and torch).

curDir = pwd;
unzip('MerchData.zip');
imds = imageDatastore('MerchData', ...
'IncludeSubfolders',true, ...
'LabelSource','foldernames');

Specify Training and Validation Sets

Divide the data into training and validation data sets, so that 70% of the images go to the training data set and 30% of the images go to the validation data set. splitEachLabel splits the datastore imds into two new datastores, imdsTrain and imdsValidation.

[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');

Replace Final Layers

To retrain ResNet-18 to classify new images, replace the last fully connected layer and the final classification layer of the network. In ResNet-18, these layers have the names 'fc1000' and 'ClassificationLayer_predictions', respectively. The fully connected layer and classification layer of the pretrained network net are configured for 1000 classes. These two layers contain information on how to combine the features that the network extracts into class probabilities and predicted labels, so they must be fine-tuned for the new classification problem. Convert the network to a layer graph, then replace these two layers with new layers adapted to the new data set.

lgraph = layerGraph(net)
lgraph = 
  LayerGraph with properties:

     InputNames: {'data'}
    OutputNames: {'ClassificationLayer_predictions'}
         Layers: [71×1 nnet.cnn.layer.Layer]
    Connections: [78×2 table]

numClasses = numel(categories(imdsTrain.Labels))
numClasses = 5
newLearnableLayer = fullyConnectedLayer(numClasses, ...
'Name','new_fc', ...
'WeightLearnRateFactor',10, ...
'BiasLearnRateFactor',10);
lgraph = replaceLayer(lgraph,'fc1000',newLearnableLayer);
newClassLayer = classificationLayer('Name','new_classoutput');
lgraph = replaceLayer(lgraph,'ClassificationLayer_predictions',newClassLayer);

Prepare Data for Training

The network requires input images of size 224-by-224-by-3, but the images in the image datastores have different sizes. Use an augmented image datastore to automatically resize the training images. Specify additional augmentation operations to perform on the training images, such as randomly flipping the training images along the vertical axis and randomly translating them up to 30 pixels horizontally and vertically. Data augmentation helps prevent the network from overfitting and memorizing the exact details of the training images.

pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
'RandXReflection',true, ...
'RandXTranslation',pixelRange, ...
'RandYTranslation',pixelRange);

To automatically resize the validation images without performing further data augmentation, use an augmented image datastore without specifying any additional preprocessing operations.

augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
'DataAugmentation',imageAugmenter);
augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);

Specify Training Options

Specify the training options. For transfer learning, keep the features from the early layers of the pretrained network (the transferred layer weights). To slow down learning in the transferred layers, set the initial learning rate to a small value. Specify the mini-batch size and validation data. The software validates the network every ValidationFrequency iterations during training.

options = trainingOptions('sgdm', ...
'MiniBatchSize',10, ...
'MaxEpochs',6, ...
'InitialLearnRate',1e-4, ...
'Shuffle','every-epoch', ...
'ValidationData',augimdsValidation, ...
'ValidationFrequency',3, ...
'Verbose',false, ...
'Plots','training-progress');

Train Network

Train the network that consists of the transferred and new layers. By default, trainNetwork uses a GPU if one is available. Using this function on a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For more information, see GPU Computing Requirements (Parallel Computing Toolbox). If a GPU is not available, the network uses a CPU (requires MATLAB Coder Interface for Deep Learning). You can also specify the execution environment by using the ExecutionEnvironment name-value argument of trainingOptions.

netTransfer = trainNetwork(augimdsTrain,lgraph,options);

Quantize Network

Quantize the network using the dlquantizer object. Set the target execution environment to FPGA.

dlquantObj = dlquantizer(netTransfer,'ExecutionEnvironment','FPGA');

Calibrate Quantized Network

Use the calibrate function to exercise the network with sample inputs and collect the range information. The calibrate function collects the dynamic ranges of the weights and biases in the convolution and fully connected layers of the network and the dynamic ranges of the activations in all layers of the network. The function returns the information as a table, in which each row contains range information for a learnable parameter of the quantized network.

calibrate(dlquantObj,augimdsTrain)
ans=95×5 table
       Optimized Layer Name       Network Layer Name    Learnables / Activations    MinValue    MaxValue
    __________________________    __________________    ________________________    ________    ________

    {'conv1_Weights'         }    {'conv1'         }           "Weights"            -0.79143     1.2547 
    {'conv1_Bias'            }    {'conv1'         }           "Bias"               -0.66949    0.67671 
    {'res2a_branch2a_Weights'}    {'res2a_branch2a'}           "Weights"            -0.42074    0.34251 
    {'res2a_branch2a_Bias'   }    {'res2a_branch2a'}           "Bias"                -0.8039     1.2488 
    {'res2a_branch2b_Weights'}    {'res2a_branch2b'}           "Weights"            -0.78524    0.59222 
    {'res2a_branch2b_Bias'   }    {'res2a_branch2b'}           "Bias"                -1.3835     1.7661 
    {'res2b_branch2a_Weights'}    {'res2b_branch2a'}           "Weights"             -0.3174    0.33645 
    {'res2b_branch2a_Bias'   }    {'res2b_branch2a'}           "Bias"                -1.1203     1.5238 
    {'res2b_branch2b_Weights'}    {'res2b_branch2b'}           "Weights"             -1.1915    0.93059 
    {'res2b_branch2b_Bias'   }    {'res2b_branch2b'}           "Bias"               -0.81928     1.2022 
    {'res3a_branch2a_Weights'}    {'res3a_branch2a'}           "Weights"            -0.19735    0.22659 
    {'res3a_branch2a_Bias'   }    {'res3a_branch2a'}           "Bias"               -0.53009    0.69532 
    {'res3a_branch2b_Weights'}    {'res3a_branch2b'}           "Weights"            -0.53557    0.72768 
    {'res3a_branch2b_Bias'   }    {'res3a_branch2b'}           "Bias"               -0.67756     1.1733 
    {'res3a_branch1_Weights' }    {'res3a_branch1' }           "Weights"            -0.63395    0.97791 
    {'res3a_branch1_Bias'    }    {'res3a_branch1' }           "Bias"               -0.95277    0.75618 
      ⋮
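
To give a feel for how a calibrated range can be turned into an 8-bit representation, the sketch below applies a generic power-of-two (binary point) scaling to the conv1 weight range from the table. This is an illustrative scheme chosen for this note. It is not claimed to be the exact algorithm that dlquantizer uses, and all variable names are hypothetical.

```matlab
% Range of the conv1 weights, copied from the calibration table above.
minValue = -0.79143;
maxValue =  1.2547;

% Integer bits needed to cover the largest magnitude (sign bit excluded).
absMax  = max(abs([minValue maxValue]));
intBits = floor(log2(absMax)) + 1;

% int8 has 1 sign bit and 7 magnitude bits; the rest become fraction bits.
fracBits = 7 - intBits;
scale    = 2^fracBits;

% Quantize with rounding and saturation to the int8 range, then map back.
quantize   = @(x) int8(max(min(round(x * scale), 127), -128));
dequantize = @(q) double(q) / scale;

q      = quantize([minValue maxValue]);
approx = dequantize(q);
fprintf('fracBits = %d, q = [%d %d], approx = [%.6f %.6f]\n', ...
    fracBits, q(1), q(2), approx(1), approx(2));
```

With this scheme the range endpoints are represented with an error of less than half a quantization step (1/128 for six fraction bits).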

Define FPGA Board Interface

Define the target FPGA board programming interface by using the dlhdl.Target object. Create a programming interface with a custom name for your target device and an Ethernet interface to connect the target device to the host computer.

hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');

Prepare Network for Deployment

Prepare the network for deployment by creating a dlhdl.Workflow object. Specify the network and bitstream name. Ensure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx® Zynq® UltraScale+™ MPSoC ZCU102 board and the bitstream uses the int8 data type.

hW = dlhdl.Workflow(Network=dlquantObj,Bitstream='zcu102_int8',Target=hTarget);

Compile Network

Run the compile method of the dlhdl.Workflow object to compile the network and generate the instructions, weights, and biases for deployment.

dn = compile(hW,'InputFrameNumberLimit',15)
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_int8.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### The network includes the following layers:
     1   'data'                  Image Input                  224×224×3 images with 'zscore' normalization                          (SW Layer)
     2   'conv1'                 2-D Convolution              64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]     (HW Layer)
     3   'conv1_relu'            ReLU                         ReLU                                                                  (HW Layer)
     4   'pool1'                 2-D Max Pooling              3×3 max pooling with stride [2  2] and padding [1  1  1  1]           (HW Layer)
     5   'res2a_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     6   'res2a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
     7   'res2a_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     8   'res2a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
     9   'res2a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    10   'res2b_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    11   'res2b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    12   'res2b_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    13   'res2b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    14   'res2b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    15   'res3a_branch2a'        2-D Convolution              128 3×3×64 convolutions with stride [2  2] and padding [1  1  1  1]   (HW Layer)
    16   'res3a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    17   'res3a_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    18   'res3a_branch1'         2-D Convolution              128 1×1×64 convolutions with stride [2  2] and padding [0  0  0  0]   (HW Layer)
    19   'res3a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    20   'res3a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    21   'res3b_branch2a'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    22   'res3b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    23   'res3b_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    24   'res3b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    25   'res3b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    26   'res4a_branch2a'        2-D Convolution              256 3×3×128 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    27   'res4a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    28   'res4a_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    29   'res4a_branch1'         2-D Convolution              256 1×1×128 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    30   'res4a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    31   'res4a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    32   'res4b_branch2a'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    33   'res4b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    34   'res4b_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    35   'res4b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    36   'res4b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    37   'res5a_branch2a'        2-D Convolution              512 3×3×256 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    38   'res5a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    39   'res5a_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    40   'res5a_branch1'         2-D Convolution              512 1×1×256 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    41   'res5a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    42   'res5a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    43   'res5b_branch2a'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    44   'res5b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    45   'res5b_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    46   'res5b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    47   'res5b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    48   'pool5'                 2-D Global Average Pooling   2-D global average pooling                                            (HW Layer)
    49   'new_fc'                Fully Connected              5 fully connected layer                                               (HW Layer)
    50   'prob'                  Softmax                      softmax                                                               (SW Layer)
    51   'new_classoutput'       Classification Output        crossentropyex with 'MathWorks Cap' and 4 other classes               (SW Layer)
                                                                                                                                  
### Notice: The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'new_classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
### Compiling layer group: conv1>>pool1 ...
### Compiling layer group: conv1>>pool1 ... complete.
### Compiling layer group: res2a_branch2a>>res2a_branch2b ...
### Compiling layer group: res2a_branch2a>>res2a_branch2b ... complete.
### Compiling layer group: res2b_branch2a>>res2b_branch2b ...
### Compiling layer group: res2b_branch2a>>res2b_branch2b ... complete.
### Compiling layer group: res3a_branch1 ...
### Compiling layer group: res3a_branch1 ... complete.
### Compiling layer group: res3a_branch2a>>res3a_branch2b ...
### Compiling layer group: res3a_branch2a>>res3a_branch2b ... complete.
### Compiling layer group: res3b_branch2a>>res3b_branch2b ...
### Compiling layer group: res3b_branch2a>>res3b_branch2b ... complete.
### Compiling layer group: res4a_branch1 ...
### Compiling layer group: res4a_branch1 ... complete.
### Compiling layer group: res4a_branch2a>>res4a_branch2b ...
### Compiling layer group: res4a_branch2a>>res4a_branch2b ... complete.
### Compiling layer group: res4b_branch2a>>res4b_branch2b ...
### Compiling layer group: res4b_branch2a>>res4b_branch2b ... complete.
### Compiling layer group: res5a_branch1 ...
### Compiling layer group: res5a_branch1 ... complete.
### Compiling layer group: res5a_branch2a>>res5a_branch2b ...
### Compiling layer group: res5a_branch2a>>res5a_branch2b ... complete.
### Compiling layer group: res5b_branch2a>>res5b_branch2b ...
### Compiling layer group: res5b_branch2a>>res5b_branch2b ... complete.
### Compiling layer group: pool5 ...
### Compiling layer group: pool5 ... complete.
### Compiling layer group: new_fc ...
### Compiling layer group: new_fc ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "8.0 MB"        
    "OutputResultOffset"        "0x00800000"     "4.0 MB"        
    "SchedulerDataOffset"       "0x00c00000"     "4.0 MB"        
    "SystemBufferOffset"        "0x01000000"     "28.0 MB"       
    "InstructionDataOffset"     "0x02c00000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x03000000"     "16.0 MB"       
    "FCWeightDataOffset"        "0x04000000"     "4.0 MB"        
    "EndOffset"                 "0x04400000"     "Total: 68.0 MB"

### Network compilation complete.
dn = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]
        constantData: {}
             ddrInfo: [1×1 struct]
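
The external memory buffer table in the compile log above can be read as a running sum: each offset_address appears to equal the sum of the allocated_space values of the buffers before it. The following is a minimal sketch that checks this reading against the numbers in the table. It assumes the buffers are packed back to back with 1 MB = 2^20 bytes. The variable names are illustrative and are not part of the dlhdl API.

```matlab
% Buffer names and allocated sizes (MB), copied from the table above.
names = {'InputDataOffset','OutputResultOffset','SchedulerDataOffset', ...
         'SystemBufferOffset','InstructionDataOffset', ...
         'ConvWeightDataOffset','FCWeightDataOffset','EndOffset'};
sizesMB = [8 4 4 28 4 16 4];

% Assumption: each buffer starts where the previous one ends.
offsetsBytes = [0 cumsum(sizesMB)] * 2^20;

% Format each offset as an 8-digit hexadecimal address.
hexOffsets = arrayfun(@(x) sprintf('0x%08x', x), offsetsBytes, ...
                      'UniformOutput', false);

for k = 1:numel(names)
    fprintf('%-24s %s\n', names{k}, hexOffsets{k});
end
fprintf('Total: %.1f MB\n', sum(sizesMB));
```

Under these assumptions, the computed addresses match the offset_address column, and the sum of the sizes matches the reported total of 68.0 MB.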

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Xilinx ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function starts programming the FPGA device and displays progress messages and the time it takes to deploy the network.

deploy(hW)
### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101...
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_int8.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_int8.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Rebooting Xilinx SoC at 192.168.1.101...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 21-Dec-2022 10:45:19
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 21-Dec-2022 10:45:19

Test Network

Load the example image.

imgFile = fullfile(pwd,'MerchData','MathWorks Cube','Mathworks cube_0.jpg');
inputImg = imresize(imread(imgFile),[224 224]);
imshow(inputImg)

Classify the image on the FPGA by using the predict method of the dlhdl.Workflow object and display the results.

[prediction,speed] = predict(hW,single(inputImg),'Profile','on');
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    7392114                  0.02957                       1            7394677             33.8
    conv1                  1115165                  0.00446 
    pool1                   199164                  0.00080 
    res2a_branch2a          270125                  0.00108 
    res2a_branch2b          269946                  0.00108 
    res2a                   102255                  0.00041 
    res2b_branch2a          269792                  0.00108 
    res2b_branch2b          269902                  0.00108 
    res2b                   102695                  0.00041 
    res3a_branch1           155120                  0.00062 
    res3a_branch2a          156480                  0.00063 
    res3a_branch2b          244913                  0.00098 
    res3a                    51456                  0.00021 
    res3b_branch2a          245366                  0.00098 
    res3b_branch2b          245123                  0.00098 
    res3b                    51286                  0.00021 
    res4a_branch1           135535                  0.00054 
    res4a_branch2a          136117                  0.00054 
    res4a_branch2b          238454                  0.00095 
    res4a                    25602                  0.00010 
    res4b_branch2a          237909                  0.00095 
    res4b_branch2b          238282                  0.00095 
    res4b                    26742                  0.00011 
    res5a_branch1           324642                  0.00130 
    res5a_branch2a          325897                  0.00130 
    res5a_branch2b          623521                  0.00249 
    res5a                    13881                  0.00006 
    res5b_branch2a          624028                  0.00250 
    res5b_branch2b          624631                  0.00250 
    res5b                    13051                  0.00005 
    pool5                    37083                  0.00015 
    new_fc                   17764                  0.00007 
 * The clock frequency of the DL processor is: 250MHz
[val,idx] = max(prediction);
dlquantObj.NetworkObject.Layers(end).ClassNames{idx}
ans = 
'MathWorks Cube'
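
The latency and throughput columns of the profiler table follow from the cycle counts and the reported clock frequency. The following is a minimal sketch of that arithmetic, using the 'Network' and 'conv1' rows and the 250 MHz clock from the profiler output above. The variable names are illustrative, and the sketch assumes that the frame rate is the reciprocal of the last-frame latency.

```matlab
clockHz         = 250e6;     % DL processor clock, from the profiler footnote
lastFrameCycles = 7392114;   % LastFrameLatency(cycles), 'Network' row
conv1Cycles     = 1115165;   % LastFrameLatency(cycles), 'conv1' row

% Convert cycles to seconds, and seconds per frame to frames per second.
latencySeconds  = lastFrameCycles / clockHz;
framesPerSecond = clockHz / lastFrameCycles;

% Fraction of the frame latency spent in the first convolution layer.
conv1Share = conv1Cycles / lastFrameCycles;

fprintf('Latency: %.5f s\n', latencySeconds);
fprintf('Throughput: %.1f frames/s\n', framesPerSecond);
fprintf('conv1 share of latency: %.1f%%\n', 100 * conv1Share);
```

The computed latency and throughput agree with the 0.02957 seconds and 33.8 frames per second shown in the table, to the precision displayed there.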

Performance Comparison

Compare the performance of the quantized network to the performance of the single data type network.

optionsFPGA = dlquantizationOptions('Bitstream','zcu102_int8','Target',hTarget);
predictionFPGA = validate(dlquantObj,imdsValidation,optionsFPGA)
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_int8.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### The network includes the following layers:
     1   'data'                  Image Input                  224×224×3 images with 'zscore' normalization                          (SW Layer)
     2   'conv1'                 2-D Convolution              64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]     (HW Layer)
     3   'conv1_relu'            ReLU                         ReLU                                                                  (HW Layer)
     4   'pool1'                 2-D Max Pooling              3×3 max pooling with stride [2  2] and padding [1  1  1  1]           (HW Layer)
     5   'res2a_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     6   'res2a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
     7   'res2a_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     8   'res2a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
     9   'res2a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    10   'res2b_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    11   'res2b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    12   'res2b_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    13   'res2b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    14   'res2b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    15   'res3a_branch2a'        2-D Convolution              128 3×3×64 convolutions with stride [2  2] and padding [1  1  1  1]   (HW Layer)
    16   'res3a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    17   'res3a_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    18   'res3a_branch1'         2-D Convolution              128 1×1×64 convolutions with stride [2  2] and padding [0  0  0  0]   (HW Layer)
    19   'res3a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    20   'res3a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    21   'res3b_branch2a'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    22   'res3b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    23   'res3b_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    24   'res3b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    25   'res3b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    26   'res4a_branch2a'        2-D Convolution              256 3×3×128 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    27   'res4a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    28   'res4a_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    29   'res4a_branch1'         2-D Convolution              256 1×1×128 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    30   'res4a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    31   'res4a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    32   'res4b_branch2a'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    33   'res4b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    34   'res4b_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    35   'res4b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    36   'res4b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    37   'res5a_branch2a'        2-D Convolution              512 3×3×256 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    38   'res5a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    39   'res5a_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    40   'res5a_branch1'         2-D Convolution              512 1×1×256 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    41   'res5a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    42   'res5a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    43   'res5b_branch2a'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    44   'res5b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    45   'res5b_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    46   'res5b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    47   'res5b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    48   'pool5'                 2-D Global Average Pooling   2-D global average pooling                                            (HW Layer)
    49   'new_fc'                Fully Connected              5 fully connected layer                                               (HW Layer)
    50   'prob'                  Softmax                      softmax                                                               (SW Layer)
    51   'new_classoutput'       Classification Output        crossentropyex with 'MathWorks Cap' and 4 other classes               (SW Layer)
                                                                                                                                  
### Notice: The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'new_classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.
### Compiling layer group: conv1>>pool1 ...
### Compiling layer group: conv1>>pool1 ... complete.
### Compiling layer group: res2a_branch2a>>res2a_branch2b ...
### Compiling layer group: res2a_branch2a>>res2a_branch2b ... complete.
### Compiling layer group: res2b_branch2a>>res2b_branch2b ...
### Compiling layer group: res2b_branch2a>>res2b_branch2b ... complete.
### Compiling layer group: res3a_branch1 ...
### Compiling layer group: res3a_branch1 ... complete.
### Compiling layer group: res3a_branch2a>>res3a_branch2b ...
### Compiling layer group: res3a_branch2a>>res3a_branch2b ... complete.
### Compiling layer group: res3b_branch2a>>res3b_branch2b ...
### Compiling layer group: res3b_branch2a>>res3b_branch2b ... complete.
### Compiling layer group: res4a_branch1 ...
### Compiling layer group: res4a_branch1 ... complete.
### Compiling layer group: res4a_branch2a>>res4a_branch2b ...
### Compiling layer group: res4a_branch2a>>res4a_branch2b ... complete.
### Compiling layer group: res4b_branch2a>>res4b_branch2b ...
### Compiling layer group: res4b_branch2a>>res4b_branch2b ... complete.
### Compiling layer group: res5a_branch1 ...
### Compiling layer group: res5a_branch1 ... complete.
### Compiling layer group: res5a_branch2a>>res5a_branch2b ...
### Compiling layer group: res5a_branch2a>>res5a_branch2b ... complete.
### Compiling layer group: res5b_branch2a>>res5b_branch2b ...
### Compiling layer group: res5b_branch2a>>res5b_branch2b ... complete.
### Compiling layer group: pool5 ...
### Compiling layer group: pool5 ... complete.
### Compiling layer group: new_fc ...
### Compiling layer group: new_fc ... complete.

### Allocating external memory buffers:

          offset_name          offset_address    allocated_space 
    _______________________    ______________    ________________

    "InputDataOffset"           "0x00000000"     "12.0 MB"       
    "OutputResultOffset"        "0x00c00000"     "4.0 MB"        
    "SchedulerDataOffset"       "0x01000000"     "4.0 MB"        
    "SystemBufferOffset"        "0x01400000"     "28.0 MB"       
    "InstructionDataOffset"     "0x03000000"     "4.0 MB"        
    "ConvWeightDataOffset"      "0x03400000"     "16.0 MB"       
    "FCWeightDataOffset"        "0x04400000"     "4.0 MB"        
    "EndOffset"                 "0x04800000"     "Total: 72.0 MB"

### Network compilation complete.

### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 21-Dec-2022 10:46:36
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 21-Dec-2022 10:46:36
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Bitstream Build Info

Resource                   Utilized           Total        Percentage
------------------        ----------      ------------    ------------
LUTs (CLB/ALM)*              249703            274080           91.11
DSPs                            391              2520           15.52
Block RAM                       583               912           63.93
* LUT count represents Configurable Logic Block(CLB) utilization in Xilinx devices and Adaptive Logic Module (ALM) utilization in Intel devices.

### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'data' of type 'ImageInputLayer' is split into an image input layer 'data', an addition layer 'data_norm_add', and a multiplication layer 'data_norm' for hardware normalization.
### The network includes the following layers:
     1   'data'                  Image Input                  224×224×3 images with 'zscore' normalization                          (SW Layer)
     2   'conv1'                 2-D Convolution              64 7×7×3 convolutions with stride [2  2] and padding [3  3  3  3]     (HW Layer)
     3   'conv1_relu'            ReLU                         ReLU                                                                  (HW Layer)
     4   'pool1'                 2-D Max Pooling              3×3 max pooling with stride [2  2] and padding [1  1  1  1]           (HW Layer)
     5   'res2a_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     6   'res2a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
     7   'res2a_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
     8   'res2a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
     9   'res2a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    10   'res2b_branch2a'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    11   'res2b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    12   'res2b_branch2b'        2-D Convolution              64 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]    (HW Layer)
    13   'res2b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    14   'res2b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    15   'res3a_branch2a'        2-D Convolution              128 3×3×64 convolutions with stride [2  2] and padding [1  1  1  1]   (HW Layer)
    16   'res3a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    17   'res3a_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    18   'res3a_branch1'         2-D Convolution              128 1×1×64 convolutions with stride [2  2] and padding [0  0  0  0]   (HW Layer)
    19   'res3a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    20   'res3a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    21   'res3b_branch2a'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    22   'res3b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    23   'res3b_branch2b'        2-D Convolution              128 3×3×128 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    24   'res3b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    25   'res3b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    26   'res4a_branch2a'        2-D Convolution              256 3×3×128 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    27   'res4a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    28   'res4a_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    29   'res4a_branch1'         2-D Convolution              256 1×1×128 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    30   'res4a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    31   'res4a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    32   'res4b_branch2a'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    33   'res4b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    34   'res4b_branch2b'        2-D Convolution              256 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    35   'res4b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    36   'res4b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    37   'res5a_branch2a'        2-D Convolution              512 3×3×256 convolutions with stride [2  2] and padding [1  1  1  1]  (HW Layer)
    38   'res5a_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    39   'res5a_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    40   'res5a_branch1'         2-D Convolution              512 1×1×256 convolutions with stride [2  2] and padding [0  0  0  0]  (HW Layer)
    41   'res5a'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    42   'res5a_relu'            ReLU                         ReLU                                                                  (HW Layer)
    43   'res5b_branch2a'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    44   'res5b_branch2a_relu'   ReLU                         ReLU                                                                  (HW Layer)
    45   'res5b_branch2b'        2-D Convolution              512 3×3×512 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    46   'res5b'                 Addition                     Element-wise addition of 2 inputs                                     (HW Layer)
    47   'res5b_relu'            ReLU                         ReLU                                                                  (HW Layer)
    48   'pool5'                 2-D Global Average Pooling   2-D global average pooling                                            (HW Layer)
    49   'new_fc'                Fully Connected              5 fully connected layer                                               (HW Layer)
    50   'prob'                  Softmax                      softmax                                                               (SW Layer)
    51   'new_classoutput'       Classification Output        crossentropyex with 'MathWorks Cap' and 4 other classes               (SW Layer)
                                                                                                                                  
### Notice: The layer 'prob' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'new_classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   23502752                  0.10683                       1           23502752              9.4
    data_norm_add           210750                  0.00096 
    data_norm               210750                  0.00096 
    conv1                  2164124                  0.00984 
    pool1                   515064                  0.00234 
    res2a_branch2a          966221                  0.00439 
    res2a_branch2b          966221                  0.00439 
    res2a                   210750                  0.00096 
    res2b_branch2a          966221                  0.00439 
    res2b_branch2b          966221                  0.00439 
    res2b                   210750                  0.00096 
    res3a_branch1           540861                  0.00246 
    res3a_branch2a          540749                  0.00246 
    res3a_branch2b          919117                  0.00418 
    res3a                   105404                  0.00048 
    res3b_branch2a          919117                  0.00418 
    res3b_branch2b          919117                  0.00418 
    res3b                   105404                  0.00048 
    res4a_branch1           503405                  0.00229 
    res4a_branch2a          509261                  0.00231 
    res4a_branch2b          905421                  0.00412 
    res4a                    52724                  0.00024 
    res4b_branch2a          905421                  0.00412 
    res4b_branch2b          905421                  0.00412 
    res4b                    52724                  0.00024 
    res5a_branch1          1039437                  0.00472 
    res5a_branch2a         1046605                  0.00476 
    res5a_branch2b         2005197                  0.00911 
    res5a                    26368                  0.00012 
    res5b_branch2a         2005197                  0.00911 
    res5b_branch2b         2005197                  0.00911 
    res5b                    26368                  0.00012 
    pool5                    54594                  0.00025 
    new_fc                   22571                  0.00010 
 * The clock frequency of the DL processor is: 220MHz




              Deep Learning Processor Bitstream Build Info

Resource                   Utilized           Total        Percentage
------------------        ----------      ------------    ------------
LUTs (CLB/ALM)*              168099            274080           61.33
DSPs                            807              2520           32.02
Block RAM                       453               912           49.67
* LUT count represents Configurable Logic Block(CLB) utilization in Xilinx devices and Adaptive Logic Module (ALM) utilization in Intel devices.

### Finished writing input activations.
### Running single input activation.
predictionFPGA = struct with fields:
       NumSamples: 20
    MetricResults: [1×1 struct]
       Statistics: [2×7 table]

View the frames per second performance for the quantized network and the single-data-type network. The quantized network has a performance of 33.8 frames per second compared to 9.4 frames per second for the single-data-type network. You can use quantization to improve your frames per second performance; however, you could lose accuracy when you quantize your networks.

predictionFPGA.Statistics.FramesPerSecond
ans = 2×1

    9.3606
   33.7719
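
To put the two values in context, the following minimal sketch computes the ratio between them. It also checks the observation that the first value appears to match the estimator result for the single-data-type bitstream (23502752 cycles at 220 MHz in the estimator table above). The numbers are copied from the outputs in this example, and the variable names are illustrative.

```matlab
fpsSingle = 9.3606;    % first row of Statistics.FramesPerSecond
fpsInt8   = 33.7719;   % second row of Statistics.FramesPerSecond

% Throughput ratio of the quantized network to the single network.
speedup = fpsInt8 / fpsSingle;

% Estimator 'Network' row: cycles per frame and clock frequency.
estimatorCycles = 23502752;
estimatorClock  = 220e6;
estimatorFps    = estimatorClock / estimatorCycles;

fprintf('Speedup from quantization: %.2fx\n', speedup);
fprintf('Estimator frames/s: %.4f\n', estimatorFps);
```

The ratio comes out to about 3.6, which reflects throughput only. The accuracy of the two networks should be compared separately, for example by using the MetricResults field of the validate output.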

This example shows how to deploy a trained you only look once (YOLO) v3 object detector to a target FPGA board. You then use MATLAB® to retrieve the object classification from the FPGA board.

Compared to YOLO v2 networks, YOLO v3 networks have additional detection heads that help detect smaller objects.

Create YOLO v3 Detector Object

In this example, you use a pretrained YOLO v3 object detector. To construct and train a custom YOLO v3 detector, see Object Detection Using YOLO v3 Deep Learning (Computer Vision Toolbox).

Use the downloadPretrainedYOLOv3Detector function to generate a dlnetwork object. To get the code for this function, see the downloadPretrainedYOLOv3Detector Function section.

preTrainedDetector = downloadPretrainedYOLOv3Detector;
Downloaded pretrained detector

The generated network uses training data to estimate the anchor boxes, which help the detector learn to predict the boxes. For more information about anchor boxes, see Anchor Boxes for Object Detection (Computer Vision Toolbox). The downloadPretrainedYOLOv3Detector function creates the YOLO v3 network used in this example.

Load the Pretrained Network

Extract the network from the pretrained YOLO v3 detector object.

yolov3Detector = preTrainedDetector;
net = yolov3Detector.Network;

Extract the attributes of the network as variables.

anchorBoxes = yolov3Detector.AnchorBoxes;
outputNames = yolov3Detector.Network.OutputNames;
inputSize = yolov3Detector.InputSize;
classNames = yolov3Detector.ClassNames;

Use the analyzeNetwork function to obtain information about the network layers. The function returns a graphical representation of the network that contains detailed parameter information for every layer in the network.

analyzeNetwork(net);
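
The spatial size of each detection head can also be worked out by hand from the layer parameters that the compile log in this example lists. The sketch below is only arithmetic on those printed values. It assumes that padding is given as [top bottom left right], that the fire modules and the 'same'-padded convolutions preserve spatial size, and that featureResize2 upsamples by a factor of 2 (inferred from the depth concatenation, not stated in the log).

```matlab
% Output size of a convolution or pooling layer along one spatial dimension
% in: input size, k: filter size, stride: stride, padTotal: padding summed over both sides
outSize = @(in,k,stride,padTotal) floor((in + padTotal - k)/stride) + 1;

s = 227;                   % 'data'  : 227x227x3 input
s = outSize(s,3,2,0);      % 'conv1' : 3x3, stride 2, padding [0 0 0 0]  -> 113
s = outSize(s,3,2,0);      % 'pool1' : 3x3, stride 2, padding [0 0 0 0]  -> 56
s = outSize(s,3,2,1);      % 'pool3' : 3x3, stride 2, padding [0 1 0 1]  -> 28
sizeAfterPool3 = s;
s = outSize(s,3,2,1);      % 'pool5' : 3x3, stride 2, padding [0 1 0 1]  -> 14
head1 = s;                 % grid size at 'customOutputConv1'
head2 = 2*head1;           % assumed 2x upsampling in 'featureResize2'

fprintf('Head 1 grid: %dx%d, head 2 grid: %dx%d\n', head1, head1, head2, head2);
```

Under these assumptions the second head works on a grid with the same size as the feature map after pool3, which is consistent with the statement that the additional head helps detect smaller objects.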

Define FPGA Board Interface

Define the target FPGA board programming interface by using the dlhdl.Target object. Create a programming interface with a custom name for your target device and an Ethernet interface to connect the target device to the host computer.

hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');

Prepare Network for Deployment

Prepare the network for deployment by creating a dlhdl.Workflow object. Specify the network and bitstream name. Ensure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx® Zynq® UltraScale+™ MPSoC ZCU102 board and the bitstream uses the single data type.

hW = dlhdl.Workflow('Network',net,'Bitstream','zcu102_single','Target',hTarget);

Compile Network

Run the compile method of the dlhdl.Workflow object to compile the network and generate the instructions, weights, and biases for deployment.

dn = compile(hW);
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### An output layer called 'Output1_customOutputConv1' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### An output layer called 'Output2_customOutputConv2' of type 'nnet.cnn.layer.RegressionOutputLayer' has been added to the provided network. This layer performs no operation during prediction and thus does not affect the output of the network.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### The network includes the following layers:
     1   'data'                        Image Input                    227×227×3 images                                                     (SW Layer)
     2   'conv1'                       2-D Convolution                64 3×3×3 convolutions with stride [2  2] and padding [0  0  0  0]    (HW Layer)
     3   'relu_conv1'                  ReLU                           ReLU                                                                 (HW Layer)
     4   'pool1'                       2-D Max Pooling                3×3 max pooling with stride [2  2] and padding [0  0  0  0]          (HW Layer)
     5   'fire2-squeeze1x1'            2-D Convolution                16 1×1×64 convolutions with stride [1  1] and padding [0  0  0  0]   (HW Layer)
     6   'fire2-relu_squeeze1x1'       ReLU                           ReLU                                                                 (HW Layer)
     7   'fire2-expand1x1'             2-D Convolution                64 1×1×16 convolutions with stride [1  1] and padding [0  0  0  0]   (HW Layer)
     8   'fire2-relu_expand1x1'        ReLU                           ReLU                                                                 (HW Layer)
     9   'fire2-expand3x3'             2-D Convolution                64 3×3×16 convolutions with stride [1  1] and padding [1  1  1  1]   (HW Layer)
    10   'fire2-relu_expand3x3'        ReLU                           ReLU                                                                 (HW Layer)
    11   'fire2-concat'                Depth concatenation            Depth concatenation of 2 inputs                                      (HW Layer)
    12   'fire3-squeeze1x1'            2-D Convolution                16 1×1×128 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    13   'fire3-relu_squeeze1x1'       ReLU                           ReLU                                                                 (HW Layer)
    14   'fire3-expand1x1'             2-D Convolution                64 1×1×16 convolutions with stride [1  1] and padding [0  0  0  0]   (HW Layer)
    15   'fire3-relu_expand1x1'        ReLU                           ReLU                                                                 (HW Layer)
    16   'fire3-expand3x3'             2-D Convolution                64 3×3×16 convolutions with stride [1  1] and padding [1  1  1  1]   (HW Layer)
    17   'fire3-relu_expand3x3'        ReLU                           ReLU                                                                 (HW Layer)
    18   'fire3-concat'                Depth concatenation            Depth concatenation of 2 inputs                                      (HW Layer)
    19   'pool3'                       2-D Max Pooling                3×3 max pooling with stride [2  2] and padding [0  1  0  1]          (HW Layer)
    20   'fire4-squeeze1x1'            2-D Convolution                32 1×1×128 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    21   'fire4-relu_squeeze1x1'       ReLU                           ReLU                                                                 (HW Layer)
    22   'fire4-expand1x1'             2-D Convolution                128 1×1×32 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    23   'fire4-relu_expand1x1'        ReLU                           ReLU                                                                 (HW Layer)
    24   'fire4-expand3x3'             2-D Convolution                128 3×3×32 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    25   'fire4-relu_expand3x3'        ReLU                           ReLU                                                                 (HW Layer)
    26   'fire4-concat'                Depth concatenation            Depth concatenation of 2 inputs                                      (HW Layer)
    27   'fire5-squeeze1x1'            2-D Convolution                32 1×1×256 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    28   'fire5-relu_squeeze1x1'       ReLU                           ReLU                                                                 (HW Layer)
    29   'fire5-expand1x1'             2-D Convolution                128 1×1×32 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    30   'fire5-relu_expand1x1'        ReLU                           ReLU                                                                 (HW Layer)
    31   'fire5-expand3x3'             2-D Convolution                128 3×3×32 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    32   'fire5-relu_expand3x3'        ReLU                           ReLU                                                                 (HW Layer)
    33   'fire5-concat'                Depth concatenation            Depth concatenation of 2 inputs                                      (HW Layer)
    34   'pool5'                       2-D Max Pooling                3×3 max pooling with stride [2  2] and padding [0  1  0  1]          (HW Layer)
    35   'fire6-squeeze1x1'            2-D Convolution                48 1×1×256 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    36   'fire6-relu_squeeze1x1'       ReLU                           ReLU                                                                 (HW Layer)
    37   'fire6-expand1x1'             2-D Convolution                192 1×1×48 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    38   'fire6-relu_expand1x1'        ReLU                           ReLU                                                                 (HW Layer)
    39   'fire6-expand3x3'             2-D Convolution                192 3×3×48 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    40   'fire6-relu_expand3x3'        ReLU                           ReLU                                                                 (HW Layer)
    41   'fire6-concat'                Depth concatenation            Depth concatenation of 2 inputs                                      (HW Layer)
    42   'fire7-squeeze1x1'            2-D Convolution                48 1×1×384 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    43   'fire7-relu_squeeze1x1'       ReLU                           ReLU                                                                 (HW Layer)
    44   'fire7-expand1x1'             2-D Convolution                192 1×1×48 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    45   'fire7-relu_expand1x1'        ReLU                           ReLU                                                                 (HW Layer)
    46   'fire7-expand3x3'             2-D Convolution                192 3×3×48 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    47   'fire7-relu_expand3x3'        ReLU                           ReLU                                                                 (HW Layer)
    48   'fire7-concat'                Depth concatenation            Depth concatenation of 2 inputs                                      (HW Layer)
    49   'fire8-squeeze1x1'            2-D Convolution                64 1×1×384 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    50   'fire8-relu_squeeze1x1'       ReLU                           ReLU                                                                 (HW Layer)
    51   'fire8-expand1x1'             2-D Convolution                256 1×1×64 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    52   'fire8-relu_expand1x1'        ReLU                           ReLU                                                                 (HW Layer)
    53   'fire8-expand3x3'             2-D Convolution                256 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    54   'fire8-relu_expand3x3'        ReLU                           ReLU                                                                 (HW Layer)
    55   'fire8-concat'                Depth concatenation            Depth concatenation of 2 inputs                                      (HW Layer)
    56   'fire9-squeeze1x1'            2-D Convolution                64 1×1×512 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    57   'fire9-relu_squeeze1x1'       ReLU                           ReLU                                                                 (HW Layer)
    58   'fire9-expand1x1'             2-D Convolution                256 1×1×64 convolutions with stride [1  1] and padding [0  0  0  0]  (HW Layer)
    59   'fire9-relu_expand1x1'        ReLU                           ReLU                                                                 (HW Layer)
    60   'fire9-expand3x3'             2-D Convolution                256 3×3×64 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    61   'fire9-relu_expand3x3'        ReLU                           ReLU                                                                 (HW Layer)
    62   'fire9-concat'                Depth concatenation            Depth concatenation of 2 inputs                                      (HW Layer)
    63   'customConv1'                 2-D Convolution                1024 3×3×512 convolutions with stride [1  1] and padding 'same'      (HW Layer)
    64   'customRelu1'                 ReLU                           ReLU                                                                 (HW Layer)
    65   'customOutputConv1'           2-D Convolution                18 1×1×1024 convolutions with stride [1  1] and padding 'same'       (HW Layer)
    66   'featureConv2'                2-D Convolution                128 1×1×512 convolutions with stride [1  1] and padding 'same'       (HW Layer)
    67   'featureRelu2'                ReLU                           ReLU                                                                 (HW Layer)
    68   'Output1_customOutputConv1'   Regression Output              mean-squared-error                                                   (SW Layer)
    69   'featureResize2'              dnnfpga.custom.Resize2DLayer   dnnfpga.custom.Resize2DLayer                                         (HW Layer)
    70   'depthConcat2'                Depth concatenation            Depth concatenation of 2 inputs                                      (HW Layer)
    71   'customConv2'                 2-D Convolution                256 3×3×384 convolutions with stride [1  1] and padding 'same'       (HW Layer)
    72   'customRelu2'                 ReLU                           ReLU                                                                 (HW Layer)
    73   'customOutputConv2'           2-D Convolution                18 1×1×256 convolutions with stride [1  1] and padding 'same'        (HW Layer)
    74   'Output2_customOutputConv2'   Regression Output              mean-squared-error                                                   (SW Layer)
                                                                                                                                         
### Notice: The layer 'data' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'Output1_customOutputConv1' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Notice: The layer 'Output2_customOutputConv2' with type 'nnet.cnn.layer.RegressionOutputLayer' is implemented in software.
### Compiling layer group: conv1>>fire2-relu_squeeze1x1 ...
### Compiling layer group: conv1>>fire2-relu_squeeze1x1 ... complete.
### Compiling layer group: fire2-expand1x1>>fire2-relu_expand1x1 ...
### Compiling layer group: fire2-expand1x1>>fire2-relu_expand1x1 ... complete.
### Compiling layer group: fire2-expand3x3>>fire2-relu_expand3x3 ...
### Compiling layer group: fire2-expand3x3>>fire2-relu_expand3x3 ... complete.
### Compiling layer group: fire3-squeeze1x1>>fire3-relu_squeeze1x1 ...
### Compiling layer group: fire3-squeeze1x1>>fire3-relu_squeeze1x1 ... complete.
### Compiling layer group: fire3-expand1x1>>fire3-relu_expand1x1 ...
### Compiling layer group: fire3-expand1x1>>fire3-relu_expand1x1 ... complete.
### Compiling layer group: fire3-expand3x3>>fire3-relu_expand3x3 ...
### Compiling layer group: fire3-expand3x3>>fire3-relu_expand3x3 ... complete.
### Compiling layer group: pool3>>fire4-relu_squeeze1x1 ...
### Compiling layer group: pool3>>fire4-relu_squeeze1x1 ... complete.
### Compiling layer group: fire4-expand1x1>>fire4-relu_expand1x1 ...
### Compiling layer group: fire4-expand1x1>>fire4-relu_expand1x1 ... complete.
### Compiling layer group: fire4-expand3x3>>fire4-relu_expand3x3 ...
### Compiling layer group: fire4-expand3x3>>fire4-relu_expand3x3 ... complete.
### Compiling layer group: fire5-squeeze1x1>>fire5-relu_squeeze1x1 ...
### Compiling layer group: fire5-squeeze1x1>>fire5-relu_squeeze1x1 ... complete.
### Compiling layer group: fire5-expand1x1>>fire5-relu_expand1x1 ...
### Compiling layer group: fire5-expand1x1>>fire5-relu_expand1x1 ... complete.
### Compiling layer group: fire5-expand3x3>>fire5-relu_expand3x3 ...
### Compiling layer group: fire5-expand3x3>>fire5-relu_expand3x3 ... complete.
### Compiling layer group: pool5>>fire6-relu_squeeze1x1 ...
### Compiling layer group: pool5>>fire6-relu_squeeze1x1 ... complete.
### Compiling layer group: fire6-expand1x1>>fire6-relu_expand1x1 ...
### Compiling layer group: fire6-expand1x1>>fire6-relu_expand1x1 ... complete.
### Compiling layer group: fire6-expand3x3>>fire6-relu_expand3x3 ...
### Compiling layer group: fire6-expand3x3>>fire6-relu_expand3x3 ... complete.
### Compiling layer group: fire7-squeeze1x1>>fire7-relu_squeeze1x1 ...
### Compiling layer group: fire7-squeeze1x1>>fire7-relu_squeeze1x1 ... complete.
### Compiling layer group: fire7-expand1x1>>fire7-relu_expand1x1 ...
### Compiling layer group: fire7-expand1x1>>fire7-relu_expand1x1 ... complete.
### Compiling layer group: fire7-expand3x3>>fire7-relu_expand3x3 ...
### Compiling layer group: fire7-expand3x3>>fire7-relu_expand3x3 ... complete.
### Compiling layer group: fire8-squeeze1x1>>fire8-relu_squeeze1x1 ...
### Compiling layer group: fire8-squeeze1x1>>fire8-relu_squeeze1x1 ... complete.
### Compiling layer group: fire8-expand1x1>>fire8-relu_expand1x1 ...
### Compiling layer group: fire8-expand1x1>>fire8-relu_expand1x1 ... complete.
### Compiling layer group: fire8-expand3x3>>fire8-relu_expand3x3 ...
### Compiling layer group: fire8-expand3x3>>fire8-relu_expand3x3 ... complete.
### Compiling layer group: fire9-squeeze1x1>>fire9-relu_squeeze1x1 ...
### Compiling layer group: fire9-squeeze1x1>>fire9-relu_squeeze1x1 ... complete.
### Compiling layer group: fire9-expand1x1>>fire9-relu_expand1x1 ...
### Compiling layer group: fire9-expand1x1>>fire9-relu_expand1x1 ... complete.
### Compiling layer group: fire9-expand3x3>>fire9-relu_expand3x3 ...
### Compiling layer group: fire9-expand3x3>>fire9-relu_expand3x3 ... complete.
### Compiling layer group: customConv1>>customOutputConv1 ...
### Compiling layer group: customConv1>>customOutputConv1 ... complete.
### Compiling layer group: featureConv2>>featureRelu2 ...
### Compiling layer group: featureConv2>>featureRelu2 ... complete.
### Compiling layer group: customConv2>>customOutputConv2 ...
### Compiling layer group: customConv2>>customOutputConv2 ... complete.

### Allocating external memory buffers:

          offset_name          offset_address     allocated_space 
    _______________________    ______________    _________________

    "InputDataOffset"           "0x00000000"     "24.0 MB"        
    "OutputResultOffset"        "0x01800000"     "4.0 MB"         
    "SchedulerDataOffset"       "0x01c00000"     "4.0 MB"         
    "SystemBufferOffset"        "0x02000000"     "28.0 MB"        
    "InstructionDataOffset"     "0x03c00000"     "8.0 MB"         
    "ConvWeightDataOffset"      "0x04400000"     "104.0 MB"       
    "EndOffset"                 "0x0ac00000"     "Total: 172.0 MB"

### Network compilation complete.
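
The offsets in the memory allocation table are consistent with the allocated sizes: each buffer extends from its own offset to the next one. The following sketch checks that arithmetic using the addresses copied from the table. It does not query the board, and the variable names are illustrative.

```matlab
% Offsets copied from the "Allocating external memory buffers" table (0x prefix dropped)
names   = ["InputData"; "OutputResult"; "SchedulerData"; ...
           "SystemBuffer"; "InstructionData"; "ConvWeightData"];
offsets = ["00000000"; "01800000"; "01c00000"; "02000000"; ...
           "03c00000"; "04400000"; "0ac00000"];   % last entry is EndOffset

addr = hex2dec(offsets);
addr = addr(:);                  % byte addresses as a column vector

sizeMB  = diff(addr)/2^20;       % size of each buffer in MB
totalMB = addr(end)/2^20;        % EndOffset equals the total allocation

for k = 1:numel(names)
    fprintf('%-16s %6.1f MB\n', names(k), sizeMB(k));
end
fprintf('%-16s %6.1f MB\n', 'Total', totalMB);
```

The computed sizes (24, 4, 4, 28, 8, and 104 MB, 172 MB in total) match the allocated_space column.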

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Xilinx® Zynq® UltraScale+ MPSoC ZCU102 hardware, run the deploy method of the dlhdl.Workflow object. This method programs the FPGA board by using the output of the compile method and the programming file, downloads the network weights and biases, and displays progress messages and the time it takes to deploy the network.

deploy(hW);
### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101...
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_single.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_single.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Rebooting Xilinx SoC at 192.168.1.101...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 27-Oct-2022 13:44:50

Test Network

Load the example image and convert the image into a dlarray. Then classify the image on the FPGA by using the predict method of the dlhdl.Workflow object and display the results.

img = imread('vehicle_image.jpg'); 
I = single(rescale(img)); 
I = imresize(I, yolov3Detector.InputSize(1:2)); 
dlX = dlarray(I,'SSC');

Store the output of each detection head of the network in the features variable. Pass features to the post-processing function processYOLOv3Output to combine the multiple outputs and compute the final results. To get the code for this function, see the processYOLOv3Output Function section.
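
The predict call in this step collects all network outputs with the [features{:}] = ... syntax. MATLAB expands the cell array into a comma-separated list of output arguments, so the number of cells determines how many outputs are requested. A minimal sketch of the idiom, using the built-in size function as a stand-in for predict so that it runs without an FPGA:

```matlab
% Preallocate one cell per requested output
out = cell(2,1);

% size returns the row count and column count as two separate outputs;
% the cell expansion requests both and stores one in each cell
[out{:}] = size(zeros(3,5));

disp(out)   % out{1} holds 3 (rows), out{2} holds 5 (columns)
```

In the same way, a features cell array with one cell per entry of net.OutputNames receives one array per detection head.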

features = cell(size(net.OutputNames'));
[features{:}] = hW.predict(dlX, 'Profiler', 'on');
### Finished writing input activations.
### Running single input activation.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   34469645                  0.15668                       1           34473970              6.4
    conv1                   673148                  0.00306 
    pool1                   509022                  0.00231 
    fire2-squeeze1x1        308280                  0.00140 
    fire2-expand1x1         305546                  0.00139 
    fire2-expand3x3         305227                  0.00139 
    fire3-squeeze1x1        628018                  0.00285 
    fire3-expand1x1         305219                  0.00139 
    fire3-expand3x3         305220                  0.00139 
    pool3                   286781                  0.00130 
    fire4-squeeze1x1        264346                  0.00120 
    fire4-expand1x1         264777                  0.00120 
    fire4-expand3x3         264750                  0.00120 
    fire5-squeeze1x1        749166                  0.00341 
    fire5-expand1x1         264800                  0.00120 
    fire5-expand3x3         264880                  0.00120 
    pool5                   219686                  0.00100 
    fire6-squeeze1x1        195193                  0.00089 
    fire6-expand1x1         145091                  0.00066 
    fire6-expand3x3         145075                  0.00066 
    fire7-squeeze1x1        290001                  0.00132 
    fire7-expand1x1         144830                  0.00066 
    fire7-expand3x3         145390                  0.00066 
    fire8-squeeze1x1        369605                  0.00168 
    fire8-expand1x1         245085                  0.00111 
    fire8-expand3x3         245208                  0.00111 
    fire9-squeeze1x1        490784                  0.00223 
    fire9-expand1x1         244864                  0.00111 
    fire9-expand3x3         245458                  0.00112 
    customConv1           17592876                  0.07997 
    customOutputConv1       952889                  0.00433 
    featureConv2            913457                  0.00415 
    featureResize2           57819                  0.00026 
    customConv2            5600648                  0.02546 
    customOutputConv2       526143                  0.00239 
 * The clock frequency of the DL processor is: 220MHz
[bboxes, scores, labels] = processYOLOv3Output(anchorBoxes, inputSize, classNames, features, I);
resultImage = insertObjectAnnotation(I,'rectangle',bboxes,scores);
imshow(resultImage)

The FPGA returns a score prediction of 0.89605 with a bounding box drawn around the object in the image. The FPGA also returns a prediction of vehicle to the labels variable.
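
The profiler figures can be cross-checked by hand: latency in seconds is the cycle count divided by the clock frequency, and the frame rate is the reciprocal of that latency. The sketch below uses values copied from the profiler table. The variable names are illustrative.

```matlab
clockHz         = 220e6;      % DL processor clock reported by the profiler
networkCycles   = 34469645;   % LastFrameLatency(cycles), Network row
customConvCycles = 17592876;  % LastFrameLatency(cycles), customConv1 row

latencySec = networkCycles/clockHz;              % about 0.15668 s
fps        = 1/latencySec;                       % about 6.4 frames per second
share      = 100*customConvCycles/networkCycles; % customConv1 share of the latency, in percent

fprintf('Latency: %.5f s, throughput: %.1f frames/s\n', latencySec, fps);
fprintf('customConv1 accounts for about %.0f%% of the frame latency\n', share);
```

In this run, customConv1 alone takes roughly half of the frame latency, so it is the first layer to examine when looking for ways to improve throughput.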

downloadPretrainedYOLOv3Detector Function

The downloadPretrainedYOLOv3Detector function downloads the pretrained YOLO v3 detector network.

function detector = downloadPretrainedYOLOv3Detector 
if ~exist('yolov3SqueezeNetVehicleExample_21aSPKG.mat', 'file')
    if ~exist('yolov3SqueezeNetVehicleExample_21aSPKG.zip', 'file')
        zipFile = matlab.internal.examples.downloadSupportFile('vision/data', 'yolov3SqueezeNetVehicleExample_21aSPKG.zip');
        copyfile(zipFile);
    end
    unzip('yolov3SqueezeNetVehicleExample_21aSPKG.zip');
end
pretrained = load("yolov3SqueezeNetVehicleExample_21aSPKG.mat");
detector = pretrained.detector;
disp('Downloaded pretrained detector');
end

processYOLOv3Output Function

The processYOLOv3Output function is attached as a helper file in this example's directory. This function converts the feature maps from multiple detection heads to bounding boxes, scores and labels. A code snippet of the function is shown below.

function [bboxes, scores, labels] = processYOLOv3Output(anchorBoxes, inputSize, classNames, features, img)
% This function converts the feature maps from multiple detection heads to bounding boxes, scores and labels
% processYOLOv3Output is C code generatable

% Breaks down the raw output from predict function into Confidence score, X, Y, Width,
% Height and Class probabilities for each output from detection head
predictions = iYolov3Transform(features, anchorBoxes);

% Initialize parameters for post-processing
inputSize2d = inputSize(1:2);
info.PreprocessedImageSize = inputSize2d(1:2);
info.ScaleX = size(img,1)/inputSize2d(1);
info.ScaleY = size(img,2)/inputSize2d(1);
params.MinSize = [1 1];
params.MaxSize = size(img(:,:,1));
params.Threshold = 0.5;
params.FractionDownsampling = 1;
params.DetectionInputWasBatchOfImages = false;
params.NetworkInputSize = inputSize;
params.DetectionPreprocessing = "none";
params.SelectStrongest = 1;
bboxes = [];
scores = [];
labels = [];

% Post-process the predictions to get bounding boxes, scores and labels
[bboxes, scores, labels] = iPostprocessMultipleDetection(anchorBoxes, inputSize, classNames, predictions, info, params);
end

function [bboxes, scores, labels] = iPostprocessMultipleDetection (anchorBoxes, inputSize, classNames, YPredData, info, params)
% Post-process the predictions to get bounding boxes, scores and labels

% YPredData is an (x,8) cell array, where x = number of detection heads
% Information in each column is:
% column 1 -> confidence scores
% column 2 to column 5 -> X offset, Y offset, Width, Height of anchor boxes
% column 6 -> class probabilities
% column 7-8 -> copy of width and height of anchor boxes

% Initialize parameters for post-processing
classes = classNames;
predictions = YPredData;
extractPredictions = cell(size(predictions));
% Extract dlarray data
for i = 1:size(extractPredictions,1)
    for j = 1:size(extractPredictions,2)
        extractPredictions{i,j} = extractdata(predictions{i,j});
    end
end

% Storing the values of columns 2 to 5 of extractPredictions
% Columns 2 to 5 represent information about X-coordinate, Y-coordinate, Width and Height of predicted anchor boxes
extractedCoordinates = cell(size(predictions,1),4);
for i = 1:size(predictions,1)
    for j = 2:5 
        extractedCoordinates{i,j-1} = extractPredictions{i,j};
    end
end

% Convert predictions from grid cell coordinates to box coordinates.
boxCoordinates = anchorBoxGenerator(anchorBoxes, inputSize, classNames, extractedCoordinates, params.NetworkInputSize);
% Replace grid cell coordinates in extractPredictions with box coordinates
for i = 1:size(YPredData,1)
    for j = 2:5 
        extractPredictions{i,j} = single(boxCoordinates{i,j-1});
    end
end

% 1. Convert bboxes from spatial to pixel dimension
% 2. Combine the prediction from different heads.
% 3. Filter detections based on threshold.

% Reshaping the matrices corresponding to confidence scores and  bounding boxes
detections = cell(size(YPredData,1),6);
for i = 1:size(detections,1)
    for j = 1:5
        detections{i,j} = reshapePredictions(extractPredictions{i,j});
    end
end
% Reshaping the matrices corresponding to class probabilities
numClasses = repmat({numel(classes)},[size(detections,1),1]);
for i = 1:size(detections,1)
    detections{i,6} = reshapeClasses(extractPredictions{i,6},numClasses{i,1}); 
end

% cell2mat converts the cell of matrices into one matrix, this combines the
% predictions of all detection heads
detections = cell2mat(detections);

% Getting the most probable class and corresponding index
[classProbs, classIdx] = max(detections(:,6:end),[],2);
detections(:,1) = detections(:,1).*classProbs;
detections(:,6) = classIdx;

% Keep detections whose confidence score is greater than threshold.
detections = detections(detections(:,1) >= params.Threshold,:);

[bboxes, scores, labels] = iPostProcessDetections(detections, classes, info, params);
end

function [bboxes, scores, labels] = iPostProcessDetections(detections, classes, info, params)
% Resizes the anchor boxes, filters anchor boxes based on size, and applies
% NMS to eliminate overlapping anchor boxes
if ~isempty(detections)

    % Obtain bounding boxes and class data for pre-processed image
    scorePred = detections(:,1);
    bboxesTmp = detections(:,2:5);
    classPred = detections(:,6);
    inputImageSize = ones(1,2);
    inputImageSize(2) = info.ScaleX.*info.PreprocessedImageSize(2);
    inputImageSize(1) = info.ScaleY.*info.PreprocessedImageSize(1);
    % Resize boxes to actual image size.
    scale = [inputImageSize(2) inputImageSize(1) inputImageSize(2) inputImageSize(1)];
    bboxPred = bboxesTmp.*scale;
    % Convert x and y position of detections from centre to top-left.
    bboxPred = iConvertCenterToTopLeft(bboxPred);

    % Filter boxes based on MinSize, MaxSize.
    [bboxPred, scorePred, classPred] = filterBBoxes(params.MinSize, params.MaxSize, bboxPred, scorePred, classPred);

    % Apply NMS to eliminate boxes having significant overlap
    if params.SelectStrongest
        [bboxes, scores, classNames] = selectStrongestBboxMulticlass(bboxPred, scorePred, classPred ,...
            'RatioType', 'Union', 'OverlapThreshold', 0.4);
    else
        bboxes = bboxPred;
        scores = scorePred;
        classNames = classPred;
    end

    % Limit width detections
    detectionsWd = min((bboxes(:,1) + bboxes(:,3)),inputImageSize(1,2));
    bboxes(:,3) = detectionsWd(:,1) - bboxes(:,1);

    % Limit height detections
    detectionsHt = min((bboxes(:,2) + bboxes(:,4)),inputImageSize(1,1));
    bboxes(:,4) = detectionsHt(:,1) - bboxes(:,2);
    bboxes(bboxes<1) = 1;

    % Convert classId to classNames.
    labels = categorical(classes,cellstr(classes));
    labels = labels(classNames);

else
    % If detections are empty then bounding boxes, scores and labels should
    % be empty
    bboxes = zeros(0,4,'single');
    scores = zeros(0,1,'single');
    labels = categorical(classes);
end
end

function x = reshapePredictions(pred)
% Reshapes the matrices corresponding to scores, X, Y, Width and Height to
% make them compatible for combining the outputs of different detection
% heads
[h,w,c,n] = size(pred);
x = reshape(pred,h*w*c,1,n);
end

function x = reshapeClasses(pred,numClasses)
% Reshapes the matrices corresponding to the class probabilities, to make it
% compatible for combining the outputs of different detection heads
[h,w,c,n] = size(pred);
numAnchors = c/numClasses;
x = reshape(pred,h*w,numClasses,numAnchors,n);
x = permute(x,[1,3,2,4]);
[h,w,c,n] = size(x);
x = reshape(x,h*w,c,n);
end

function bboxes = iConvertCenterToTopLeft(bboxes)
% Convert x and y position of detections from centre to top-left.
bboxes(:,1) = bboxes(:,1) - bboxes(:,3)/2 + 0.5;
bboxes(:,2) = bboxes(:,2) - bboxes(:,4)/2 + 0.5;
bboxes = floor(bboxes);
bboxes(bboxes<1) = 1;
end

function tiledAnchors = anchorBoxGenerator(anchorBoxes, inputSize, classNames,YPredCell,inputImageSize)
% Convert grid cell coordinates to box coordinates.
% Generate tiled anchor offset.
tiledAnchors = cell(size(YPredCell));
for i = 1:size(YPredCell,1)
    anchors = anchorBoxes{i,:};
    [h,w,~,n] = size(YPredCell{i,1});
    [tiledAnchors{i,2},tiledAnchors{i,1}] = ndgrid(0:h-1,0:w-1,1:size(anchors,1),1:n);
    [~,~,tiledAnchors{i,3}] = ndgrid(0:h-1,0:w-1,anchors(:,2),1:n);
    [~,~,tiledAnchors{i,4}] = ndgrid(0:h-1,0:w-1,anchors(:,1),1:n);
end

for i = 1:size(YPredCell,1)
    [h,w,~,~] = size(YPredCell{i,1});
    tiledAnchors{i,1} = double((tiledAnchors{i,1} + YPredCell{i,1})./w);
    tiledAnchors{i,2} = double((tiledAnchors{i,2} + YPredCell{i,2})./h);
    tiledAnchors{i,3} = double((tiledAnchors{i,3}.*YPredCell{i,3})./inputImageSize(2));
    tiledAnchors{i,4} = double((tiledAnchors{i,4}.*YPredCell{i,4})./inputImageSize(1));
end
end

function predictions = iYolov3Transform(YPredictions, anchorBoxes)
% Breaks down the raw output of the predict function into confidence score,
% X, Y, width, height, and class probabilities for each detection head output

predictions = cell(size(YPredictions,1),size(YPredictions,2) + 2);

for idx = 1:size(YPredictions,1)
    % Get the required info on feature size.
    numChannelsPred = size(YPredictions{idx},3);  % number of channels in a feature map
    numAnchors = size(anchorBoxes{idx},1);        % number of anchor boxes per grid cell
    numPredElemsPerAnchors = numChannelsPred/numAnchors;
    channelsPredIdx = 1:numChannelsPred;
    % Channel indices already assigned to box or confidence outputs
    predictionIdx = [];

    % X positions.
    startIdx = 1;
    endIdx = numChannelsPred;
    stride = numPredElemsPerAnchors;
    predictions{idx,2} = YPredictions{idx}(:,:,startIdx:stride:endIdx,:);
    predictionIdx = [predictionIdx startIdx:stride:endIdx];

    % Y positions.
    startIdx = 2;
    endIdx = numChannelsPred;
    stride = numPredElemsPerAnchors;
    predictions{idx,3} = YPredictions{idx}(:,:,startIdx:stride:endIdx,:);
    predictionIdx = [predictionIdx startIdx:stride:endIdx];

    % Width.
    startIdx = 3;
    endIdx = numChannelsPred;
    stride = numPredElemsPerAnchors;
    predictions{idx,4} = YPredictions{idx}(:,:,startIdx:stride:endIdx,:);
    predictionIdx = [predictionIdx startIdx:stride:endIdx];

    % Height.
    startIdx = 4;
    endIdx = numChannelsPred;
    stride = numPredElemsPerAnchors;
    predictions{idx,5} = YPredictions{idx}(:,:,startIdx:stride:endIdx,:);
    predictionIdx = [predictionIdx startIdx:stride:endIdx];

    % Confidence scores.
    startIdx = 5;
    endIdx = numChannelsPred;
    stride = numPredElemsPerAnchors;
    predictions{idx,1} = YPredictions{idx}(:,:,startIdx:stride:endIdx,:);
    predictionIdx = [predictionIdx startIdx:stride:endIdx];

    % Class probabilities.
    classIdx = setdiff(channelsPredIdx,predictionIdx);
    predictions{idx,6} = YPredictions{idx}(:,:,classIdx,:);
end

% Keep copies of the raw (pre-activation) width and height in columns 7 and 8
for i = 1:size(predictions,1)
    predictions{i,7} = predictions{i,4};
    predictions{i,8} = predictions{i,5};
end

% Apply activation to the predicted cell array
% Apply sigmoid activation to columns 1-3 (Confidence score, X, Y)
for i = 1:size(predictions,1)
    for j = 1:3
        predictions{i,j} = sigmoid(predictions{i,j});
    end
end
% Apply exponentiation to columns 4-5 (Width, Height)
for i = 1:size(predictions,1)
    for j = 4:5
        predictions{i,j} = exp(predictions{i,j});
    end
end
% Apply sigmoid activation to column 6 (Class probabilities)
for i = 1:size(predictions,1)
    for j = 6
        predictions{i,j} = sigmoid(predictions{i,j});
    end
end
end

function [bboxPred, scorePred, classPred] = filterBBoxes(minSize, maxSize, bboxPred, scorePred, classPred)
% Filter boxes based on MinSize, MaxSize
[bboxPred, scorePred, classPred] = filterSmallBBoxes(minSize, bboxPred, scorePred, classPred);
[bboxPred, scorePred, classPred] = filterLargeBBoxes(maxSize, bboxPred, scorePred, classPred);
end

function varargout = filterSmallBBoxes(minSize, varargin)
% Filter boxes based on MinSize
bboxes = varargin{1};
tooSmall = any((bboxes(:,[4 3]) < minSize),2);
for ii = 1:numel(varargin)
    varargout{ii} = varargin{ii}(~tooSmall,:);
end
end

function varargout = filterLargeBBoxes(maxSize, varargin)
% Filter boxes based on MaxSize
bboxes = varargin{1};
tooBig = any((bboxes(:,[4 3]) > maxSize),2);
for ii = 1:numel(varargin)
    varargout{ii} = varargin{ii}(~tooBig,:);
end
end

function m = cell2mat(c)
% Converts the cell of matrices into one matrix by concatenating
% the output corresponding to each feature map

elements = numel(c);
% If number of elements is 0 return an empty array
if elements == 0
    m = [];
    return
end
% If number of elements is 1, return same element as matrix
if elements == 1
    if isnumeric(c{1}) || ischar(c{1}) || islogical(c{1}) || isstruct(c{1})
        m = c{1};
        return
    end
end
% Error out for unsupported cell content
ciscell = iscell(c{1});
cisobj = isobject(c{1});
if cisobj || ciscell
    error('CELL2MAT does not support cell arrays containing cell arrays or objects.');
end
% If the input contains structs, collect the field names of each element into a cell
if isstruct(c{1})
    cfields = cell(elements,1);
    for n = 1:elements
        cfields{n} = fieldnames(c{n});
    end
    if ~isequal(cfields{:})
        error('The field names of each cell array element must be consistent and in consistent order.');
    end
end
% Concatenate 2-D cell arrays; other dimensionalities are not handled
if ndims(c) == 2
    rows = size(c,1);
    cols = size(c,2);
    if (rows < cols)
        % Fewer rows than columns: concatenate the cells in each row
        % horizontally, then stack the resulting rows vertically
        m = cell(rows,1);
        for n = 1:rows
            m{n} = cat(2,c{n,:});
        end
        m = cat(1,m{:});
    else
        % Otherwise, concatenate the cells in each column vertically,
        % then join the resulting columns horizontally
        m = cell(1,cols);
        for n = 1:cols
            m{n} = cat(1,c{:,n});
        end
        m = cat(2,m{:});
    end
    return
end
end
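
The channel indexing used in iYolov3Transform can be checked in isolation. The following is a minimal sketch, assuming the channel layout implied by that helper (X, Y, width, height, and confidence score, followed by the class probabilities, repeated once per anchor box). The sizes numAnchors = 3 and numClasses = 2 are hypothetical values chosen only for illustration; they are not taken from the deployed network.

```matlab
% Hypothetical head dimensions, chosen only for illustration.
numAnchors = 3;                                      % anchor boxes per grid cell
numClasses = 2;                                      % object classes
numPredElemsPerAnchor = 5 + numClasses;              % x, y, w, h, confidence, classes
numChannelsPred = numAnchors*numPredElemsPerAnchor;  % 21 channels in this head

% Strided channel indices, one entry per anchor box.
xIdx    = 1:numPredElemsPerAnchor:numChannelsPred;
yIdx    = 2:numPredElemsPerAnchor:numChannelsPred;
wIdx    = 3:numPredElemsPerAnchor:numChannelsPred;
hIdx    = 4:numPredElemsPerAnchor:numChannelsPred;
confIdx = 5:numPredElemsPerAnchor:numChannelsPred;

% The channels that remain hold the class probabilities.
classIdx = setdiff(1:numChannelsPred,[xIdx yIdx wIdx hIdx confIdx]);

disp(xIdx)
disp(classIdx)
```

With these assumed sizes, xIdx is [1 8 15] and classIdx is [6 7 13 14 20 21], so each anchor box occupies a contiguous group of seven channels and the class probabilities fill the last two channels of each group.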

References

[1] Redmon, Joseph, and Ali Farhadi. “YOLOv3: An Incremental Improvement.” Preprint, submitted April 8, 2018. https://arxiv.org/abs/1804.02767.

Version History

Introduced in R2020b