Getting Started with YOLO v4

The you only look once version 4 (YOLO v4) object detection network is a one-stage detector composed of three parts: backbone, neck, and head.

  • The backbone can be a pretrained convolutional neural network such as VGG16 or CSPDarkNet53 trained on COCO or ImageNet data sets. The backbone of the YOLO v4 network acts as the feature extraction network that computes feature maps from the input images.

  • The neck connects the backbone and the head. It is composed of a spatial pyramid pooling (SPP) module and a path aggregation network (PAN). The neck concatenates the feature maps from different layers of the backbone network and sends them as inputs to the head.

  • The head processes the aggregated features and predicts the bounding boxes, objectness scores, and classification scores. The YOLO v4 network uses one-stage object detectors, such as YOLO v3, as detection heads.

YOLO v4 network architecture

The YOLO v4 network uses CSPDarkNet53 as the backbone for extracting features from the input images. The backbone has five residual block modules, and the feature map outputs from the residual block modules are fused at the neck of the YOLO v4 network.

The SPP module in the neck concatenates the max-pooling outputs of the low-resolution feature map to extract the most representative features. The SPP module uses kernels of size 1-by-1, 5-by-5, 9-by-9, and 13-by-13 for the max-pooling operation. The stride value is set to 1. Concatenating the feature maps increases the receptive field of the backbone features and improves the accuracy of the network for detecting small objects. The concatenated feature maps from the SPP module are fused with the high-resolution feature maps by using a PAN. The PAN uses upsampling and downsampling operations to establish bottom-up and top-down paths for combining the low-level and high-level features.
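
The SPP concatenation described above can be sketched with Deep Learning Toolbox layers. This is a minimal illustration, not the exact layers of the shipped network; the input size and layer names are assumptions:

    % Sketch of an SPP block: parallel max-pooling branches with stride 1 and
    % "same" padding, concatenated along the channel dimension. The 1-by-1
    % pooling branch is an identity, so the input connects to the
    % concatenation directly.
    layers = [
        imageInputLayer([19 19 512],Normalization="none",Name="in")
        maxPooling2dLayer(5,Stride=1,Padding="same",Name="pool5")];
    lgraph = layerGraph(layers);
    lgraph = addLayers(lgraph,maxPooling2dLayer(9,Stride=1,Padding="same",Name="pool9"));
    lgraph = addLayers(lgraph,maxPooling2dLayer(13,Stride=1,Padding="same",Name="pool13"));
    lgraph = addLayers(lgraph,depthConcatenationLayer(4,Name="sppConcat"));
    lgraph = connectLayers(lgraph,"in","pool9");
    lgraph = connectLayers(lgraph,"in","pool13");
    lgraph = connectLayers(lgraph,"in","sppConcat/in1");     % identity (1-by-1) branch
    lgraph = connectLayers(lgraph,"pool5","sppConcat/in2");
    lgraph = connectLayers(lgraph,"pool9","sppConcat/in3");
    lgraph = connectLayers(lgraph,"pool13","sppConcat/in4");

Because each pooling branch uses stride 1 with same padding, all branch outputs keep the 19-by-19 spatial size and can be concatenated along the channel dimension.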

The PAN module outputs a set of aggregated feature maps to use for predictions. The YOLO v4 network has three detection heads. Each detection head is a YOLO v3 network that computes the final predictions. The YOLO v4 network outputs feature maps of sizes 19-by-19, 38-by-38, and 76-by-76 to predict the bounding boxes, classification scores, and objectness scores.

The tiny YOLO v4 network is a lightweight version of the YOLO v4 network with fewer network layers. It uses a feature pyramid network as the neck and has two YOLO v3 detection heads. The network outputs feature maps of sizes 13-by-13 and 26-by-26 for computing predictions.

Predict Objects Using YOLO v4

YOLO v4 uses anchor boxes to detect classes of objects in an image. For details about anchor boxes, see Anchor Boxes for Object Detection. Similar to YOLO v3, YOLO v4 predicts these three attributes for each anchor box:

  • Intersection over union (IoU) — Predicts the objectness score of each anchor box.

  • Anchor box offsets — Refines the anchor box position.

  • Class probability — Predicts the class label assigned to each anchor box.
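
To make the per-anchor predictions concrete, assume for illustration that a detection head is assigned 3 anchor boxes and the detector has 80 classes (as in COCO). Each anchor at each grid cell then carries 4 box offsets, 1 objectness score, and 80 class probabilities:

    gridSize   = 19;   % coarsest YOLO v4 head
    numAnchors = 3;    % anchors assigned to this head (example value)
    numClasses = 80;   % COCO classes
    % 4 box offsets + 1 objectness score + numClasses class probabilities
    predPerAnchor = 4 + 1 + numClasses;
    totalPred = gridSize^2 * numAnchors * predPerAnchor;  % 19*19*3*85 = 92055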

The figure shows predefined anchor boxes, represented by dotted lines, at each location in a feature map, and the refined location after applying the offsets. The anchor boxes that have been matched with a class are in color.

Demonstration of anchor boxes

You must specify the predefined anchor boxes, also known as a priori boxes, and the classes while training the network.

Create YOLO v4 Object Detection Network

To programmatically create a YOLO v4 deep learning network, use the yolov4ObjectDetector object. You can create a yolov4ObjectDetector object to detect objects in an image using the pretrained YOLO v4 deep learning networks csp-darknet53-coco and tiny-yolov4-coco. These networks are trained on the COCO data set. csp-darknet53-coco is a YOLO v4 network with three detection heads, and tiny-yolov4-coco is a tiny YOLO v4 network with two detection heads. To download these YOLO v4 pretrained networks, you must install the Computer Vision Toolbox™ Model for YOLO v4 Object Detection support package.
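
For example, after installing the support package, you can create either pretrained detector in one line:

    % Full YOLO v4 network with three detection heads.
    detector = yolov4ObjectDetector("csp-darknet53-coco");

    % Lightweight tiny YOLO v4 network with two detection heads.
    tinyDetector = yolov4ObjectDetector("tiny-yolov4-coco");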

Train and Detect Objects Using YOLO v4 Network

To train a YOLO v4 object detection network on a labeled data set, use the trainYOLOv4ObjectDetector function. You must specify the class names and the predefined anchor boxes for the data set you use to train the network.

The training function returns the trained network as a yolov4ObjectDetector object. You can then use the detect function to detect unknown objects in a test image with the trained YOLO v4 object detector. To learn how to create a YOLO v4 object detector and train for object detection, see the Object Detection Using YOLO v4 Deep Learning example.
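
A typical detection call looks like the following sketch, where the image file name is a placeholder and the confidence threshold is an illustrative value:

    I = imread("testImage.png");   % placeholder test image
    [bboxes,scores,labels] = detect(detector,I,Threshold=0.5);

    % Overlay the detections on the image.
    annotated = insertObjectAnnotation(I,"rectangle",bboxes, ...
        string(labels) + ": " + scores);
    figure, imshow(annotated)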

Specify Anchor Boxes

The shape, size, and number of anchor boxes used for training impact the efficiency and accuracy of the YOLO v4 object detection network. The anchor boxes must closely represent the sizes and aspect ratios of the objects in the training data. The training data must contain both the ground truth images and labels. The size of the training images must be the same as the network input size, and the bounding box labels must correspond to the size of the training images.

You must assign the same number of anchor boxes to each detection head in the YOLO v4 network. The size of the anchor boxes assigned to each detection head must correspond to the size of the feature map output from the detection head. You must assign large anchor boxes to detection heads with lower resolution feature maps and small anchor boxes to detection heads with higher resolution feature maps.

For example, these steps show you how to specify anchor boxes to train a YOLO v4 network that has three detection heads with feature map sizes of 19-by-19, 38-by-38, and 76-by-76, respectively.

  1. Assume that you specify four anchor boxes for each detection head. Then, the total number of anchor boxes that you use for training the network must be twelve. You can use the estimateAnchorBoxes function to automatically estimate the anchor boxes for your specified training data.

    numAnchors = 12;
    anchors = estimateAnchorBoxes(trainingData,numAnchors);

  2. Compute the area of each anchor box and sort the anchor boxes in descending order of area.

    area = anchors(:,1).*anchors(:,2);
    [~,idx] = sort(area,"descend");
    sortedAnchors = anchors(idx,:)

  3. There are three detection heads in the YOLO v4 network, so make three sets of four anchor boxes each.

    anchorBoxes = {sortedAnchors(1:4,:) sortedAnchors(5:8,:) sortedAnchors(9:12,:)};

  4. Create a YOLO v4 object detection network by using the yolov4ObjectDetector function. Specify the classes and the sorted anchor boxes. The function assigns the first set of anchor boxes to the first detection head, the second set to the second detection head, and so on. The first four anchor boxes have large areas and must be assigned to the first detection head, which outputs the lower resolution 19-by-19 feature map. The next four anchor boxes must be assigned to the second detection head, which outputs the feature map of size 38-by-38. The last four anchor boxes are assigned to the third detection head that outputs the highest resolution 76-by-76 feature map.

    detector = yolov4ObjectDetector("csp-darknet53-coco","car",anchorBoxes);

  5. Train the detector by using the trainYOLOv4ObjectDetector function.

    options = trainingOptions("adam",MaxEpochs=50);   % example training options
    detector = trainYOLOv4ObjectDetector(trainingData,detector,options);

Transfer Learning

To perform transfer learning, use a pretrained convolutional neural network (CNN) as the base network for a YOLO v4 deep learning network. Configure the YOLO v4 deep learning network for training on a new data set by specifying the anchor boxes and the new object classes. Use the yolov4ObjectDetector object to create a custom YOLO v4 detection network from any pretrained CNN, such as ResNet-50. Then, train the network by using the trainYOLOv4ObjectDetector function.
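
As a sketch of this workflow, the following configures a custom detector from a pretrained ResNet-50. The class names, anchor boxes, and feature extraction layer names are illustrative assumptions; inspect the layers of your base network to choose suitable source layers:

    % Load a pretrained base network (requires its support package; assumes a
    % release that provides imagePretrainedNetwork).
    net = imagePretrainedNetwork("resnet50");

    classes = ["car" "person"];                          % new object classes (example)
    anchorBoxes = {[122 177; 55 146]; [35 110; 19 39]};  % one cell per detection head (example)

    % Layers whose outputs feed the detection subnetworks (illustrative names).
    featureLayers = ["activation_22_relu" "activation_40_relu"];

    detector = yolov4ObjectDetector(net,classes,anchorBoxes, ...
        DetectionNetworkSource=featureLayers);

    % trainingData and options defined as in the previous section.
    detector = trainYOLOv4ObjectDetector(trainingData,detector,options);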

For information about how to create a custom YOLO v4 object detector, see Create Custom YOLO v4 Object Detector.

Label Training Data for Deep Learning

You can use the Image Labeler, Video Labeler, or Ground Truth Labeler (Automated Driving Toolbox) app to interactively label pixels and export label data for training. You can also use the apps to label rectangular regions of interest (ROIs) for object detection, scene labels for image classification, and pixels for semantic segmentation. To create training data from a ground truth object exported by any of the labelers, use the objectDetectorTrainingData or pixelLabelTrainingData functions. For more details, see Training Data for Object Detection and Semantic Segmentation.
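
For example, assuming gTruth is a groundTruth object exported from one of the labeler apps, you can convert it into training data as follows:

    % Table output: one row per image, with file names and box labels.
    trainingData = objectDetectorTrainingData(gTruth);

    % Or datastore output, which you can combine for training functions
    % that accept datastores.
    [imds,blds] = objectDetectorTrainingData(gTruth);
    ds = combine(imds,blds);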


References

[1] Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv:2004.10934 [cs.CV], April 22, 2020.

[2] Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You Only Look Once: Unified, Real-Time Object Detection." In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–788. Las Vegas, NV, USA: IEEE, 2016.
