Getting Started with YOLO v3

The you-only-look-once (YOLO) v3 object detector is a multi-scale object detection network that uses a feature extraction network and multiple detection heads to make predictions at multiple scales.

The YOLO v3 object detection model runs a deep learning convolutional neural network (CNN) on an input image to produce network predictions from multiple feature maps. The object detector gathers and decodes predictions to generate the bounding boxes.

Predicting Objects in the Image

YOLO v3 uses anchor boxes to detect classes of objects in an image. For more details, see Anchor Boxes for Object Detection. YOLO v3 predicts these three attributes for each anchor box:

Intersection over union (IoU): Predicts the objectness score of each anchor box.
Anchor box offsets: Refine the anchor box position.
Class probability: Predicts the class label assigned to each anchor box.
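
To make the IoU attribute concrete, the following is a minimal Python sketch of how the overlap between two boxes can be measured. It is an illustration only, not the toolbox implementation. The function name iou and the (x, y, width, height) box layout, with (x, y) as the top-left corner, are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height).

    Assumption for this sketch: (x, y) is the top-left corner of the box.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extent along each axis; zero when the boxes do not overlap.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0
```

A value near 1 means the two boxes almost coincide, and a value of 0 means they do not overlap at all.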

The figure shows predefined anchor boxes (the dotted lines) at each location in a feature map and the refined location after offsets are applied. Matched boxes with a class are in color.
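
The refinement step can be sketched in Python as follows. The sketch uses the offset parameterization described in the original YOLO v3 design: a sigmoid keeps the predicted center inside its grid cell, and an exponential scales the anchor size. The function names, the argument layout, and the use of pixel units for the anchor are assumptions for illustration, and the toolbox may decode its predictions differently.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def decode_box(offsets, cell, anchor, grid_size, image_size):
    """Convert raw offsets (tx, ty, tw, th) into a box (center_x, center_y, width, height).

    Hypothetical layout for this sketch:
      cell       -- (column, row) of the grid cell that made the prediction
      anchor     -- (width, height) of the predefined anchor box, in pixels
      grid_size  -- (columns, rows) of the feature map
      image_size -- (width, height) of the input image, in pixels
    """
    tx, ty, tw, th = offsets
    col, row = cell
    anchor_w, anchor_h = anchor
    grid_w, grid_h = grid_size
    image_w, image_h = image_size
    # The sigmoid bounds the center shift to (0, 1), so the center stays in its cell.
    center_x = (col + sigmoid(tx)) / grid_w * image_w
    center_y = (row + sigmoid(ty)) / grid_h * image_h
    # The exponential rescales the anchor; zero offsets leave its size unchanged.
    width = anchor_w * math.exp(tw)
    height = anchor_h * math.exp(th)
    return center_x, center_y, width, height
```

With all offsets equal to zero, the decoded box is the anchor box itself, centered in its grid cell.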

Design a YOLO v3 Detection Network

To design a YOLO v3 object detection network, follow these steps.

Start the model with a feature extraction network. The feature extraction network serves as the base network for creating the YOLO v3 deep learning network. The base network can be a pretrained or untrained CNN. If the base network is a pretrained network, you can perform transfer learning.
Create detection subnetworks by using convolution, batch normalization, and ReLU layers. Add the detection subnetworks to any of the layers in the base network. The output layers that connect as inputs to the detection subnetworks are the detection network source. Any layer from the feature extraction network can be used as a detection network source. To use multiscale features for object detection, choose feature maps of different sizes.
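
To see how the choice of detection network source affects the predictions, the following Python sketch computes the output size of each detection head. It assumes the convention of the original YOLO v3 design, in which each anchor box predicts four offsets, one objectness score, and one probability per class. The function name and the stride values in the usage note are hypothetical. The actual downsampling factor depends on which base network layers you select.

```python
def head_output_shapes(input_size, strides, anchors_per_head, num_classes):
    """Return (rows, columns, channels) for each detection head.

    input_size       -- (height, width) of the network input, in pixels
    strides          -- downsampling factor of each detection network source
    anchors_per_head -- number of anchor boxes assigned to each head
    """
    height, width = input_size
    # Per anchor: 4 box offsets + 1 objectness score + class probabilities.
    channels_per_anchor = 4 + 1 + num_classes
    shapes = []
    for stride, num_anchors in zip(strides, anchors_per_head):
        shapes.append((height // stride, width // stride,
                       num_anchors * channels_per_anchor))
    return shapes
```

For example, head_output_shapes((416, 416), (32, 16, 8), (3, 3, 3), 80) describes three heads with 13-by-13, 26-by-26, and 52-by-52 feature maps. The coarse map suits large objects and the fine map suits small ones.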

To manually create a YOLO v3 deep learning network, use the Deep Network Designer (Deep Learning Toolbox) app. To programmatically create a YOLO v3 deep learning network, use the yolov3ObjectDetector object.

Transfer Learning

To perform transfer learning, you can use a pretrained deep learning network as the base network for the YOLO v3 deep learning network. Configure the YOLO v3 deep learning network for training on a new dataset by specifying the anchor boxes and the new object classes. Use the yolov3ObjectDetector object to create a YOLO v3 detection network from any pretrained CNN, such as SqueezeNet, and perform transfer learning. For a list of pretrained CNNs, see Pretrained Deep Neural Networks (Deep Learning Toolbox).

Train an Object Detector and Detect Objects with a YOLO v3 Model

To train a YOLO v3 object detection network on a labeled dataset, use the trainYOLOv3ObjectDetector function. You must specify the class names and the predefined anchor boxes for the dataset that you use to train the network.
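
One common way to choose predefined anchor boxes is to cluster the sizes of the labeled boxes in the training data, using IoU rather than Euclidean distance to compare sizes. The following Python sketch illustrates that idea with a small k-means loop. The function names and the deterministic initialization are assumptions for this example, and the sketch is not the procedure that any particular toolbox function uses.

```python
def size_iou(size_a, size_b):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    intersection = min(size_a[0], size_b[0]) * min(size_a[1], size_b[1])
    union = size_a[0] * size_a[1] + size_b[0] * size_b[1] - intersection
    return intersection / union


def estimate_anchors(box_sizes, num_anchors, iterations=50):
    """Cluster (width, height) pairs with k-means, treating higher IoU as closer.

    Assumes positive box sizes and num_anchors <= len(box_sizes).
    """
    # Deterministic start: evenly spaced picks from the sizes sorted by area.
    ordered = sorted(box_sizes, key=lambda s: s[0] * s[1])
    step = len(ordered) / num_anchors
    anchors = [ordered[int((i + 0.5) * step)] for i in range(num_anchors)]
    for _ in range(iterations):
        # Assign every labeled box to the anchor it overlaps most.
        clusters = [[] for _ in anchors]
        for size in box_sizes:
            best = max(range(num_anchors), key=lambda k: size_iou(size, anchors[k]))
            clusters[best].append(size)
        # Move each anchor to the mean size of its assigned boxes.
        updated = []
        for anchor, members in zip(anchors, clusters):
            if members:
                updated.append((sum(w for w, _ in members) / len(members),
                                sum(h for _, h in members) / len(members)))
            else:
                updated.append(anchor)  # keep an anchor that attracted no boxes
        if updated == anchors:
            break
        anchors = updated
    # Return the anchors ordered from smallest to largest area.
    return sorted(anchors, key=lambda s: s[0] * s[1])
```

Sorting the result by area makes it straightforward to give the larger anchors to the detection head with the coarser feature map and the smaller anchors to the head with the finer one.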

The training function returns the trained network as a yolov3ObjectDetector object. You can then use the detect function to detect unknown objects in a test image with the trained YOLO v3 object detector. To learn how to create a custom YOLO v3 object detector by using a deep learning network as the base network and train it for object detection, see the Object Detection Using YOLO v3 Deep Learning example.

Label Training Data for Deep Learning

You can use the Image Labeler, Video Labeler, or Ground Truth Labeler (Automated Driving Toolbox) apps to interactively label pixels and export label data for training. You can also use the apps to label rectangular regions of interest (ROIs) for object detection, scene labels for image classification, and pixels for semantic segmentation. To create training data from the ground truth object exported by any of the labelers, use the objectDetectorTrainingData or pixelLabelTrainingData function. For more details, see Training Data for Object Detection and Semantic Segmentation.
