YOLO

last modified : 13-06-2019

General Information

Title: You Only Look Once: Unified, Real-Time Object Detection
Authors: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
Link: article
Date of first submission: 8 Jun 2015
Implementations:
- Darknet

Brief

YOLO is a one-shot detector, meaning that it makes a single pass over the image to output all the detections. The obvious advantage of this method is faster computation and therefore a higher number of frames processed per second. The downside is a mAP somewhat below that of the top detectors.

How Does It Work

The architecture of the network is quite simple: it is a series of convolutional layers (24 in the paper) followed by fully connected layers (2 in the paper).

The following diagram shows the layers of the network:

YOLO architecture

The main idea is to divide the image being processed into a grid of cells, each of which predicts a fixed number of boxes. The last layer contains all the box coordinates, confidences and class probabilities for every cell. This way the whole image is covered with a pre-defined set of boxes.
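As a minimal sketch of the grid idea, the snippet below finds which cell contains the center of an object; in the paper, that cell is the one responsible for detecting the object. The function name `responsible_cell` and its arguments are hypothetical, chosen for illustration, and the 448x448 input size in the example is the resolution used in the paper.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the S x S grid cell containing the point (cx, cy).

    cx, cy are the object's center in pixels; img_w, img_h are the image size.
    The min() keeps a center lying exactly on the right/bottom edge inside the grid.
    """
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


# An object centered at (224, 100) in a 448x448 image falls in row 1, column 3.
print(responsible_cell(224, 100, 448, 448))  # (1, 3)
```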

Results

The final results, taken from the article, are the following:

VOC 2012

Network	mAP
Fast R-CNN + YOLO	70.7
Faster R-CNN	70.4
YOLO	57.9

Speed

Real-Time Detectors	Train	mAP	FPS
100Hz DPM	2007	16.0	100
30Hz DPM	2007	26.1	30
Fast YOLO	2007+2012	52.7	155
YOLO	2007+2012	63.4	45

Less Than Real-Time	Train	mAP	FPS
Fastest DPM	2007	30.4	15
R-CNN Minus R	2007	53.5	6
Fast R-CNN	2007+2012	70.0	0.5
Faster R-CNN VGG-16	2007+2012	73.2	7
Faster R-CNN ZF	2007+2012	62.1	18
YOLO VGG-16	2007+2012	66.4	21

In Depth

We will only briefly describe how the grid of boxes works. For more details, check this link, which explains the whole network very clearly.

If we look at the architecture diagram in the How Does It Work section above, we can see that the last layer has size 7x7x30; this is the output size for the PASCAL VOC challenge.

The 7x7 is the size of the grid (see image below). The 30 is made of the 20 class probabilities plus 2 boxes of 5 values each (4 coordinates and a confidence score): 20 + 2 x 5 = 30. It is important to see that each cell predicts only one set of class probabilities, shared by its two boxes.
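The arithmetic behind the output size can be checked with a short sketch. The ordering of the values inside a cell (boxes first, then class probabilities) and the helper name `split_cell` are assumptions made for illustration; an actual implementation such as Darknet may lay out the tensor differently.

```python
S, B, C = 7, 2, 20      # grid size, boxes per cell, number of PASCAL VOC classes
DEPTH = B * 5 + C       # 2 boxes * (x, y, w, h, confidence) + 20 class scores


def split_cell(cell, B=B, C=C):
    """Split one cell vector into B boxes of 5 values and C shared class scores.

    Assumed layout (illustrative only): [box_0, box_1, ..., class scores].
    """
    boxes = [cell[i * 5:(i + 1) * 5] for i in range(B)]
    class_probs = cell[B * 5:B * 5 + C]
    return boxes, class_probs


# Dummy cell vector standing in for one grid cell of a real network output.
cell = [0.0] * DEPTH
boxes, class_probs = split_cell(cell)

print(DEPTH)                                         # 30
print(S * S * DEPTH)                                 # 1470 values in total
print(len(boxes), len(boxes[0]), len(class_probs))   # 2 5 20
```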

yolo grid

Then, for each of the two boxes of a grid cell, the coordinates are regressed, meaning that a box can actually be much bigger than the cell itself. After that, the boxes go through the usual post-processing (confidence thresholding and non-maximum suppression) to produce the final output.
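The decoding and post-processing can be sketched as follows. In the paper, a box center (x, y) is expressed relative to its cell and the size (w, h) relative to the whole image, which is why a box can extend well beyond its cell. The function names, the thresholds (0.2 and 0.5) and the single-class greedy suppression are assumptions of this sketch, not the reference implementation. The score of each detection is assumed to be the class probability multiplied by the box confidence.

```python
def decode_box(x, y, w, h, row, col, S=7):
    """Turn cell-relative (x, y) and image-relative (w, h) into corner
    coordinates (x1, y1, x2, y2), normalized to [0, 1] over the image."""
    cx = (col + x) / S
    cy = (row + y) / S
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, score_thresh=0.2, iou_thresh=0.5):
    """Greedy non-maximum suppression for one class.

    detections is a list of (score, box); thresholds are illustrative values.
    """
    candidates = sorted(
        (d for d in detections if d[0] >= score_thresh),
        key=lambda d: d[0],
        reverse=True,
    )
    kept = []
    for score, box in candidates:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(box, k[1]) <= iou_thresh for k in kept):
            kept.append((score, box))
    return kept


# A box predicted by cell (3, 3) spans 60% of the image, far more than one cell.
big = decode_box(0.5, 0.5, 0.6, 0.6, row=3, col=3)
near_duplicate = decode_box(0.6, 0.5, 0.6, 0.6, row=3, col=3)
weak = (0.0, 0.0, 0.1, 0.1)

kept = nms([(0.9, big), (0.6, near_duplicate), (0.1, weak)])
print(len(kept))  # 1
```

Only the highest-scoring box survives here: the near duplicate overlaps it too much, and the weak detection falls below the score threshold.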