YOLO9000

last modified : 13-06-2019

General Information

Title: YOLO9000: Better, Faster, Stronger
Authors: Joseph Redmon, Ali Farhadi
Link: article
Date of first submission: 25 December 2016
Implementations:
- darknet
- Tensorflow

Brief

This is an improved version of the YOLO network. It was designed to address two weaknesses of the original YOLO: its localization errors, which hurt precision, and its relatively low recall. In addition, the YOLO9000 variant can detect more than 9000 object classes, including classes for which it has seen no labelled detection data.

How Does It Work

The main idea is the same as in the YOLO network, but the final fully connected layers are replaced with convolutional ones. The number of convolutional layers increases, but because the fully connected layers are removed, speed does not suffer.
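Because the detection head is convolutional, the size of the output grid follows directly from the input size. The sketch below illustrates that arithmetic. It assumes a total downsampling factor of 32, 5 anchor boxes per cell, and the 20 PASCAL VOC classes, values taken from the article rather than from the notes above. The function name `output_shape` is hypothetical.

```python
def output_shape(input_size, stride=32, num_anchors=5, num_classes=20):
    """Return (grid_h, grid_w, channels) for a fully convolutional detection head.

    Assumed values: stride 32, 5 anchors per cell, 20 classes (PASCAL VOC).
    Each anchor predicts 4 box coordinates, 1 objectness score,
    and one score per class.
    """
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    grid = input_size // stride
    channels = num_anchors * (5 + num_classes)
    return grid, grid, channels


print(output_shape(416))  # (13, 13, 125)
print(output_shape(544))  # (17, 17, 125)
```

Under these assumptions, changing the input resolution only changes the grid size, not the number of weights, which is what lets one network run at the different resolutions listed in the results.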

Results

Only the main results are shown here; see the article for the rest.

PASCAL VOC 2007 detection results:

Network	Train	mAP	FPS
Fast R-CNN	2007+2012	70.0	0.5
Faster R-CNN VGG-16	2007+2012	73.2	7
Faster R-CNN ResNet	2007+2012	76.4	5
YOLO	2007+2012	63.4	45
SSD300	2007+2012	74.3	46
SSD500	2007+2012	76.8	19
YOLOv2 288x288	2007+2012	69.0	91
YOLOv2 352x352	2007+2012	73.7	81
YOLOv2 416x416	2007+2012	76.8	67
YOLOv2 480x480	2007+2012	77.8	59
YOLOv2 544x544	2007+2012	78.6	40

PASCAL VOC2012 test detection results:

Network	Train	mAP
Fast R-CNN	07++12	68.4
Faster R-CNN	07++12	70.4
YOLO	07++12	57.9
SSD300	07++12	72.4
SSD512	07++12	74.9
ResNet	07++12	73.8
YOLOv2 544	07++12	73.4

In Depth

The new version of YOLO uses several techniques to improve on the results of the previous version.

The first two are batch normalization and a higher input resolution. Batch normalization is added to every convolutional layer of the network, which improves mAP by about 2%. The classification network is also fine-tuned at a resolution of 448x448 before being trained for detection, which adds almost 4% mAP.

Another improvement is the use of anchor boxes whose dimensions are chosen by k-means clustering. Instead of picking the anchor shapes by hand, the authors run k-means on the bounding boxes of the training set to find the shapes that best fit the distribution of the objects to detect.
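The anchor selection can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes boxes are given as (width, height) pairs and uses the distance 1 - IoU described in the article, so that large boxes do not dominate the clustering as they would with Euclidean distance. The names `iou_wh` and `kmeans_anchors` are hypothetical, as are the toy box sizes.

```python
import random


def iou_wh(box, centroid):
    """IoU of two (width, height) boxes aligned at the same corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iterations=100, seed=0):
    """Cluster (width, height) pairs with distance d = 1 - IoU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest 1 - IoU distance is the same as largest IoU.
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                w = sum(b[0] for b in cluster) / len(cluster)
                h = sum(b[1] for b in cluster) / len(cluster)
                new_centroids.append((w, h))
            else:
                # Keep the old centroid if no box was assigned to it.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


# Toy data: three small, roughly square boxes and three tall boxes.
boxes = [(10, 10), (12, 11), (11, 12), (50, 100), (52, 98), (48, 105)]
anchors = kmeans_anchors(boxes, k=2)
print(anchors)
```

On this toy data the two anchors should settle near the mean shape of each group. In the article the authors use k = 5 as a trade-off between recall and model complexity.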

Finally, they change the dimensions of the input images during training so that the network learns to detect objects at multiple scales.
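A minimal sketch of such a multi-scale schedule is given below. It assumes, following the article, that a new input size is drawn every 10 batches from the multiples of 32 between 320 and 608. The function names and the seed are hypothetical, and the actual training loop (resizing images, updating weights) is left out.

```python
import random


def sample_input_size(rng, min_size=320, max_size=608, stride=32):
    """Pick a square input size that is a multiple of the network stride."""
    return rng.choice(range(min_size, max_size + 1, stride))


def training_sizes(num_batches, change_every=10, seed=0):
    """Return the input size used for each batch of a training run."""
    rng = random.Random(seed)
    size = sample_input_size(rng)
    sizes = []
    for batch in range(num_batches):
        # Draw a new resolution at the start of every block of batches.
        if batch > 0 and batch % change_every == 0:
            size = sample_input_size(rng)
        sizes.append(size)
    return sizes


sizes = training_sizes(35)
print(sizes)
```

Since the network is fully convolutional, the same weights are used at every sampled size. This is what allows the trade-off between speed and mAP shown in the PASCAL VOC 2007 table.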