last modified : 13-06-2019

General Information

  • Title: YOLO9000: Better, Faster, Stronger
  • Authors: Joseph Redmon, Ali Farhadi
  • Link: article
  • Date of first submission:25 December 2016
  • Implementations:


This is an improved version of the YOLO network. It was designed to palliate to some defect of the YOLO: the precision of the network and the level of recall. On top of that, the new version can now predict up to 9000 classes and predict unseen classes.

How Does It Work

The main idea stays the same as for the YOLO network, but the last layers are not fully connected but convolutional ones. The number of convolutional layers is increased, but since the fully connected layers are removed, the speed is not impacted.


Only the main results are show here, for more go check the article.

PASCAL VOC 2007 detection results:

Network Train mAP FPS
Fast R-CNN 2007+2012 70.0 0.5
Faster R-CNN VGG-16 2007+2012 73.2 7
Faster R-CNN ResNet 2007+2012 76.4 5
YOLO 2007+2012 63.4 45
SSD300 2007+2012 74.3 46
SSD500 2007+2012 76.8 19
YOLOv2 288x88 2007+2012 69.0 91
YOLOv2 352x352 2007+2012 73.7 81
YOLOv2 416x416 2007+2012 76.8 67
YOLOv2 480x480 2007+2012 77.8 59
YOLOv2 544x544 2007+2012 78.6 40

PASCAL VOC2012 test detection results:

Network data mAP
Fast R-CNN 07++12 68.4
Faster R-CNN 07++12 70.4
YOLO 07++12 57.9
SSD300 07++12 72.4
SSD512 07++12 74.9
ResNet 07++12 73.8
YOLOv2 544 07++12 73.4

In Depth

The new version of the YOLO uses many techniques to improve the results of the previous version.

The first two methods used are batch normalization and increase in the resolution of the input images. They add batch normalization on every convolutional layer of the network to get an improvement of mAP of 2%. And they increase the size of resolution for detection to 448x448 and get a 4% increase in the mAP.

The other improvements is the use of anchor boxes picked using the k-means algorithm. They use the k-means algorithm to pick anchor boxes fitting best the distribution of their objects to detect in the images.

Finally they changed the dimension of the input images during the training to have their network learn to do multi scale classification.