RCNN

last modified : 13-06-2019

General Information

Title: Rich feature hierarchies for accurate object detection and semantic segmentation
Authors: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
Link: article
Date of first submission: 11 November 2013
Implementations:
- Caffe

Brief

RCNN is one of the pioneering networks for object detection. Its design is closely related to the OverFeat network; as the article puts it: "OverFeat can be seen (roughly) as a special case of R-CNN."

Although its architecture is related to OverFeat, RCNN outperformed the other published results on the benchmarks listed in the Results section at the time of its publication.

One of the main contributions of the paper is to demonstrate the gain obtained by pre-training on a large auxiliary dataset and then fine-tuning on the target dataset.

RCNN is not an end-to-end model: the region proposals, the CNN and the SVMs are separate stages.

How Does It Work

The network is made of three main parts: the region extractor, the feature extractor and the classifier. The whole network is shown in the following image:

RCNN Network

A region proposal algorithm extracts regions of interest (ROI), each region is then fed to a CNN that extracts a feature vector, and finally the extracted features are classified.
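The three stages can be sketched as a simple composition of functions. This is only an illustrative skeleton: the names `detect`, `propose_regions`, `extract_features` and `score_classes`, as well as the `(x1, y1, x2, y2)` box convention, are hypothetical and do not come from the paper or its Caffe implementation.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[int, int, int, int]


def detect(
    image,
    propose_regions: Callable[[object], Sequence[Box]],
    extract_features: Callable[[object, Box], Sequence[float]],
    score_classes: Callable[[Sequence[float]], Sequence[float]],
) -> List[Tuple[Box, List[float]]]:
    """Run the three stages in order and return (box, per-class scores) pairs."""
    results = []
    for box in propose_regions(image):           # stage 1: region proposals
        features = extract_features(image, box)  # stage 2: feature vector for the region
        scores = list(score_classes(features))   # stage 3: one score per class
        results.append((box, scores))
    return results
```

Because each stage is passed in as a separate callable, the sketch also makes visible why the model is not end-to-end: every stage can be built and trained on its own.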

Results

Results for the PASCAL VOC 2010 challenge:

Model	mAP (%)
RCNN	53.7
SegDPM	40.4
Regionlets	39.7
UVA	35.1
DPM v5	33.4

Results for the ILSVRC 2013 challenge:

Model	mAP (%)
RCNN	31.4
OverFeat	24.3

In Depth

At test time

The first part of the network uses the selective search algorithm to generate around 2,000 candidate boxes per image, each of which may contain an object.

Then, the second part of the network uses the CNN from Krizhevsky et al. to generate a 4096-dimensional feature vector from each proposed box. The input to the CNN is a 227x227 mean-subtracted warped RGB image, so every box is resized to that fixed size regardless of its original shape.
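The warping step can be illustrated with a minimal pure-Python sketch. Several details here are assumptions made for illustration only: the image is a nested list of `(R, G, B)` tuples, the box is `(x1, y1, x2, y2)` with exclusive `x2` and `y2`, resampling is nearest-neighbour, and the mean is a single per-channel value. The function name `warp_region` is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]  # rows of (R, G, B) pixels


def warp_region(image: Image, box: Tuple[int, int, int, int],
                mean: Pixel, size: int = 227) -> Image:
    """Crop `box` and stretch it to size x size, ignoring aspect ratio.

    Assumptions: box = (x1, y1, x2, y2) with x2/y2 exclusive,
    nearest-neighbour sampling, per-channel mean subtraction.
    """
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    if width <= 0 or height <= 0:
        raise ValueError("box must have positive width and height")
    warped = []
    for row in range(size):
        # Map the output row back to a source row inside the box.
        src_y = y1 + (row * height) // size
        out_row = []
        for col in range(size):
            src_x = x1 + (col * width) // size
            r, g, b = image[src_y][src_x]
            out_row.append((r - mean[0], g - mean[1], b - mean[2]))
        warped.append(out_row)
    return warped
```

A real implementation would normally use an image library with proper interpolation; the point here is only that every proposal, whatever its shape, ends up as the same fixed-size input.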

Finally, each feature vector is scored by the class-specific SVMs, and non-maximum suppression is applied to the scored boxes to discard overlapping detections and keep the final ones.
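Greedy non-maximum suppression can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `iou` and `nms`, the `(x1, y1, x2, y2)` box convention and the default threshold of 0.3 are assumptions chosen for the example. In a detector with several classes it would typically be run once per class on that class's scores.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed convention: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        iou_threshold: float = 0.3) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is dropped when it overlaps an already kept, higher-scoring
    box by more than `iou_threshold` (an illustrative value).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept
```

For example, with two heavily overlapping boxes and one distant box, only the higher-scoring box of the overlapping pair and the distant box are kept.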

At training time

The CNN is trained with its final fully connected layer modified to match the number of classes in the target dataset.

The network is first pre-trained "on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only", with all the classes. Then the network is fine-tuned on the target dataset after replacing the 1000-way classification layer with a smaller one that matches the number of target classes, plus one for background.

Finally, once the CNN part has converged, the SVMs are trained.