Deformable Convolutional Networks

last modified : 27-11-2018

General Information

  • Title: Deformable Convolutional Networks
  • Authors: Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei
  • Link: article
  • Date of first submission: 17 March 2017
  • Implementations:

Brief

In this paper, the authors argue that neural networks are limited in modeling geometric transformations because of the fixed geometric structure of the layers making up the network. To remove this constraint, they introduce layers that can adapt to the shape of the object being located (i.e. rather than forcing the network into pre-set sampling grids, let it learn for itself what is best). The deformable layers can easily replace their standard counterparts in any network with little increase in computation cost (deformable convolutional layers replace convolutional layers, deformable pooling layers replace pooling layers), and they can be learned through back-propagation.

How Does It Work

The article presents three types of layers: the deformable convolution, the deformable RoI pooling, and the deformable position-sensitive (PS) RoI pooling.

All the deformable layers are similar in conception: a branch processes the input feature map to produce offsets, and bilinear interpolation is then applied to the input feature map at the offset positions to obtain the output values.
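The sampling step shared by all three layers can be sketched as follows (a minimal NumPy sketch; the function name and border-clamping choice are ours, not the authors'):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at a fractional position (y, x).

    Positions outside the map are clamped to the border. This is the
    interpolation used to read the input at the learned offset locations.
    """
    H, W = feat.shape
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    # interpolate along x on the two neighbouring rows, then along y
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bot

feat = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(feat, 1.5, 0.5))  # → 6.5
```

Because the interpolation weights are differentiable in (y, x), gradients flow back into the offsets, which is what makes the offset branch trainable.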

Deformable Convolution:

Deformable Convolution

Deformable RoI Pooling:

Deformable RoI Pooling

Deformable PS RoI Pooling:

Deformable PS RoI Pooling

Results

The table below shows the results of various architectures with and without the deformable layers. The architectures with deformable layers consistently outperform those without.

Object detection results of deformable ConvNets vs. plain ConvNets on the COCO test-dev set. M denotes multi-scale testing, and B denotes iterative bounding box averaging:

| method | backbone architecture | M | B | mAP@(0.5:0.95) | mAP@0.5 | mAP@(0.5:0.95) (small) | mAP@(0.5:0.95) (mid) | mAP@(0.5:0.95) (large) |
|---|---|---|---|---|---|---|---|---|
| class-aware RPN | ResNet-101 | | | 23.2 | 42.6 | 6.9 | 27.1 | 35.1 |
| Ours | ResNet-101 | | | 25.8 | 45.9 | 7.2 | 28.3 | 40.7 |
| Faster R-CNN | ResNet-101 | | | 29.4 | 48.0 | 9.0 | 30.5 | 47.1 |
| Ours | ResNet-101 | | | 33.1 | 50.3 | 11.6 | 34.9 | 51.2 |
| R-FCN | ResNet-101 | | | 30.8 | 52.6 | 11.8 | 33.9 | 44.8 |
| Ours | ResNet-101 | | | 34.5 | 55.0 | 14.0 | 37.7 | 50.3 |
| Faster R-CNN | Aligned-Inception-ResNet | | | 30.8 | 49.6 | 9.6 | 32.5 | 49.0 |
| Ours | Aligned-Inception-ResNet | | | 34.1 | 51.1 | 12.2 | 36.5 | 52.4 |
| R-FCN | Aligned-Inception-ResNet | | | 32.9 | 54.5 | 12.5 | 36.3 | 48.3 |
| Ours | Aligned-Inception-ResNet | | | 36.1 | 56.7 | 14.8 | 39.8 | 52.2 |
| R-FCN | Aligned-Inception-ResNet | X | | 34.5 | 55.0 | 16.8 | 37.3 | 48.3 |
| Ours | Aligned-Inception-ResNet | X | | 37.1 | 57.3 | 18.8 | 39.7 | 52.3 |
| R-FCN | Aligned-Inception-ResNet | X | X | 35.5 | 55.6 | 17.8 | 38.4 | 49.3 |
| Ours | Aligned-Inception-ResNet | X | X | 37.5 | 58.0 | 19.4 | 40.1 | 52.5 |

In Depth

Deformable Convolution

Deformable Convolution

A convolution is applied to the input feature map to produce the offset field, which has 2N channels (N 2D offsets, [(x1, y1), (x2, y2), ...], one per kernel sampling position). Then, for each position of the output map, the value is computed by sampling the input at the offset-shifted kernel positions.

The input and output feature maps have the same spatial dimensions, and when an offset falls between cells of the grid, the values used by the convolution are obtained by bilinear interpolation.

The offset-generating conv layer is learned with back-propagation, jointly with the rest of the network.
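This computation can be sketched naively for a single-channel 3x3 kernel (a NumPy illustration under our own naming and shape conventions, not the paper's efficient GPU implementation; with all offsets zero it reduces to a plain convolution):

```python
import numpy as np

def bilinear(feat, y, x):
    # clamp to the border, then bilinearly interpolate at (y, x)
    H, W = feat.shape
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * ((1 - wx) * feat[y0, x0] + wx * feat[y0, x1])
            + wy * ((1 - wx) * feat[y1, x0] + wx * feat[y1, x1]))

def deform_conv2d_naive(x, weight, offsets):
    """Naive single-channel deformable 3x3 convolution (stride 1, no padding).

    offsets has shape (H_out, W_out, 9, 2): one learned (dy, dx) per kernel
    sampling position, as produced by the offset branch.
    """
    H, W = x.shape
    Ho, Wo = H - 2, W - 2
    out = np.zeros((Ho, Wo))
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # regular grid
    for i in range(Ho):
        for j in range(Wo):
            p0y, p0x = i + 1, j + 1          # centre of the receptive field
            for k, (dy, dx) in enumerate(grid):
                oy, ox = offsets[i, j, k]    # learned offset for this tap
                out[i, j] += weight[dy + 1, dx + 1] * bilinear(
                    x, p0y + dy + oy, p0x + dx + ox)
    return out
```

For example, calling it with a 3x3 averaging kernel and all-zero offsets gives the same result as an ordinary convolution, while non-zero offsets let each output position sample off the regular grid.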

The other layers work in a similar fashion; see the article for more details on them.