Paper Reading Notes (34): An Analysis of Scale Invariance in Object Detection-SNIP

xiaoxiao  2021-02-28

An analysis of different techniques for recognizing and detecting objects under extreme scale variation is presented. Scale specific and scale invariant design of detectors are compared by training them with different configurations of input data. To examine if upsampling images is necessary for detecting small objects, we evaluate the performance of different network architectures for classifying small objects on ImageNet. Based on this analysis, we propose a deep end-to-end trainable Image Pyramid Network for object detection which operates on the same image scales during training and inference. Since small and large objects are difficult to recognize at smaller and larger scales respectively, we present a novel training scheme called Scale Normalization for Image Pyramids (SNIP) which selectively back-propagates the gradients of object instances of different sizes as a function of the image scale. On the COCO dataset, our single model performance is 45.7% and an ensemble of 3 networks obtains an mAP of 48.3%. We use ImageNet-1000 pre-trained models and only train with bounding box supervision. Our submission won the Best Student Entry in the COCO 2017 challenge. Code will be made available at http://bit.ly/2yXVg4c.


Deep learning has fundamentally changed how computers perform image classification and object detection. In less than five years, since AlexNet [18] was proposed, the top-5 error on ImageNet classification [8] has dropped from 15% to 2% [14]. This is super-human level performance for image classification with 1000 classes. On the other hand, the mAP of the best performing detector [16] (which is only trained to detect 80 classes) on COCO [23] is only 62% – even at 50% overlap. Why is object detection so much harder than image classification?


Large scale variation across object instances, and especially the challenge of detecting very small objects, stands out as one of the factors behind the difference in performance. Interestingly, the median scales of object instances relative to the image in ImageNet (classification) vs COCO (detection) are 0.554 and 0.106 respectively. Therefore, most object instances in COCO are smaller than 1% of image area! To make matters worse, the scale of the smallest and largest 10% of object instances in COCO is 0.024 and 0.472 respectively (resulting in scale variations of almost 20 times!); see Fig. 1. This variation in scale which a detector needs to handle is enormous and represents an extreme challenge to the scale invariance properties of convolutional neural networks. Moreover, differences in the scale of object instances between classification and detection datasets also result in a large domain-shift when fine-tuning from a pre-trained classification network. In this paper, we first provide evidence of these problems and then propose a training scheme called Scale Normalization for Image Pyramids which leads to a state-of-the-art object detector on COCO.
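
The figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes that "relative scale" means the square root of the ratio between object area and image area (the text does not define it explicitly), so the area fraction is the scale squared. The variable and function names are illustrative only.

```python
# Relative scales quoted in the text. The interpretation
# scale = sqrt(object_area / image_area) is an assumption.
median_scale_imagenet = 0.554
median_scale_coco = 0.106
smallest_10pct_coco = 0.024
largest_10pct_coco = 0.472


def area_fraction(relative_scale: float) -> float:
    """Fraction of the image area covered at the given relative scale."""
    return relative_scale ** 2


coco_median_area = area_fraction(median_scale_coco)
imagenet_median_area = area_fraction(median_scale_imagenet)
scale_ratio = largest_10pct_coco / smallest_10pct_coco

print(f"COCO median object area:     {coco_median_area:.2%} of the image")
print(f"ImageNet median object area: {imagenet_median_area:.2%} of the image")
print(f"Largest-10% / smallest-10% scale ratio in COCO: {scale_ratio:.1f}x")
```

Under this assumption the median COCO instance covers roughly 1% of the image, while the median ImageNet instance covers roughly 30%, and the ratio between the two COCO percentiles is just under 20, which matches the claims in the paragraph.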


To alleviate the problems arising from scale variation and small object instances, multiple solutions have been proposed. For example, features from the layers near to the input, referred to as shallow(er) layers, are combined with deeper layers for detecting small object instances [21, 33, 1, 11, 25], dilated/deformable convolution is used to increase receptive fields for detecting large objects [30, 6, 37, 7], independent predictions at layers of different resolutions are used to capture object instances of different scales [35, 3, 20], context is employed for disambiguation [39, 40, 9], training is performed over a range of scales [6, 7, 13], or inference is performed on multiple scales of an image pyramid and predictions are combined using non-maximum suppression [6, 7, 2, 31].


While these architectural innovations have significantly helped to improve object detection, many important issues related to training remain unaddressed:

• Is it critical to upsample images for obtaining good performance for object detection? Even though the typical size of images in detection datasets is 480x640, why is it a common practice to up-sample them to 800x1200? Can we pre-train CNNs with smaller strides on low resolution images from ImageNet and then fine-tune them on detection datasets for detecting small object instances?

• When fine-tuning an object detector from a pre-trained image classification model, should the resolution of the training object instances be restricted to a tight range (from 64x64 to 256x256) after appropriately re-scaling the input images, or should all object resolutions (from 16x16 to 800x1000, in the case of COCO) participate in training after up-sampling input images?


We design controlled experiments on ImageNet and COCO to seek answers to these questions. In Section 3, we study the effect of scale variation by examining the performance of existing networks for ImageNet classification when images of different scales are provided as input. We also make minor modifications to the CNN architecture for classifying images of different scales. These experiments reveal the importance of up-sampling for small object detection. To analyze the effect of scale variation on object detection, we train and compare the performance of scale-specific and scale-invariant detector designs in Section 5. For scale-specific detectors, variation in scale is handled by training separate detectors, one for each scale range. Moreover, training the detector on object instances of a similar scale to those seen by the pre-trained classification network helps to reduce the domain shift for the detector backbone. But scale-specific designs also reduce the number of training samples per scale, which degrades performance. On the other hand, training a single object detector with all training samples makes the learning task significantly harder because the network needs to learn filters for detecting object instances over a wide range of scales.


Based on these observations, in Section 6 we present a novel training paradigm, which we refer to as Scale Normalization for Image Pyramids (SNIP), that benefits from reducing scale-variation during training but without paying the penalty of reduced training samples. Scale invariance is achieved using an image-pyramid (instead of a scale-invariant detector), which contains normalized input representations of object instances in one of the scales in the image-pyramid. To minimize the domain shift for the backbone CNN, we only back-propagate gradients for RoIs/anchors that have a resolution close to that of the pre-training dataset. Since we train on each scale in the pyramid with the above constraint, SNIP effectively utilizes all the object instances available during training. The proposed approach is generic and can be plugged into the training pipeline of different problems like instance-segmentation, pose-estimation, spatio-temporal action detection wherever the “objects” of interest manifest large scale variations.
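
The selective back-propagation described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical three-level pyramid (scale factors 0.5, 1.0 and 3.0) and borrows the 64 to 256 pixel range mentioned earlier in these notes as the "valid" object size after resizing. The names `valid_mask` and `masked_loss` are made up for illustration.

```python
from math import sqrt

# Hypothetical valid range (in pixels, after resizing). The 64-256 figure is
# borrowed from the question posed earlier in the notes; the actual ranges
# used by the authors are not given in this excerpt.
VALID_RANGE = (64.0, 256.0)


def box_side(box):
    """Square root of the box area, for a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return sqrt(max(x2 - x1, 0.0) * max(y2 - y1, 0.0))


def valid_mask(boxes, scale, valid_range=VALID_RANGE):
    """True for boxes whose size after resizing by `scale` is in range."""
    lo, hi = valid_range
    return [lo <= box_side(b) * scale <= hi for b in boxes]


def masked_loss(per_roi_losses, mask):
    """Average only the losses of valid RoIs; invalid ones get no gradient."""
    kept = [loss for loss, keep in zip(per_roi_losses, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Three ground-truth boxes in original-image coordinates: small, medium, large.
BOXES = [(0, 0, 25, 25), (0, 0, 100, 100), (0, 0, 400, 400)]

for scale in (0.5, 1.0, 3.0):
    print(scale, valid_mask(BOXES, scale))
```

With these illustrative numbers each object is valid at exactly one pyramid level: the large object at the smallest scale, the small object at the largest scale. Every instance still contributes to training, but only at a resolution close to what the backbone saw during pre-training.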


Contrary to the popular belief that deep neural networks can learn to cope with large variations in scale given enough training data, we show that SNIP offers significant improvements (3.5%) over traditional object detection training paradigms. Our ensemble of Image Pyramid Networks with a Deformable-RFCN backbone obtains an mAP of 69.7% at 50% overlap, which is an improvement of 7.4% over the state-of-the-art on the COCO dataset.


Scale space theory [34, 24] advocates learning representations that are invariant to scale and the theory has been applied to many problems in the history of computer vision [4, 28, 26, 19, 12, 5, 21]. For problems like object detection, pose-estimation, instance segmentation etc., learning scale invariant representations is critical for recognizing and localizing objects. To detect objects at multiple scales, many solutions have been proposed.


The deeper layers of modern CNNs have large strides (32 pixels) that lead to a very coarse representation of the input image, which makes small object detection very challenging. To address this problem, modern object detectors [30, 6, 5] employ dilated/atrous convolutions to increase the resolution of the feature map. Dilated/deformable convolutions also preserve the weights and receptive fields of the pre-trained network and do not suffer from degraded performance on large objects. Up-sampling the image by a factor of 1.5 to 2 times during training and up to 4 times during inference is also a common practice to increase the final feature map resolution [7, 6, 13]. Since feature maps of layers closer to the input are of higher resolution and often contain complementary information (w.r.t. conv5), these features are either combined with shallower layers (like conv4, conv3) [21, 29, 1, 29] or independent predictions are made at layers of different resolutions [35, 25, 3]. Methods like SDP [35], SSH [27] or MS-CNN [3], which make independent predictions at different layers, also ensure that smaller objects are trained on higher resolution layers (like conv3) while larger objects are trained on lower resolution layers (like conv5). This approach offers better resolution at the cost of high-level semantic features, which can hurt performance.


Methods like FPN, Mask-RCNN, RetinaNet [21, 11, 22], which use a pyramidal representation and combine features of shallow layers with deeper layers, at least have access to higher level semantic information. However, if the size of an object is 25x25 pixels, then even an up-sampling factor of 2 during training will scale the object to only 50x50 pixels. Note that typically the network is pre-trained on images of resolution 224x224. Therefore, the high level semantic features (at conv5) generated even by feature pyramid networks will not be useful for classifying small objects (a similar argument can be made for large objects in high resolution images). Hence, combining them with features from shallow layers would not be good for detecting small objects, see Fig. 2. Although feature pyramids efficiently exploit features from all the layers in the network, they are not an attractive alternative to an image pyramid for detecting very small/large objects.
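
The mismatch described above is easy to quantify. The sketch below uses only numbers that appear in these notes (a 25x25 object, a 224x224 pre-training resolution, a stride of 32 at the deepest layers). The helper `describe` is a hypothetical name introduced for illustration.

```python
PRETRAIN_SIDE = 224  # pre-training image resolution quoted in the text
CONV5_STRIDE = 32    # stride of the deepest layers quoted in the text


def describe(object_side: float, upsample: float) -> dict:
    """Compare a resized object with the pre-training resolution and stride."""
    resized = object_side * upsample
    return {
        "resized_side": resized,
        "fraction_of_pretrain_side": resized / PRETRAIN_SIDE,
        "conv5_cells_per_side": resized / CONV5_STRIDE,
    }


for factor in (1, 2, 4):
    info = describe(25, factor)
    print(
        f"x{factor}: {info['resized_side']:.0f}px, "
        f"{info['fraction_of_pretrain_side']:.2f} of the pre-training side, "
        f"{info['conv5_cells_per_side']:.2f} conv5 cells per side"
    )
```

Even after 2x up-sampling, the object spans under a quarter of the side length the backbone was pre-trained on and fewer than two conv5 cells per side, which is the core of the argument that feature pyramids alone do not close the gap for very small objects.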


Recently, a pyramidal approach was proposed for detecting faces [15] where the gradients of all objects were back-propagated after max-pooling the responses from each scale. Different filters were used in the classification layers for faces at different scales. This approach has limitations for object detection because training data per class in object detection is limited and the variations in appearance, pose etc. are much larger compared to face detection. We, on the other hand, selectively back-propagate gradients for each scale and use the same filters irrespective of the scale of the object, thereby making better use of training data. We observe that adding scale specific filters in R-FCN for each class hurts performance for object detection. In [31], an image pyramid was generated and maxout [10] was used to select features from a pair of scales closer to the resolution of the pre-trained dataset during inference; however, standard multi-scale training (described in Section 5) was used.


We presented an analysis of different techniques for recognizing and detecting objects under extreme scale variation, which exposed shortcomings of the current object detection training pipeline. Based on the analysis, a training scheme (SNIP) was proposed to tackle the wide scale spectrum of object instances which participate in training and to reduce the domain-shift for the pre-trained classification network. Compared to a single-scale detector, SNIP obtains a 5% improvement in mAP, which highlights the importance of scale and image-pyramids in object detection.


Figure 2. The same layer convolutional features at different scales of the image are different and map to different semantic regions in the image at different scales.


Figure 3. Both CNN-B and CNN-B-FT are provided an upsampled low resolution image as input. CNN-S is provided a low resolution image as input. CNN-B is trained on high resolution images. CNN-S is trained on low resolution images. CNN-B-FT is pretrained on high resolution images and fine-tuned on upsampled low-resolution images.


Figure 6. SNIP training and inference for IPN is shown. Invalid RoIs which fall outside the specified range at each scale are shown in purple. These are discarded during training and inference. Each batch during training consists of images sampled from a particular scale. Invalid GT boxes are used to invalidate anchors in RPN. Detections from each scale are rescaled and combined using NMS.
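
The inference step in the caption (discard out-of-range detections at each scale, rescale the rest to the original image, merge with NMS) can be sketched as below. This is a simplified, class-agnostic illustration in plain Python, not the authors' code. The valid range of 64 to 256 pixels, the IoU threshold of 0.5, and the function names are all assumptions.

```python
from math import sqrt


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(dets, iou_threshold=0.5):
    """Greedy NMS over (box, score) pairs, highest score first."""
    kept = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept


def combine(per_scale_dets, valid_range=(64.0, 256.0), iou_threshold=0.5):
    """Filter each scale's detections by size, rescale, then merge with NMS."""
    lo, hi = valid_range
    merged = []
    for scale, dets in per_scale_dets.items():
        for box, score in dets:
            side = sqrt((box[2] - box[0]) * (box[3] - box[1]))
            if lo <= side <= hi:  # keep only in-range detections at this scale
                original = tuple(c / scale for c in box)  # back to image coords
                merged.append((original, score))
    return nms(merged, iou_threshold)


# Detections keyed by pyramid scale, in the coordinates of the resized image.
detections = {
    1.0: [((100, 100, 200, 200), 0.9)],
    2.0: [((204, 204, 404, 404), 0.8), ((20, 20, 60, 60), 0.7)],
    0.5: [((150, 150, 250, 250), 0.6)],
}
final = combine(detections)
print(final)
```

In this toy input, the 40-pixel box found at scale 2.0 is discarded as out of range, the other box from scale 2.0 maps onto nearly the same region as the detection from scale 1.0 and is suppressed, and the detection from scale 0.5 survives as a separate object.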


When reposting, please cite the original URL: https://www.6miu.com/read-2000290.html