Computer Systems & Applications, 2020, Vol. 29, Issue (9): 149-155

SSD Object Detection Algorithm with Feature Enhancement of Receptive Field
TAN Long, GAO Ang
School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China
Foundation item: General Program of National Natural Science Foundation of China (81373537); General Program of Natural Science Foundation of Heilongjiang Province, China (F201434)
Abstract: The SSD (Single Shot MultiBox Detector) algorithm detects multi-scale objects on feature maps from different layers, and is characterized by high speed and high accuracy. However, the feature pyramid detection scheme of the traditional SSD algorithm makes it difficult to fuse features of different scales, and the lower convolutional layers carry weak semantic information, which hinders the recognition of small objects. This paper therefore proposes a novel object detection algorithm, RF_SSD, built on the network structure of SSD. The algorithm fuses feature maps of different layers and scales in a lightweight way and generates new feature maps in the downsampling layers. By introducing a receptive field module, the feature extraction ability of the network is improved, and the representational power and robustness of the features are enhanced. Compared with the traditional SSD algorithm, the accuracy of the proposed algorithm is significantly improved while real-time detection performance is fully preserved. Experimental results show an accuracy of 80.2% and a detection speed of 44.5 FPS on the PASCAL VOC test set.
Key words: SSD algorithm; object detection; convolutional neural network; receptive field; computer vision

(1) A novel, lightweight feature fusion scheme is proposed. It merges feature maps from different layers and generates a feature pyramid, which lowers the probability of repeatedly detecting several parts of one object or of merging multiple objects into a single detection, while improving performance on small objects.

(2) Drawing on hybrid dilated convolution and the Inception structure, a receptive field module is designed and added to strengthen the feature extraction ability of the network. It enlarges the receptive field of the convolutions without adding convolution parameters, reinforces the deep features learned by the lightweight convolutional neural network, and preserves the real-time performance of the detector.

(3) Qualitative and quantitative experiments on the PASCAL VOC dataset show that, compared with the traditional SSD algorithm, the proposed algorithm significantly improves object detection performance and raises the accuracy on small objects at a relatively small cost in speed.

1 Related Work

2 The RF_SSD Algorithm

SSD detects objects on feature maps of different scales. It takes VGG16 [21] as the backbone network and generates feature maps of different scales through cascaded convolutions. Combining the regression idea of YOLO with the anchor mechanism of Faster R-CNN, it regresses multi-scale regional features at every position of the whole image, which preserves accuracy while guaranteeing detection speed. When making predictions on the feature maps, convolution kernels predict the classes and coordinate offsets of a set of default bounding boxes.
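The default-box design summarized above can be sketched numerically. The scale formula follows the original SSD paper [8]; the function names and the aspect-ratio set below are illustrative, not the paper's exact configuration:

```python
# Sketch of SSD-style default boxes (after Liu et al. [8]).

def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scale s_k for feature map k (1-indexed), linearly spaced in [s_min, s_max]."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

def default_boxes_for_cell(cx, cy, s_k, aspect_ratios=(1.0, 2.0, 0.5)):
    """Default boxes (cx, cy, w, h) centered at one feature-map cell."""
    boxes = []
    for ar in aspect_ratios:
        # Width grows and height shrinks with the aspect ratio, keeping area s_k^2.
        boxes.append((cx, cy, s_k * ar ** 0.5, s_k / ar ** 0.5))
    return boxes
```

For six detection layers, this yields scales 0.2, 0.34, ..., 0.9, so shallow layers are responsible for small objects and deep layers for large ones.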

 Figure 1. Structure of the SSD algorithm

2.1 Feature Fusion

 Figure 2. Feature fusion module
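As one possible reading of the lightweight fusion described in contribution (1), merging feature maps of different layers into a pyramid, the following numpy sketch upsamples a deeper feature map and concatenates it with a shallower one. The shapes, the nearest-neighbor upsampling, and the channel concatenation are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse(shallow, deep):
    """Concatenate a shallow map with an upsampled deep map along channels."""
    factor = shallow.shape[1] // deep.shape[1]  # assumes spatial sizes divide evenly
    return np.concatenate([shallow, upsample_nn(deep, factor)], axis=0)
```

With VGG16-like shapes, fusing a 38x38 conv4_3 map with a 19x19 deeper map produces a 38x38 map whose channels carry both fine localization detail and stronger semantics.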

2.2 Receptive Field Module

 Figure 3. The RFM module
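A key property behind the RFM is that dilated convolution enlarges the receptive field without adding parameters: a k x k kernel with dilation rate d spans k + (k-1)(d-1) positions per axis while still holding only k^2 weights. A small sketch of this receptive-field arithmetic (the dilation rates used in the example are illustrative, not the paper's exact RFM configuration):

```python
def effective_kernel(k, d):
    """Span covered by a k x k kernel with dilation rate d (one axis)."""
    return k + (k - 1) * (d - 1)

def stacked_receptive_field(layers):
    """Receptive field of a stack of (kernel, dilation) conv layers, stride 1.

    Each layer adds (effective_kernel - 1) to the receptive field.
    """
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf
```

For instance, stacking 3x3 convolutions with hybrid dilation rates 1, 2, 3 gives a 13x13 receptive field using only three small kernels, which is the effect the module exploits.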

2.3 Algorithm Structure

 Figure 4. Structure of the proposed algorithm

 $L\left( {x,c,l,g} \right) = \frac{1}{N}\left( {L_{\rm conf} \left( {x,c} \right) + \alpha L_{\rm loc} \left( {x,l,g} \right)} \right)$ (1)

 $L_{\rm loc} \left( {x,l,g} \right) = \sum\limits_{i \in Pos}^{N} {\sum\limits_{m \in \left\{ {cx,cy,w,h} \right\}} {x_{ij}^{k}\, {\rm smooth}_{L1} \left( {l_i^m - \hat g_j^m } \right)} }$ (2)
 $\hat g_j^{cx} = {{\left( {g_j^{cx} - d_i^{cx} } \right)} / {d_i^{w} }}$ (3)
 $\hat g_j^{cy} = {{\left( {g_j^{cy} - d_i^{cy} } \right)} / {d_i^{h} }}$ (4)
 $\hat g_j^{w} = \log \left( {\frac{{g_j^{w} }}{{d_i^{w} }}} \right)$ (5)
 $\hat g_j^{h} = \log \left( {\frac{{g_j^{h} }}{{d_i^{h} }}} \right)$ (6)

 ${\rm smooth}_{L1} \left( x \right) = \left\{ {\begin{array}{ll} {0.5x^2 ,} & {{\rm if}\ \left| x \right| < 1} \\ {\left| x \right| - 0.5,} & {\rm otherwise} \end{array}} \right.$ (7)

 $\left\{ \begin{array}{l} L_{\rm conf} \left( {x,c} \right) = - \displaystyle \sum\limits_{i \in Pos}^{N} {x_{ij}^{p} } \log \left( {\hat c_i^{p} } \right) - \displaystyle \sum\limits_{i \in Neg} {\log \left( {\hat c_i^{0} } \right)} \\ {\rm where}\ \ \hat c_i^{p} = \dfrac{{\exp \left( {c_i^{p} } \right)}}{{\displaystyle \sum\nolimits_{p} {\exp \left( {c_i^{p} } \right)} }} \end{array} \right.$ (8)
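The box offset encoding of Eqs. (3)-(6) and the smooth L1 penalty of Eq. (7) can be written out directly. This numpy sketch is an illustration of the formulas, not the paper's training code:

```python
import numpy as np

def encode_box(g, d):
    """Encode ground-truth box g relative to default box d, Eqs. (3)-(6).

    Both boxes are (cx, cy, w, h); returns the regression target g_hat.
    """
    gcx, gcy, gw, gh = g
    dcx, dcy, dw, dh = d
    return np.array([(gcx - dcx) / dw,      # Eq. (3)
                     (gcy - dcy) / dh,      # Eq. (4)
                     np.log(gw / dw),       # Eq. (5)
                     np.log(gh / dh)])      # Eq. (6)

def smooth_l1(x):
    """Smooth L1 penalty, Eq. (7), applied elementwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)
```

When a default box matches its ground truth exactly, the target is the zero vector, so the localization loss of Eq. (2) vanishes, as expected.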

3 Experimental Analysis

3.1 Data Augmentation
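The paper's full augmentation pipeline is not reproduced here. As a minimal illustration of one standard SSD-style augmentation step, the following sketch applies a horizontal flip and mirrors the box coordinates accordingly (an assumed example, not the complete pipeline):

```python
import numpy as np

def hflip(image, boxes):
    """Horizontally flip an (H, W, C) image and its (x_min, y_min, x_max, y_max) boxes."""
    w = image.shape[1]
    flipped = image[:, ::-1, :].copy()
    boxes = np.asarray(boxes, dtype=float)
    out = boxes.copy()
    out[:, 0] = w - boxes[:, 2]  # new x_min mirrors old x_max
    out[:, 2] = w - boxes[:, 0]  # new x_max mirrors old x_min
    return flipped, out
```

Flipping twice must return the original image and boxes, which is a quick sanity check for the coordinate transform.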

3.2 Network Training Strategy

3.3 Analysis of Test Results on PASCAL VOC2007

PASCAL VOC is a standard dataset for object classification, recognition, and detection. It contains 20 categories, which are listed in Table 1.

 Figure 5. Detection speed and accuracy of different detection algorithms

4 Conclusion

 Figure 6. Example detection results on COCO 2017

[1] He KM, Gkioxari G, Dollár P, et al. Mask R-CNN. Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy. 2017. 2980–2988.
[2] Zheng YT, Pal DK, Savvides M. Ring loss: Convex feature normalization for face recognition. Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA. 2018. 5089–5097.
[3] Dollár P, Wojek C, Schiele B, et al. Pedestrian detection: A benchmark. Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA. 2009. 304–311.
[4] Wang XL, Gupta A. Videos as space-time region graphs. Proceedings of the 15th European Conference on Computer Vision. Munich, Germany. 2018. 413–431.
[5] Xie SN, Girshick R, Dollár P, et al. Aggregated residual transformations for deep neural networks. Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA. 2017. 5987–5990.
[6] Lin TY, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context. Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland. 2014. 740–755.
[7] Lowe DG. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004, 60(2): 91–110. DOI: 10.1023/B:VISI.0000029664.99615.94
[8] Liu W, Anguelov D, Erhan D, et al. SSD: Single shot MultiBox detector. Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands. 2016. 21–37.
[9] Lin TY, Dollár P, Girshick R, et al. Feature pyramid networks for object detection. Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA. 2017. 2117–2125.
[10] Fu CY, Liu W, Ranga A, et al. DSSD: Deconvolutional single shot detector. arXiv: 1701.06659, 2017.
[11] Kong T, Sun FC, Yao AB, et al. RON: Reverse connection with objectness prior networks for object detection. Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA. 2017. 5936–5944.
[12] Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning. Lille, France. 2015. 448–456.
[13] He KM, Zhang XY, Ren SQ, et al. Deep residual learning for image recognition. Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 2016. 770–778.
[14] Huang G, Liu Z, van der Maaten L, et al. Densely connected convolutional networks. Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, USA. 2017. 4700–4708.
[15] Shrivastava A, Gupta A, Girshick R. Training region-based object detectors with online hard example mining. Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA. 2016. 761–769.
[16] Singh B, Davis LS. An analysis of scale invariance in object detection–SNIP. Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA. 2018. 3578–3587.
[17] Cai ZW, Fan QF, Feris RS, et al. A unified multi-scale deep convolutional neural network for fast object detection. Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands. 2016. 354–370.
[18] Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv: 1804.02767, 2018.
[19] Lin TY, Goyal P, Girshick R, et al. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 318–327. DOI: 10.1109/TPAMI.2018.2858826
[20] Zhang SF, Wen LY, Bian X, et al. Single-shot refinement neural network for object detection. Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA. 2018. 4203–4212.
[21] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv: 1409.1556, 2014.
[22] Wang WF, Jin J, Chen JM. Fast small object detection algorithm based on receptive field. Laser & Optoelectronics Progress, 2020, 57(2): 021501. (in Chinese)
[23] Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning. Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, CA, USA. 2017. 4278–4284.
[24] Chen LC, Papandreou G, Schroff F, et al. Rethinking atrous convolution for semantic image segmentation. arXiv: 1706.05587, 2017.