Scene Recognition Based on Improved CNN Features
 Computer Systems & Applications (计算机系统应用), 2018, Vol. 27, Issue 12: 25–32

Scene Recognition Algorithm Using Advanced CNN Features
BO Kang-Hu, LEE Fei-Fei, CHEN Qiu
School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
Foundation item: Distinguished Professor (Oriental Scholar) Plan in Universities of Shanghai (ES2015XX)
Abstract: With the development of artificial intelligence, scene recognition, one of the important directions of computer vision research, has attracted more and more attention from researchers. Traditional hand-crafted features cannot sufficiently describe the characteristics of scene images, which leads to unsatisfactory performance. In contrast, features extracted by Convolutional Neural Networks (CNN) contain rich semantic and structural information about the scene images. AlexNet, one of the most common architectures, is chosen as the base model in this study. By improving the network in four aspects, namely depth, width, multi-scale extraction, and multi-layer fusion, the proposed approach achieves accuracies of 92.0% and 94.5% on two publicly available datasets respectively, showing its superiority over other methods.
Key words: scene recognition; computer vision; Convolutional Neural Networks (CNN); AlexNet

1 Convolutional Neural Networks

 Figure 1 Standard convolutional neural network framework

1.1 Convolutional layer

 $h(t) = (x * \omega)(t) = \sum\nolimits_{a = -\infty}^{\infty} x(a)\,\omega(t - a)$ (1)
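Eq. (1) is the discrete 1-D convolution of an input signal $x$ with a kernel $\omega$. A minimal NumPy sketch (the signal and kernel values are illustrative; the result can be checked against `np.convolve`):

```python
import numpy as np

def conv1d(x, w):
    """Eq. (1): h(t) = sum_a x(a) * w(t - a), over the full support."""
    T = len(x) + len(w) - 1
    h = np.zeros(T)
    for t in range(T):
        for a in range(len(x)):
            if 0 <= t - a < len(w):
                h[t] += x[a] * w[t - a]
    return h

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.0, 1.0, 0.5])
print(conv1d(x, w))   # [0.  1.  2.5 4.  1.5], identical to np.convolve(x, w)
```

In a CNN the kernel has finite support and is learned, but the summation is exactly this one, applied per spatial position and channel.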

1.2 Pooling layer

 Figure 2 Example of max pooling
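Max pooling (Figure 2) keeps only the largest activation in each window, downsampling the feature map. A sketch with a 2×2 window and stride 2 (the input values are illustrative, and spatial sizes are assumed divisible by the stride):

```python
import numpy as np

def max_pool2d(fmap, k=2, s=2):
    """Max pooling: take the maximum of each k x k window, moving with stride s."""
    H, W = fmap.shape
    out = np.zeros(((H - k) // s + 1, (W - k) // s + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*s:i*s+k, j*s:j*s+k].max()
    return out

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [7, 2, 9, 8],
                 [1, 0, 3, 4]], dtype=float)
print(max_pool2d(fmap))   # [[6. 5.]
                          #  [7. 9.]]
```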

1.3 Fully connected layer

2 Improved CNN Network Model

2.1 Basic principles

(1) Avoid representational bottlenecks in the early stages of the network

(2) Balance the depth and width of the model against the size of its convolution kernels

(3) Reduce model complexity

① Time complexity

 $\mathit{Time} \sim O(M^2 \cdot K^2 \cdot C_{\mathrm{in}} \cdot C_{\mathrm{out}})$ (2)

 $M = (X - K + 2 \cdot \mathit{padding})/\mathit{stride} + 1$ (3)

 $\mathit{Time} \sim O\left(\sum\nolimits_{l = 1}^{D} M_l^2 \cdot K_l^2 \cdot C_{l-1} \cdot C_l\right)$ (4)
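Eqs. (2)–(4) say that the multiply–accumulate count of a convolutional layer grows with the squared output size $M^2$, the squared kernel size $K^2$, and the input/output channel counts, summed over all $D$ layers. A sketch evaluating Eq. (3) and Eq. (2) for the first convolutional layer of the standard AlexNet (227×227 RGB input, 11×11 kernel, stride 4, no padding, 96 filters):

```python
def output_size(X, K, padding, stride):
    """Eq. (3): spatial size M of the output feature map."""
    return (X - K + 2 * padding) // stride + 1

def layer_time(M, K, c_in, c_out):
    """Eq. (2): multiply-accumulate count of one conv layer, O(M^2 K^2 C_in C_out)."""
    return M * M * K * K * c_in * c_out

M = output_size(227, 11, 0, 4)
print(M)                            # 55
print(layer_time(M, 11, 3, 96))     # 105415200 multiply-accumulates
```

Summing `layer_time` over every layer, as Eq. (4) does, gives the forward-pass cost of the whole model.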

② Space complexity (scale of the model parameters)

 $\mathit{Space} \sim O\left(\sum\nolimits_{l = 1}^{D} K_l^2 \cdot C_{l-1} \cdot C_l\right)$ (5)
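Eq. (5) counts convolution weights only: each layer contributes $K_l^2 \cdot C_{l-1} \cdot C_l$ parameters. A sketch summing this over the five convolutional layers of the standard AlexNet (biases and fully connected layers excluded):

```python
def conv_param_count(layers):
    """Eq. (5): total conv weights, sum over layers of K^2 * C_in * C_out."""
    return sum(K * K * c_in * c_out for K, c_in, c_out in layers)

# Standard AlexNet conv layers as (kernel size, in channels, out channels).
alexnet_convs = [(11, 3, 96), (5, 96, 256), (3, 256, 384), (3, 384, 384), (3, 384, 256)]
print(conv_param_count(alexnet_convs))   # 3745824, i.e. ~3.7 M conv weights
```

Note that space complexity is independent of the input resolution: it depends only on kernel sizes and channel counts, which is why widening the network (Algorithm 1 below) costs parameters even when the feature maps stay small.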

③ Number of neurons

2.2 Enhanced AlexNet network model

 Figure 3 AlexNet learning framework

 Figure 4 Example of layout information in scene images

 Figure 5 Example of scene image segmentation

(1) Algorithm 1. Changing the number of filters.

(2) Algorithm 2. Deepening the network.

 Figure 6 Effect of the number of filters on recognition accuracy

 Figure 7 Effect of network depth on recognition accuracy

(3) Algorithm 3. Multi-scale feature extraction.

(4) Algorithm 4. Multi-layer feature fusion.

 Figure 8 Multi-scale extraction structure
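Multi-scale extraction (Figure 8) runs filters of several sizes over the same input in parallel and concatenates the resulting feature maps along the channel axis. A shape-level NumPy sketch, with the branch convolutions simulated by random stand-in outputs (the kernel sizes, channel counts, and input shape are all illustrative; padding is assumed to preserve spatial size):

```python
import numpy as np

def multi_scale(fmap, branches):
    """Concatenate per-branch feature maps along the channel axis (Fig. 8).
    Each branch is simulated here as a random output of matching spatial size,
    standing in for a same-padded conv of the given kernel size."""
    H, W, _ = fmap.shape
    outs = [np.random.randn(H, W, c_out) for _kernel, c_out in branches]
    return np.concatenate(outs, axis=-1)

x = np.random.randn(13, 13, 256)
# Illustrative branches: (kernel size, output channels).
y = multi_scale(x, [(1, 64), (3, 128), (5, 64)])
print(y.shape)   # (13, 13, 256): the branch channel counts add up
```

The point of the sketch is the bookkeeping: the branches must agree on spatial size, and the fused output's channel count is the sum of the branch channel counts.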

 Figure 9 Multi-layer feature fusion learning framework

 Figure 10 Selection of blocks in multi-layer feature fusion learning
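Multi-layer feature fusion (Figures 9 and 10) flattens the activations of several selected blocks and joins them into a single descriptor before classification. A minimal sketch (the layer shapes are illustrative, not the paper's exact block choices):

```python
import numpy as np

def fuse_features(feature_maps):
    """Flatten each selected layer's activation and concatenate into one vector."""
    return np.concatenate([f.ravel() for f in feature_maps])

# Illustrative activations from two conv blocks and one fully connected layer.
conv_a = np.random.randn(6, 6, 256)
conv_b = np.random.randn(13, 13, 384)
fc     = np.random.randn(4096)
desc = fuse_features([conv_a, conv_b, fc])
print(desc.shape)   # (78208,) = 6*6*256 + 13*13*384 + 4096
```

Fusing layers of different depths lets the descriptor carry both low-level structure from early blocks and high-level semantics from late ones, at the cost of a longer feature vector.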

3 Experiments and Analysis

3.1 Datasets

(1) Dataset selection

 Figure 11 Example images from the Scene15 dataset

(2) Image preprocessing

3.2 Experimental setup

(1) Experimental platform

(2) Parameter settings

(3) Training the model by fine-tuning

 Figure 12 Results of fine-tuning the AlexNet model

3.3 Experiments and analysis

4 Conclusion and Future Work

 Figure 13 Confusion matrix of the improved model on the eight sports dataset
