计算机系统应用 2019, Vol. 28 Issue (5): 42-48

Human Action Recognition Based on Visual Attention
KONG Yan, LIANG Hong, ZHANG Qian
College of Computer & Communication Engineering, China University of Petroleum, Qingdao 266580, China
Foundation item: Special Fund for Innovation Method by Ministry of Science and Technology of the People’s Republic of China (2015IM010300)
Abstract: Recognition of human actions in videos has been an important research area in computer vision in recent years. However, existing methods represent videos insufficiently and cannot focus on the salient regions within an image. We propose a deep convolutional neural network based on visual attention, which effectively weights the video representation features, attends to the informative regions within them, and achieves more accurate action recognition. We conducted experiments on HMDB51 and our own Oilfield-7 dataset to verify the validity of the proposed model for human actions in oilfields. The experimental results show that the proposed method compares favorably with two-stream architectures that have achieved excellent performance.
Key words: action recognition     two-stream architecture     Convolutional Neural Network (CNN)     video representation     visual attention

1 Introduction

2 Related Work

3 Visual Attention Deep Convolutional Network

3.1 Temporal Segment Networks

 $TSN({T_1},{T_2},\cdots,{T_k}) = H(G(F({T_1};W), F({T_2};W), \cdots, F({T_k};W)))$ (1)

 $L(y,G) = - \sum\limits_{i = 1}^{C} {y_i}\Big({G_i} - \log\sum\limits_{j = 1}^{C} \exp {G_j}\Big)$ (2)
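Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the consensus function $G$ is assumed here to be a simple average over the $k$ snippet scores (one common choice), and the loss uses a numerically stable log-sum-exp:

```python
import numpy as np

def segmental_consensus(segment_scores):
    """G in Eq. (1): aggregate the k per-snippet class score vectors.
    Averaging is assumed as the consensus function."""
    return np.mean(segment_scores, axis=0)

def tsn_loss(y, G):
    """Cross-entropy loss of Eq. (2): y is a one-hot label vector,
    G the consensus class scores; max(G) is subtracted for stability."""
    shifted = G - np.max(G)
    log_softmax = shifted - np.log(np.sum(np.exp(shifted)))
    return -np.sum(y * log_softmax)
```

Because the loss depends on the aggregated scores rather than on any single snippet, the gradients update the shared weights $W$ from the whole video at once.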

3.2 Model Architecture

AttConv-net assigns larger weights to the features extracted by the spatial and temporal networks of the two-stream architecture, allowing the model to easily localize regions of interest and thus classify more accurately. As shown in Fig. 1, the architecture follows the basic two-stream design and is divided into a spatial-stream network and a temporal-stream network. Our AttConv-net modifies TSN by attaching an attention model to the features extracted by the last convolutional layer of the spatial and temporal networks, respectively; the weighted features are then fed into fully connected layers and a Softmax to predict each stream's class probabilities, and the results of the spatial and temporal streams are fused before the final video-level class is determined. Given a complete video V, it is processed into a series of segments ${S_i}\left( {i = 1,2,\cdots, k} \right)$, where $k$ is the number of equal divisions of the whole video; each segment contains one RGB frame and two optical-flow frames. CNNs extract the global visual features ${F_{\rm{RGB}}} = \left( {{F_1},{F_2},{F_3},\cdots,{F_L}} \right)$ of the RGB frame and ${F_{\rm{OF}}} = \left( {{F_1},{F_2},{F_3},\cdots,{F_L}} \right)$ of the optical-flow frames, where $L$ is the number of regions each image is divided into and each region is an $m$-dimensional vector. After processing by the attention mechanism, the features ${F_{att{\rm{RGB}}}}$ and ${F_{att{\rm{OF}}}}$ are obtained, and each segment ${S_i}$ then yields class scores ${C_{Si}}$ and ${C_{Ti}}$ in the two streams. After the consensus function $G\left( \cdot \right)$, the two-stream results are fed into a Softmax function to compute probabilities, yielding the classification result $W$ for the whole video. The workflow can be summarized by the following formulas:

 ${F_{att{\rm{RGB}}}} = f\left( {{F_{\rm{RGB}}}} \right)$ (3)
 ${F_{att{\rm{OF}}}} = f\left( {{F_{\rm{OF}}}} \right)$ (4)
 ${g_S} = G\Big( {\sum\limits_{i = 1}^{k} {{C_{Si}}} } \Big)$ (5)
 ${g_T} = G\Big( {\sum\limits_{i = 1}^{k} {{C_{Ti}}} } \Big)$ (6)
 $W = {\rm{Softmax}}\left( {{g_S},{g_T}} \right)$ (7)
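The stream-level part of this workflow, Eqs. (5)–(7), can be sketched as follows. This is an illustrative NumPy sketch under two assumptions not fixed by the text above: the consensus $G$ is taken as an average, and the spatial and temporal consensus scores are fused with a 1:1.5 weighting borrowed from common two-stream practice:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_two_streams(spatial_scores, temporal_scores, w_s=1.0, w_t=1.5):
    """Eqs. (5)-(7): average the per-segment class scores C_Si and C_Ti
    within each stream, fuse the two consensus vectors with fixed
    weights (an assumption), and apply Softmax to obtain the
    video-level prediction W."""
    g_s = np.mean(spatial_scores, axis=0)   # g_S, Eq. (5)
    g_t = np.mean(temporal_scores, axis=0)  # g_T, Eq. (6)
    return softmax(w_s * g_s + w_t * g_t)   # W, Eq. (7)
```

Each input is a $(k, C)$ matrix of per-segment class scores; the output is a single probability vector over the $C$ action classes for the whole video.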

 Fig. 1 Structure of the AttConv-net model

3.3 Visual Attention Model

The attention model in AttConv-net attaches a weight between 0 and 1 to the feature vectors output by the last convolutional layer, thereby focusing on the salient regions of the image. The model structure is shown in Fig. 2: video segments are fed into the network, and the spatial and temporal streams each perform their own convolutions; the dashed box in the figure indicates that the spatial and temporal streams undergo the same attention processing, and the output scores are the individual scores of the two streams. The spatial-stream features $F_{\rm{RGB}}^t$ and temporal-stream features $F_{\rm{OF}}^t$ extracted by the CNN are both $L \times m$-dimensional, i.e., the image has $L$ regions, each represented by an $m$-dimensional feature vector:

 $F_{\rm{RGB/OF}}^t = \left\{ {F_1^t,F_2^t,F_3^t,\cdots, F_L^t} \right\},\; F_i^t \in {\mathbb{R}^m},\; t = 1,2,\cdots, k$ (8)

 $e_i^t = {O_{att}}\left( {F_{\rm{RGB/OF}}^t} \right)$ (9)

 $\alpha _n^t = \frac{{\exp \left( {e_n^t} \right)}}{{\sum\limits_{j = 1}^{L} {\exp \left( {e_j^t} \right)} }}$ (10)

 ${F_{att{\rm{RGB/OF}}}} = \sum\limits_{n = 1}^{L} {\alpha _n^t F_n^t}$ (11)

AttConv-net then feeds ${F_{att{\rm{RGB/OF}}}}$ into the fully connected layer. The network with the attention mechanism can still be optimized through standard back-propagation.
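A minimal NumPy sketch of the attention pooling in Eqs. (9)–(11). Here $O_{att}$ is assumed to be a single learned linear scoring vector, which is one hypothetical choice; the text above does not fix its exact form:

```python
import numpy as np

def attention_pool(F, att_w):
    """Eqs. (9)-(11): F is the (L, m) matrix of region features of one
    frame; att_w is an (m,) scoring vector standing in for O_att
    (assumed linear). Returns the attention-weighted m-dim feature."""
    e = F @ att_w                 # unnormalized region scores, Eq. (9)
    a = np.exp(e - e.max())
    alpha = a / a.sum()           # weights in (0, 1) summing to 1, Eq. (10)
    return alpha @ F              # weighted sum over the L regions, Eq. (11)
```

Since the weighting is a differentiable softmax followed by a weighted sum, the whole module trains jointly with the convolutional and fully connected layers by back-propagation, as noted above.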

 Fig. 2 Network structure of AttConv-net

4 Experiments

4.1 Oilfield Personnel Action Dataset

4.2 Implementation Details

4.3 Results and Analysis

5 Conclusion and Future Work

 Fig. 3 Visualization of attention changes for selected actions in the Oilfield-7 dataset
