Computer Systems & Applications, 2019, Vol. 28, Issue (7): 234-239

Human Action Recognition Algorithm Based on Two-Stream Convolutional Networks
LIU Yun, ZHANG Kun, WANG Chuan-Xu
Information Science and Technology Academy, Qingdao University of Science and Technology, Qingdao 266061, China
Foundation item: National Natural Science Foundation of China (61472196, 61672305)
Abstract: Given a long, untrimmed video consisting of multiple action instances and complex background content, temporal action detection must not only recognize the action category of each instance, but also localize its start and end times. To this end, a temporal action detection network based on two-stream convolutional networks is proposed. First, a two-stream convolutional network is used to extract a feature sequence from the video, and TAG (Temporal Actionness Grouping) is then applied to generate proposals. To construct high-quality proposals, each proposal is fed into a boundary regression network that corrects its boundaries to bring them closer to the ground truth; the proposal is then extended into a three-segment feature design that incorporates context information, and finally a multi-layer perceptron is used to classify the action. Experimental results show that the proposed algorithm achieves competitive mAP on the THUMOS 2014 and ActivityNet v1.3 datasets.
Key words: human action recognition     two-stream convolutional networks     deep learning     temporal action localization

1 Introduction

2 Recognition Model

Figure 1 Human action recognition model based on two-stream convolutional neural networks

2.1 Problem Description

2.2 Feature Sequence Extraction

2.3 Action Proposals

2.4 Boundary Regression

 $o_s = s_{clip} - s_{gt},\; o_e = e_{clip} - e_{gt}$ (1)

 $p_c^K = [t_s^K,t_e^K]$ (2)
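The offset targets of Eq. (1) and the corrected proposal $[t_s^K, t_e^K]$ of Eq. (2) can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the function names, the toy regressor, and the assumption that refinement subtracts predicted offsets over $K$ rounds are all hypothetical.

```python
# Sketch of the boundary-offset targets in Eq. (1) and the K-step boundary
# refinement that yields the corrected proposal of Eq. (2).
# All names are illustrative, not from the paper.

def boundary_offsets(s_clip, e_clip, s_gt, e_gt):
    """Regression targets: offsets between a proposal and its ground truth."""
    o_s = s_clip - s_gt   # start-time offset
    o_e = e_clip - e_gt   # end-time offset
    return o_s, o_e

def refine_proposal(t_s, t_e, predict_offsets, K=2):
    """Apply K rounds of predicted offsets to tighten the proposal boundaries."""
    for _ in range(K):
        o_s, o_e = predict_offsets(t_s, t_e)
        t_s, t_e = t_s - o_s, t_e - o_e   # undo the offsets defined in Eq. (1)
    return t_s, t_e   # corrected proposal [t_s^K, t_e^K] of Eq. (2)

# Toy regressor that always predicts a fixed drift toward the ground truth.
refined = refine_proposal(10.0, 25.0, lambda s, e: (0.5, -0.5))
```

With the toy regressor above, two refinement rounds move the proposal from [10.0, 25.0] to [9.0, 26.0], each round widening both boundaries by the predicted offsets.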
2.5 Proposal Features

Figure 2 Boundary regression network processing action proposal boundaries

Figure 3 Construction of action proposal features

2.6 Action Classification

 $L = L_{cls} + \lambda L_{reg}$ (3)

 $L_{reg} = \dfrac{1}{N}\sum\limits_{i = 1}^N \sum\limits_{z = 1}^n l_i^z \left[ R\left( o_{s,i}^{\prime z} - o_{s,i}^z \right) + R\left( o_{e,i}^{\prime z} - o_{e,i}^z \right) \right]$ (4)
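The multi-task loss of Eqs. (3) and (4) can be sketched as follows. This is a hypothetical NumPy rendering under assumptions not stated in this excerpt: $R(\cdot)$ is taken to be the smooth-L1 function commonly used for box regression, and $l_i^z$ is treated as a 0/1 mask selecting positive proposals. All function names are illustrative.

```python
import numpy as np

# Sketch of the loss in Eqs. (3)-(4). Assumptions: R is smooth-L1 and
# l is a 0/1 mask over positive proposals. Names are illustrative.

def smooth_l1(x):
    """R(x): 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

def regression_loss(o_s_pred, o_e_pred, o_s, o_e, l):
    """L_reg of Eq. (4): predicted vs. target offsets, shape (N, n)."""
    N = o_s.shape[0]
    per_term = l * (smooth_l1(o_s_pred - o_s) + smooth_l1(o_e_pred - o_e))
    return per_term.sum() / N

def total_loss(L_cls, L_reg, lam=1.0):
    """L of Eq. (3): classification loss plus weighted regression loss."""
    return L_cls + lam * L_reg
```

The mask `l` zeroes out proposals without a matched ground-truth instance, so only positive proposals contribute to the boundary-regression term, mirroring the role of $l_i^z$ in Eq. (4).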

3 Experiments

3.1 Datasets

ActivityNet v1.3[1] is a large-scale dataset for temporal action detection, containing 19,994 long videos annotated with 200 action classes; it was used in the 2017 and 2018 ActivityNet challenges. ActivityNet is split into training, validation, and test sets in a 2:1:1 ratio.

THUMOS 2014[2] provides 1010 videos for validation and 1574 videos for testing. These videos contain target actions annotated with 20 action classes. The dataset has no dedicated training set; the UCF101 dataset serves as its training set. Since that training set provides no temporal annotations, the model in this paper is trained on the validation set and evaluated on the test set; accordingly, the 220 videos annotated with the 20 action classes are used for training. In the experiments, the proposed method is compared with existing approaches on THUMOS 2014 and ActivityNet v1.3, and the results are analyzed.

3.2 Network Parameter Settings

3.3 Analysis of Experimental Results

4 Conclusion and Outlook

References

[1] Heilbron FC, Escorcia V, Ghanem B, et al. ActivityNet: A large-scale video benchmark for human activity understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA. 2015. 961–970.
[2] Idrees H, Zamir AR, Jiang YG, et al. The THUMOS challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding, 2017, 155: 1–23. DOI:10.1016/j.cviu.2016.10.018
[3] Buch S, Escorcia V, Shen CQ, et al. SST: Single-stream temporal action proposals. Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA. 2017. 2911–2920.
[4] Heilbron FC, Niebles JC, Ghanem B. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 2016. 1914–1923.
[5] Escorcia V, Heilbron FC, Niebles JC, et al. DAPs: Deep action proposals for action understanding. European Conference on Computer Vision. The Netherlands. 2016. 768–784.
[6] Gao JY, Yang ZH, Chen K, et al. TURN TAP: Temporal unit regression network for temporal action proposals. Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy. 2017. 3628–3636.
[7] Shou Z, Wang DG, Chang SF. Temporal action localization in untrimmed videos via multi-stage CNNs. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 2016. 1049–1058.
[8] Yeung S, Russakovsky O, Mori G, et al. End-to-end learning of action detection from frame glimpses in videos. Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 2016. 2678–2687.
[9] De Geest R, Gavves E, Ghodrati A, et al. Online action detection. European Conference on Computer Vision. The Netherlands. 2016. 269–284.
[10] Tran D, Bourdev L, Fergus R, et al. Learning spatiotemporal features with 3D convolutional networks. Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile. 2015. 4489–4497.
[11] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal, Canada. 2014. 568–576.
[12] Zhao Y, Xiong YJ, Wang LM, et al. Temporal action detection with structured segment networks. Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy. 2017. 2914–2923.
[13] Pont-Tuset J, Arbeláez P, Barron JT, et al. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(1): 128–140. DOI:10.1109/TPAMI.2016.2537320
[14] Wang L, Qiao Y, Tang X. Action recognition and detection by combining motion and appearance features. THUMOS14 Action Recognition Challenge, 2014, 1(2): 2.
[15] Oneata D, Verbeek J, Schmid C. The LEAR submission at THUMOS 2014. THUMOS Action Recognition Challenge, 2014.
[16] Richard A, Gall J. Temporal action detection using a statistical language model. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 2016. 3131–3140.
[17] Dai XY, Singh B, Zhang GY, et al. Temporal context network for activity localization in videos. Proceedings of the IEEE International Conference on Computer Vision. Venice, Italy. 2017. 5793–5802.
[18] Nguyen P, Han B, Liu T, et al. Weakly supervised action localization by sparse temporal pooling network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA. 2018. 6752–6761.
[19] Singh B, Marks TK, Jones M, et al. A multi-stream bi-directional recurrent neural network for fine-grained action detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA. 2016. 1961–1970.
[20] Heilbron FC, Barrios W, Escorcia V, et al. SCC: Semantic context cascade for efficient action detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA. 2017. 1454–1463.
[21] Kong WJ, Li NN, Liu S, et al. BLP: Boundary likelihood pinpointing networks for accurate temporal action localization. arXiv preprint arXiv:1811.02189, 2018.