Computer Systems & Applications  2019, Vol. 28 Issue (7): 234-239

Human Action Recognition Algorithm Based on Two-Stream Convolutional Networks
LIU Yun, ZHANG Kun, WANG Chuan-Xu
Information Science and Technology Academy, Qingdao University of Science and Technology, Qingdao 266061, China
Foundation item: National Natural Science Foundation of China (61472196, 61672305)
Abstract: Given a long, untrimmed video containing multiple action instances and complex background content, temporal action detection must not only recognize the category of each action instance but also localize its start and end times. To this end, a temporal action detection network based on two-stream convolutional networks is proposed. First, the two-stream convolutional networks extract the feature sequence of the video, and TAG (Temporal Actionness Grouping) then generates proposals. To construct high-quality proposals, each proposal is fed to a boundary regression network that corrects its boundaries and brings them closer to the ground truth; the proposal is then extended into a three-segment feature design with context information, and finally a multi-layer perceptron identifies the action. Experimental results show that the proposed algorithm achieves a high mAP on the THUMOS 2014 and ActivityNet v1.3 datasets.
Key words: human action recognition     two-stream convolutional networks     deep learning     temporal action localization

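1 Introduction

2 Recognition Model

 Fig. 1 Human action recognition model based on two-stream convolutional neural networks

2.1 Problem Description

2.2 Feature Sequence Extraction

2.3 Action Proposals

2.4 Boundary Regression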

 ${o_s} = {s_{clip}} - {s_{gt}},{o_e} = {e_{clip}} - {e_{gt}}$ (1)
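
To make Eq. (1) concrete, the following minimal Python sketch computes the start and end offsets between a clip and its ground-truth instance. The function names `boundary_offsets` and `apply_offsets`, the tuple layout, and the example times are illustrative assumptions rather than part of the paper; `apply_offsets` simply inverts Eq. (1) to show how a regressed offset could move a boundary back toward the ground truth.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start time, end time), an assumed layout


def boundary_offsets(clip: Interval, gt: Interval) -> Tuple[float, float]:
    """Return (o_s, o_e) as in Eq. (1): clip boundary minus ground-truth boundary."""
    s_clip, e_clip = clip
    s_gt, e_gt = gt
    return s_clip - s_gt, e_clip - e_gt


def apply_offsets(clip: Interval, offsets: Tuple[float, float]) -> Interval:
    """Hypothetical helper: subtract offsets from a clip, inverting Eq. (1)."""
    s_clip, e_clip = clip
    o_s, o_e = offsets
    return s_clip - o_s, e_clip - o_e


# Example with made-up times in seconds.
offsets = boundary_offsets((12.0, 20.0), (10.0, 21.5))   # (2.0, -1.5)
refined = apply_offsets((12.0, 20.0), offsets)            # (10.0, 21.5)
```

With a perfect offset prediction the refined clip coincides with the ground truth; in practice the offsets would come from the regression network and only approximate it.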

 $p_c^K = [t_s^K,t_e^K]$ (2)
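2.5 Proposal Features

 Fig. 2 Boundary regression network processing action proposal boundaries

 Fig. 3 Construction of action proposal features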
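
The abstract describes extending each proposal into a three-segment feature design with context information. One possible reading is sketched below in Python, under several assumptions that do not come from the paper: the context window is a fixed fraction of the proposal length (`context_ratio`), each segment is summarized by average pooling, and the three pooled vectors are concatenated. The function name `three_segment_feature` and all parameters are hypothetical.

```python
from typing import List


def mean_pool(rows: List[List[float]], dim: int) -> List[float]:
    """Average a list of feature vectors; return zeros if the segment is empty."""
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]


def three_segment_feature(features: List[List[float]], start: int, end: int,
                          context_ratio: float = 0.5) -> List[float]:
    """Concatenate pooled features of [context before | proposal | context after].

    features: per-snippet feature vectors, one per time step.
    start, end: snippet indices of the proposal, end exclusive, start < end.
    context_ratio: assumed context length as a fraction of the proposal length.
    """
    num_steps = len(features)
    dim = len(features[0])
    ctx = max(1, int(round((end - start) * context_ratio)))
    segments = [
        (max(0, start - ctx), start),        # context before the proposal
        (start, end),                        # the proposal itself
        (end, min(num_steps, end + ctx)),    # context after the proposal
    ]
    out: List[float] = []
    for s, e in segments:
        out.extend(mean_pool(features[s:e], dim))
    return out
```

The output has three times the per-snippet feature dimension, so a downstream classifier can weigh what happens just before and just after a proposal separately from its interior.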

2.6 Action Classification

 $L = L_{cls} + \lambda L_{reg}$ (3)

 $L_{reg} = \frac{1}{N}\sum\limits_{i = 1}^{N} \sum\limits_{z = 1}^{n} l_i^z \left[ R\left( o_{s,i}^{\prime z} - o_{s,i}^{z} \right) + R\left( o_{e,i}^{\prime z} - o_{e,i}^{z} \right) \right]$ (4)
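
The combined objective in Eqs. (3) and (4) can be illustrated with a short Python sketch. The extracted text does not define $R$, so the smooth L1 penalty used here is an assumption, as is reading $l_i^z$ as a per-class indicator for sample $i$; the function names and the default $\lambda = 1$ are likewise hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # shape (N samples, n classes)


def smooth_l1(x: float) -> float:
    """Smooth L1 penalty, assumed here as the function R in Eq. (4)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def regression_loss(labels: Matrix, o_prime_s: Matrix, o_s: Matrix,
                    o_prime_e: Matrix, o_e: Matrix) -> float:
    """Eq. (4): label-weighted penalty on start and end offset differences, averaged over N."""
    n_samples = len(labels)
    total = 0.0
    for i in range(n_samples):
        for z in range(len(labels[i])):
            penalty = (smooth_l1(o_prime_s[i][z] - o_s[i][z])
                       + smooth_l1(o_prime_e[i][z] - o_e[i][z]))
            total += labels[i][z] * penalty
    return total / n_samples


def total_loss(cls_loss: float, reg_loss: float, lam: float = 1.0) -> float:
    """Eq. (3): L = L_cls + lambda * L_reg."""
    return cls_loss + lam * reg_loss
```

Because the label term multiplies each penalty, offsets for classes a sample does not belong to contribute nothing to the regression loss, whatever values they take.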

3 Experiments

3.1 Datasets

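ActivityNet v1.3[1] is a large-scale dataset for temporal action detection. It contains 19994 long videos annotated with 200 action classes and was used in the 2017 and 2018 ActivityNet challenges. ActivityNet is split into training, validation, and test sets in a 2:1:1 ratio.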

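THUMOS 2014[2] provides 1010 videos for validation and 1574 videos for testing, covering 20 target action classes with action annotations. The dataset has no training set of its own and uses the UCF101 dataset as the training set. Because that training set provides no temporal annotations, this paper trains the model on the validation set and evaluates it on the test set; accordingly, the 220 videos annotated with the 20 action classes are used for training. In the experiments, the proposed method is compared with existing techniques on THUMOS 2014 and ActivityNet v1.3, and the results are analyzed.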

3.2 Network Parameter Settings

3.3 Analysis of Experimental Results

4 Conclusion and Outlook

