Human Action Recognition Algorithm Based on Multi-Modal Features Learning

Authors: 周雪雪, 雷景生, 卓佳宁

Funding: National Natural Science Foundation of China (61672337)

    Abstract:

    Since the features obtained from a single action modality fail to accurately express complex human actions, this study proposes a human action recognition algorithm based on multi-modal feature learning. First, two channels extract the RGB features and 3D skeleton features from the action video. The first channel, the C3DP-LA network, consists of two parts: (1) an improved 3D CNN with Spatial Temporal Pyramid Pooling (STPP) and (2) an LSTM with spatial-temporal attention. The second channel is the Spatial-Temporal Graph Convolutional Network (ST-GCN). The two extracted features are then fused so that their strengths complement each other, and the fused features are classified with a Softmax classifier. The proposed algorithm is verified on the public datasets UCF101 and NTU RGB+D. Experiments show that it achieves higher recognition accuracy than existing action recognition algorithms.
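The STPP component mentioned above serves one concrete purpose: it converts a variable-size 3D CNN feature map into a fixed-length vector, so clips of different durations and resolutions can feed the same downstream LSTM. The paper does not publish code, so the following is only an illustrative sketch of the pyramid-pooling idea, assuming cubic bins per pyramid level and max pooling within each bin (the level set and pooling operator are assumptions, not taken from the paper):

```python
import numpy as np

def stpp(features, levels=(1, 2)):
    """Illustrative spatial-temporal pyramid pooling.

    features: array of shape (C, T, H, W), e.g. a 3D CNN feature map.
    For each pyramid level l, the (T, H, W) volume is split into l*l*l
    bins and each bin is max-pooled per channel. The concatenated result
    has fixed length C * sum(l**3 for l in levels), regardless of T, H, W.
    """
    C, T, H, W = features.shape
    pooled = []
    for l in levels:
        # Integer bin edges along time, height, and width.
        t_edges = np.linspace(0, T, l + 1).astype(int)
        h_edges = np.linspace(0, H, l + 1).astype(int)
        w_edges = np.linspace(0, W, l + 1).astype(int)
        for i in range(l):
            for j in range(l):
                for k in range(l):
                    bin_ = features[:,
                                    t_edges[i]:t_edges[i + 1],
                                    h_edges[j]:h_edges[j + 1],
                                    w_edges[k]:w_edges[k + 1]]
                    # Max over the bin's (t, h, w) extent -> one value per channel.
                    pooled.append(bin_.max(axis=(1, 2, 3)))
    return np.concatenate(pooled)

# Two clips of different length and resolution map to the same dimension.
a = stpp(np.random.rand(64, 16, 14, 14))
b = stpp(np.random.rand(64, 9, 7, 7))
assert a.shape == b.shape == (64 * (1 + 8),)
```

This fixed output size is what lets the improved 3D CNN accept videos of arbitrary length before the attention LSTM stage.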

Cite this article:

周雪雪, 雷景生, 卓佳宁. Human Action Recognition Algorithm Based on Multi-Modal Features Learning. 计算机系统应用 (Computer Systems & Applications), 2021, 30(4): 146–152.
History
  • Received: 2020-08-25
  • Revised: 2020-09-15
  • Published online: 2021-03-31