Spatio-Temporal Action Localization Algorithm Based on 3D-SVD

Authors: 王紫烟, 张立华, 翟鹏, 杜洋涛

Funding: Shanghai Science and Technology Committee Project (19511132000)
    Abstract:

    With the popularity of video surveillance, action analysis technology based on artificial intelligence is playing an increasingly important role in the field of intelligent surveillance. Most existing algorithms depend on an optical flow network or a 3D network to obtain the temporal information of actions. However, optical flow networks and general 3D networks require a large amount of computation, and their efficiency is low when classification and localization are carried out simultaneously. To solve this problem, this study builds a two-stream framework capable of spatial localization and classification and follows the idea of SVD to decompose the 3D convolution kernels in the 3D network branch, thus reducing the number of 3D network parameters. In addition, a dynamic programming algorithm is employed to efficiently search for the optimal action tubes, and the mixup algorithm is used to expand the datasets during training, thereby enhancing the training results. Finally, experimental verification is carried out on UCF101-24 and J-HMDB-21, two widely used datasets for action localization. Compared with the baseline algorithm, the Frame-mAP on the two datasets is improved by 7.1% and 4.8%, respectively, and the Video-mAP of J-HMDB-21 under different IoU thresholds is enhanced by 5.2% and 4.8%. Experimental results show that the proposed algorithm substantially improves action localization ability and achieves better results than other algorithms.
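The abstract does not give the exact factorization scheme, but the SVD idea for shrinking 3D convolution kernels (cf. refs [9, 10]) can be sketched as follows: flatten each t×k×k kernel into a (t, k·k) matrix and keep a truncated SVD, so a small temporal filter and a spatial filter stand in for the full 3D kernel. The rank-1 choice and all names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def svd_factorize_kernel(kernel3d, rank=1):
    # Flatten the t x k x k kernel into a (t, k*k) matrix and keep the
    # leading SVD terms: a (t, rank) temporal filter bank and rank
    # k x k spatial filters. This is a sketch of the idea, not the
    # paper's exact decomposition.
    t, kh, kw = kernel3d.shape
    mat = kernel3d.reshape(t, kh * kw)
    U, S, Vt = np.linalg.svd(mat, full_matrices=False)
    temporal = U[:, :rank] * S[:rank]           # (t, rank)
    spatial = Vt[:rank].reshape(rank, kh, kw)   # (rank, k, k)
    return temporal, spatial

rng = np.random.default_rng(0)
k = rng.standard_normal((3, 3, 3))              # one 3x3x3 kernel

temporal, spatial = svd_factorize_kernel(k)
full_params = k.size                            # 3*3*3 = 27
factored_params = temporal.size + spatial.size  # 3 + 9 = 12
approx = (temporal @ spatial.reshape(1, -1)).reshape(k.shape)
print(full_params, factored_params)             # → 27 12
```

The parameter saving grows with kernel size; at rank equal to min(t, k·k) the factorization reproduces the original kernel exactly.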
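The dynamic-programming tube search can likewise be illustrated with a common Viterbi-style formulation (cf. refs [4, 27]): choose one detection per frame so that the sum of detection scores, plus an IoU-overlap bonus between boxes in consecutive frames, is maximized. The scoring function and the `lam` weight below are assumptions for illustration, not necessarily the paper's exact objective.

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_tube(boxes, scores, lam=1.0):
    # Viterbi-style linking: dp[t][j] is the best cumulative score of any
    # tube ending in box j of frame t; back[t-1][j] remembers which box
    # in frame t-1 achieved it.
    T = len(boxes)
    dp = [np.asarray(scores[0], dtype=float)]
    back = []
    for t in range(1, T):
        cur = np.zeros(len(boxes[t]))
        ptr = np.zeros(len(boxes[t]), dtype=int)
        for j in range(len(boxes[t])):
            trans = [dp[t - 1][i] + lam * iou(boxes[t - 1][i], boxes[t][j])
                     for i in range(len(boxes[t - 1]))]
            ptr[j] = int(np.argmax(trans))
            cur[j] = scores[t][j] + trans[ptr[j]]
        dp.append(cur)
        back.append(ptr)
    # Backtrack the highest-scoring path.
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 2, -1, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: two frames, two candidate boxes per frame.
boxes = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],   # frame 0
    [(1, 1, 11, 11), (50, 50, 60, 60)],   # frame 1
]
scores = [[0.9, 0.5], [0.4, 0.6]]
tube = link_tube(boxes, scores)
```

Running the DP over all frames costs O(T·n²) for n candidates per frame, which is what makes an exhaustive tube search tractable.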
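The mixup augmentation mentioned above takes convex combinations of sample pairs and their labels. A minimal sketch, assuming the standard mixup formulation with a Beta-distributed mixing weight (the `alpha` value and clip shapes are illustrative, not the paper's settings):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # Standard mixup: blend two samples and their one-hot labels with a
    # weight drawn from Beta(alpha, alpha).
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing weight in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2       # blended input clip
    y = lam * y1 + (1.0 - lam) * y2       # blended soft label
    return x, y, lam

# Two dummy video clips (frames, H, W, channels) with one-hot labels.
clip_a, clip_b = np.zeros((8, 4, 4, 3)), np.ones((8, 4, 4, 3))
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mixed_x, mixed_y, lam = mixup(clip_a, label_a, clip_b, label_b,
                              rng=np.random.default_rng(0))
```

Because the labels are mixed with the same weight as the inputs, the resulting soft label still sums to 1 and the classifier is trained on the blended clip directly.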

    References
    [1] Weinzaepfel P, Harchaoui Z, Schmid C. Learning to track for spatio-temporal action localization. Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago: IEEE, 2015. 3164–3172.
    [2] Peng XJ, Schmid C. Multi-region two-stream R-CNN for action detection. Proceedings of the 14th European Conference on Computer Vision. Cham: Springer, 2016. 744–759.
    [3] Yang ZH, Gao JY, Nevatia R. Spatio-temporal action detection with cascade proposal and location anticipation. arXiv: 1708.00042, 2017.
    [4] Alwando EHP, Chen YT, Fang WH. CNN-based multiple path search for action tube detection in videos. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(1): 104–116. [doi: 10.1109/TCSVT.2018.2887283]
    [5] Hou R, Chen C, Shah M. Tube Convolutional Neural Network (T-CNN) for action detection in videos. Proceedings of 2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017. 5823–5832.
    [6] Kalogeiton V, Weinzaepfel P, Ferrari V, et al. Action tubelet detector for spatio-temporal action localization. Proceedings of 2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017. 4415–4423.
    [7] Li D, Qiu ZF, Dai Q, et al. Recurrent tubelet proposal and recognition networks for action detection. Proceedings of the 15th European Conference on Computer Vision. Cham: Springer, 2018. 306–322.
    [8] He JW, Deng ZW, Ibrahim MS, et al. Generic tubelet proposals for action localization. Proceedings of 2018 IEEE Winter Conference on Applications of Computer Vision. Lake Tahoe: IEEE, 2018. 343–351.
    [9] Qiu ZF, Yao T, Mei T. Learning spatio-temporal representation with pseudo-3D residual networks. Proceedings of 2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017. 5534–5542.
    [10] Tran D, Wang H, Torresani L, et al. A closer look at spatiotemporal convolutions for action recognition. Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018. 6450–6459.
    [11] Chawla NV, Bowyer KW, Hall LO, et al. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002, 16: 321–357. [doi: 10.1613/jair.953]
    [12] Inoue H. Data augmentation by pairing samples for images classification. arXiv: 1801.02929, 2018.
    [13] Goodfellow IJ, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks. Advances in Neural Information Processing Systems, 2014, 3: 2672–2680.
    [14] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE, 2014. 580–587.
    [15] Ren SQ, He KM, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. [doi: 10.1109/TPAMI.2016.2577031]
    [16] Girshick R. Fast R-CNN. Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago: IEEE, 2015. 1440–1448.
    [17] Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection. Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016. 779–788.
    [18] Liu W, Anguelov D, Erhan D, et al. SSD: Single shot multibox detector. Proceedings of the 14th European Conference on Computer Vision. Cham: Springer, 2016. 21–37.
    [19] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. Proceedings of the 27th International Conference on Neural Information Processing Systems. Montreal: NIPS, 2014. 568–576.
    [20] Hara K, Kataoka H, Satoh Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and imagenet? Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018. 6546–6555.
    [21] Köpüklü O, Wei XY, Rigoll G. You only watch once: A unified CNN architecture for real-time spatiotemporal action localization. arXiv: 1911.06644, 2019.
    [22] Pramono RRA, Chen YT, Fang WH. Hierarchical self-attention network for action localization in videos. Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019. 61–70.
    [23] Redmon J, Farhadi A. YOLOv3: An incremental improvement. arXiv: 1804.02767, 2018.
    [24] Li C, Zhong QY, Xie D, et al. Collaborative spatiotemporal feature learning for video action recognition. Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019. 7864–7873.
    [25] Touvron H, Vedaldi A, Douze M, et al. Fixing the train-test resolution discrepancy. Proceedings of the 33rd Conference on Neural Information Processing Systems. Vancouver: NeurIPS, 2019. 8250–8260.
    [26] Yang XT, Yang XD, Liu MY, et al. STEP: Spatio-TEmporal Progressive learning for video action detection. Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019. 264–272.
    [27] Singh G, Saha S, Sapienza M, et al. Online real-time multiple spatiotemporal action localisation and prediction. Proceedings of 2017 IEEE International Conference on Computer Vision. Venice: IEEE, 2017. 3657–3666.
    [28] Saha S, Singh G, Sapienza M, et al. Deep learning for detecting multiple space-time action tubes in videos. arXiv: 1608.01529, 2016.
    [29] Singh G, Saha S, Cuzzolin F. Predicting action tubes. Proceedings of European Conference on Computer Vision. Cham: Springer, 2018. 106–123.
Cite this article:

王紫烟, 张立华, 翟鹏, 杜洋涛. Spatio-temporal action localization algorithm based on 3D-SVD. 计算机系统应用, 2021, 30(10): 109–117

History
  • Received: 2021-01-06
  • Revised: 2021-02-07
  • Published online: 2021-10-08