Video Description Method Combining Feature Reinforcement and Knowledge Supplementation
Authors: Wang Lin, Bai Yunfan
Funding: Key Project of the Shaanxi Provincial Science and Technology Program (2017ZDCXL-GY-05-03)

    Abstract:

    Texts generated by existing video description models are often of low quality and lack novelty. To address this, this study proposes an encoder-decoder model based on feature reinforcement and textual knowledge supplementation. In the encoding stage, the model strengthens local and global features to enhance the fine-grained feature extraction of static objects in a video, which improves the discrimination of objects with similar semantics; it then fuses visual semantics and video features in a long short-term memory (LSTM) network. In the decoding stage, to mine implicit information in the video that machines can hardly discover on their own, the model samples a subset of video frames and detects the visual targets in them; the detected targets are then used to retrieve knowledge from an external knowledge base, which supplements the generation of the description and thus yields more novel and natural text. Experimental results on the MSVD and MSR-VTT datasets show that the proposed method performs well and that the generated descriptions can, to a certain extent, express novel implicit information.
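The decoding-stage pipeline in the abstract (sample frames, detect visual targets, retrieve related knowledge to enrich generation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the detector, the ConceptNet-style lookup table, and all names (`sample_frames`, `detect_objects`, `KNOWLEDGE_BASE`, the toy relations) are hypothetical stand-ins.

```python
# Hypothetical sketch of decoding-stage knowledge supplementation:
# sample a few frames, "detect" visual targets in them, then pull
# related facts from a small ConceptNet-style table that the caption
# generator could draw on for more novel wording.

# Stand-in for an external knowledge base such as ConceptNet:
# object label -> list of (relation, concept) pairs.
KNOWLEDGE_BASE = {
    "dog":   [("CapableOf", "bark"), ("AtLocation", "park")],
    "ball":  [("UsedFor", "play"), ("HasProperty", "round")],
    "piano": [("UsedFor", "music"), ("AtLocation", "concert hall")],
}

def sample_frames(frames, step):
    """Keep every `step`-th frame (a crude stand-in for frame sampling)."""
    return frames[::step]

def detect_objects(frame):
    """Stand-in for an object detector: here each 'frame' is already
    represented by its list of object labels."""
    return frame

def supplement_knowledge(frames, step=2):
    """Pair each detected object with the external facts retrieved for it."""
    facts = []
    for frame in sample_frames(frames, step):
        for obj in detect_objects(frame):
            for relation, concept in KNOWLEDGE_BASE.get(obj, []):
                facts.append((obj, relation, concept))
    return facts

# Toy "video": each frame is just its detected labels.
video = [["dog"], ["cat"], ["dog", "ball"], ["tree"]]
extra = supplement_knowledge(video)
# extra now holds triples such as ("dog", "CapableOf", "bark") that a
# decoder could use to supplement the generated description.
```

In a real system the sampled frames would go through a detector such as Fast R-CNN (cited as reference [22]) and the lookup would query ConceptNet (reference [21]); the point of the sketch is only the data flow from detected targets to supplementary knowledge.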

    References
    [1] Kojima A, Izumi M, Tamura T, et al. Generating natural language description of human behavior from video images. Proceedings of the 15th International Conference on Pattern Recognition. Barcelona: IEEE, 2000. 728–731.
    [2] Zhao B, Li XL, Lu XQ. CAM-RNN: Co-attention model based RNN for video captioning. IEEE Transactions on Image Processing, 2019, 28(11): 5552–5565. [doi: 10.1109/TIP.2019.2916757]
    [3] Liu CJ, Wechsler H. Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing, 2002, 11(4): 467–476. [doi: 10.1109/TIP.2002.999679]
    [4] Song JK, Yang Y, Yang Y, et al. Inter-media hashing for large-scale retrieval from heterogeneous data sources. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2013. 785–796.
    [5] Krishnamoorthy N, Malkarnenkar G, Mooney R, et al. Generating natural-language video descriptions using text-mined knowledge. Proceedings of the 27th AAAI Conference on Artificial Intelligence. Bellevue: AAAI Press, 2013. 541–547.
    [6] Ordonez V, Kulkarni G, Berg TL. Im2Text: Describing images using 1 million captioned photographs. Proceedings of the 24th International Conference on Neural Information Processing Systems. Granada: Curran Associates Inc., 2011. 1143–1151.
    [7] Donahue J, Hendricks LA, Rohrbach M, et al. Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 677–691. [doi: 10.1109/TPAMI.2016.2599174]
    [8] Venugopalan S, Rohrbach M, Donahue J, et al. Sequence to sequence-video to text. Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago: IEEE, 2015. 4534–4542.
    [9] Yao L, Torabi A, Cho K, et al. Describing videos by exploiting temporal structure. Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago: IEEE, 2015. 4507–4515.
    [10] Pei WJ, Zhang JY, Wang XR, et al. Memory-attended recurrent network for video captioning. Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019. 8347–8356.
    [11] Gan Z, Gan C, He XD, et al. Semantic compositional networks for visual captioning. Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017. 1141–1150.
    [12] Chen HR, Lin K, Maye A, et al. A semantics-assisted video captioning model trained with scheduled sampling. Frontiers in Robotics and AI, 2020, 7: 475767. [doi: 10.3389/frobt.2020.475767]
    [13] Chen M, Li YM, Zhang ZF, et al. TVT: Two-view transformer network for video captioning. Proceedings of the 10th Asian Conference on Machine Learning. Beijing: PMLR, 2018. 847–862.
    [14] Ding EJ, Liu ZY, Liu YF, et al. Video description method based on multi-dimensional and multi-modal information. Journal on Communications, 2020, 41(2): 36–43 (in Chinese). [doi: 10.11959/j.issn.1000-436x.2020037]
    [15] Li MX, Xu C, Li XW, et al. Research on a video description model for urban road scenes based on multi-modal fusion. Application Research of Computers, 2022 (in Chinese).
    [16] Zhang ZQ, Shi YY, Yuan CF, et al. Object relational graph with teacher-recommended learning for video captioning. Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020. 13275–13285.
    [17] Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning. Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco: AAAI Press, 2017. 4278–4284.
    [18] Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database. Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami: IEEE, 2009. 248–255.
    [19] Zolfaghari M, Singh K, Brox T. ECO: Efficient convolutional network for online video understanding. Proceedings of the 15th European Conference on Computer Vision. Munich: Springer, 2018. 713–730.
    [20] Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the Kinetics dataset. Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017. 4724–4733.
    [21] Speer R, Chin J, Havasi C. ConceptNet 5.5: An open multilingual graph of general knowledge. Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco: AAAI Press, 2017. 4444–4451.
    [22] Girshick R. Fast R-CNN. Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago: IEEE, 2015. 1440–1448.
    [23] Chen DL, Dolan WB. Collecting highly parallel data for paraphrase evaluation. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland: ACL, 2011. 190–200.
    [24] Xu J, Mei T, Yao T, et al. MSR-VTT: A large video description dataset for bridging video and language. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016. 5288–5296.
    [25] Papineni K, Roukos S, Ward T, et al. BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia: ACL, 2002. 311–318.
    [26] Banerjee S, Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the 2005 ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor: ACL, 2005. 65–72.
    [27] Lin CY. ROUGE: A package for automatic evaluation of summaries. Proceedings of the 2004 Text Summarization Branches Out. Barcelona: ACL, 2004. 74–81.
    [28] Vedantam R, Zitnick CL, Parikh D. CIDEr: Consensus-based image description evaluation. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015. 4566–4575.
Cite this article

Wang L, Bai YF. Video description method combining feature reinforcement and knowledge supplementation. Computer Systems & Applications, 2023, 32(5): 273–282 (in Chinese)

History
  • Received: 2022-11-07
  • Revised: 2022-12-10
  • Published online: 2023-03-24