Cross-modal Time Alignment Network for Audio-visual Event Localization
Authors: Wang Zhihao, Zi Lingling (王志豪, 訾玲玲)
Funding: Key Project of Chongqing Education Science Planning (K22YE205098); Doctoral Startup Fund of Chongqing Normal University (21XLB030, 21XLB029)
Abstract:

    The audio-visual event localization (AVEL) task locates events in a video by observing the audio information and the corresponding visual information. This paper designs a cross-modal time alignment network, CMTAN, for the AVEL task. The network consists of four parts: preprocessing, cross-modal interaction, time alignment, and feature fusion. Specifically, in the preprocessing part, a new cross-modal audio guidance module and a noise reduction module reduce the background and noise in the modal information. Then, in the cross-modal interaction part, information reinforcement and information complementation modules based on the multi-head attention mechanism perform cross-modal interaction, so that the unimodal information is refined with global information. In the time alignment part, a time alignment module that focuses on the unimodal global information before and after cross-modal interaction aligns the features of the two modalities. Finally, in the feature fusion part, a multi-stage fusion module fuses the two kinds of modal information from shallow to deep, and the fused information is ultimately used for event localization. Extensive experiments demonstrate that CMTAN achieves excellent performance in both weakly supervised and fully supervised AVEL tasks.
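    As an illustration of the cross-modal interaction step described above, the following is a minimal PyTorch sketch of how multi-head attention can let each modality complement the other over per-segment audio and visual features. The module name, feature dimension, number of heads, and the residual/normalization details are assumptions made for the example; this is a generic sketch of the technique, not the authors' CMTAN implementation.

    ```python
    # Minimal sketch (assumptions: 256-dim features, one feature per 1-second segment,
    # residual connections with LayerNorm). Not the authors' implementation.
    import torch
    import torch.nn as nn

    class CrossModalInteraction(nn.Module):
        """Audio queries attend to visual keys/values, and vice versa."""
        def __init__(self, dim: int = 256, num_heads: int = 4):
            super().__init__()
            self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_a = nn.LayerNorm(dim)
            self.norm_v = nn.LayerNorm(dim)

        def forward(self, audio: torch.Tensor, visual: torch.Tensor):
            # audio:  (batch, T, dim) -- T audio segment features
            # visual: (batch, T, dim) -- T corresponding visual segment features
            a_enh, _ = self.a2v(query=audio, key=visual, value=visual)  # visual complements audio
            v_enh, _ = self.v2a(query=visual, key=audio, value=audio)   # audio complements visual
            # residual connections keep the original unimodal information
            return self.norm_a(audio + a_enh), self.norm_v(visual + v_enh)

    # usage
    a = torch.randn(2, 10, 256)  # 10 one-second audio segments
    v = torch.randn(2, 10, 256)  # 10 corresponding visual segments
    a_out, v_out = CrossModalInteraction()(a, v)
    print(a_out.shape, v_out.shape)  # both: torch.Size([2, 10, 256])
    ```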

Cite this article

Wang ZH, Zi LL. Cross-modal time alignment network for audio-visual event localization. 计算机系统应用, 2025, 34(3): 133–142.

History
  • Received: 2024-08-24
  • Revised: 2024-09-19
  • Published online: 2025-01-16