Multimodal Sentiment Analysis Based on Dual Encoder Representation Learning
(Original title: 基于双编码器表示学习的多模态情感分析)

Authors: 冼广铭, 阳先平, 招志锋

Funding: National Natural Science Foundation of China (61070015)


    Abstract:

Multimodal sentiment analysis aims to judge users’ sentiment from the videos they upload to social platforms. Current research mainly designs complex multimodal fusion networks to learn the consistency information shared across modalities, which improves model performance to some extent, but most of it overlooks the complementary role of the difference information between modalities, leading to biased sentiment analysis. This study proposes DERL (dual encoder representation learning), a multimodal sentiment analysis model that learns modality-invariant and modality-specific representations through a dual-encoder structure. Specifically, a cross-modal interaction encoder based on a hierarchical attention mechanism learns modality-invariant representations over all modalities to capture consistency information, while an intra-modal encoder based on a self-attention mechanism learns the modality-specific representation private to each modality to capture difference information. In addition, two gated network units are designed to enhance and filter the encoded features so that the modality-invariant and modality-specific representations are combined more effectively. Finally, during fusion, the L2 distance between the different multimodal representations is reduced to capture the latent similar sentiment among them for sentiment prediction. Experimental results on two public datasets, CMU-MOSI and CMU-MOSEI, show that the model outperforms a range of baselines.
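The abstract describes the architecture only at a high level, so the sketch below is a minimal PyTorch rendering of that description, not the authors' code: a single multi-head attention layer stands in for the hierarchical cross-modal interaction encoder, one transformer layer per modality plays the intra-modal encoder, two sigmoid gates act as the gated units, and pairwise mean-squared error serves as the L2 similarity term. All names and dimensions (DERLSketch, GatedUnit, dim=128, and so on) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedUnit(nn.Module):
    """Hypothetical gating unit: a learned sigmoid gate that enhances
    informative dimensions of an encoded feature and filters out noise."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.proj(x)) * x


class DERLSketch(nn.Module):
    """Illustrative dual-encoder model over text/audio/vision features.

    One shared cross-modal encoder yields modality-invariant features
    (consistency information); per-modality self-attention encoders yield
    modality-specific features (difference information).
    """
    MODALITIES = ("text", "audio", "vision")

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        # Cross-modal interaction encoder: each modality attends over the
        # concatenated features of all modalities.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Intra-modal encoders: plain self-attention within one modality.
        self.self_enc = nn.ModuleDict({
            m: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for m in self.MODALITIES
        })
        # The two gated units: one filters invariant features, one specific.
        self.gate_inv = GatedUnit(dim)
        self.gate_spec = GatedUnit(dim)
        self.head = nn.Linear(len(self.MODALITIES) * dim, 1)  # sentiment score

    def forward(self, feats: dict[str, torch.Tensor]):
        # feats[m]: (batch, seq_len_m, dim) pre-extracted unimodal features.
        all_feats = torch.cat([feats[m] for m in self.MODALITIES], dim=1)
        reps = []
        for m in self.MODALITIES:
            invariant, _ = self.cross_attn(feats[m], all_feats, all_feats)
            specific = self.self_enc[m](feats[m])
            combined = self.gate_inv(invariant) + self.gate_spec(specific)
            reps.append(combined.mean(dim=1))  # pool over time -> (batch, dim)
        # Fusion-time similarity loss: shrink the L2 distance between the
        # per-modality multimodal representations so they share sentiment.
        sim_loss = sum(
            F.mse_loss(reps[i], reps[j])
            for i in range(len(reps)) for j in range(i + 1, len(reps))
        )
        pred = self.head(torch.cat(reps, dim=-1)).squeeze(-1)
        return pred, sim_loss
```

In training, sim_loss would presumably be added to the main prediction loss with a weighting hyperparameter; the abstract does not say how the two terms are balanced.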

Cite this article:

冼广铭, 阳先平, 招志锋. 基于双编码器表示学习的多模态情感分析 [Multimodal Sentiment Analysis Based on Dual Encoder Representation Learning]. 计算机系统应用 (Computer Systems & Applications), 2024, 33(4): 13-25.

History
  • Received: 2023-10-15
  • Revised: 2023-11-15
  • Published online: 2024-01-30