Abstract: Multimodal emotion recognition in conversation aims to understand the emotions behind utterances by analyzing the various types of data generated during a conversation, such as text, audio, and visual data. Accordingly, numerous methods based on multimodal information fusion have been proposed and have achieved notable performance. However, these methods often neglect that the importance of each modality varies across contexts, and they overlook the heterogeneity of multimodal data, which can leave a significant gap between modal features and thereby hinder effective multimodal fusion. To address these issues, this study proposes a modality de-heterogenization and adaptive fusion model for emotion recognition in conversation. First, a shared encoder maps features from different modalities into a shared semantic space to preliminarily reduce the gap between modal features. Then, shared convolutional networks maximize the mutual semantic information across modalities to eliminate the gap between modal features, while private convolutional networks preserve the diversity of modal features. Subsequently, a self-attention mechanism learns the importance of each modality, enabling adaptive fusion of modal information. Finally, experimental results on two public datasets demonstrate that the proposed model outperforms existing baseline models.
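To make the described pipeline concrete, the following is a minimal sketch of the three stages outlined in the abstract: a shared encoder projecting each modality into a common space, shared and private convolutional branches, and self-attention over the modality axis for adaptive fusion. All module names, feature dimensions, and the choice of PyTorch are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the described pipeline (shared encoder -> shared/private
# convolutions -> self-attention fusion). Dimensions and layer choices are assumed.
import torch
import torch.nn as nn


class AdaptiveFusionSketch(nn.Module):
    def __init__(self, dims=(768, 100, 512), hidden=256):
        super().__init__()
        # Per-modality projections feeding one shared encoder (shared semantic space).
        self.projections = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.shared_encoder = nn.Linear(hidden, hidden)
        # One shared Conv1d applied to every modality (mutual/common information)
        # and one private Conv1d per modality (modality-specific diversity).
        self.shared_conv = nn.Conv1d(hidden, hidden, kernel_size=1)
        self.private_convs = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, kernel_size=1) for _ in dims]
        )
        # Self-attention over the modality axis to weight modalities adaptively.
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, feats):  # feats: list of (batch, dim) utterance features
        tokens = []
        for i, x in enumerate(feats):
            h = self.shared_encoder(torch.relu(self.projections[i](x)))  # (B, H)
            c = h.unsqueeze(-1)                             # (B, H, 1) for Conv1d
            common = self.shared_conv(c).squeeze(-1)        # shared across modalities
            private = self.private_convs[i](c).squeeze(-1)  # modality-specific
            tokens.append(common + private)
        seq = torch.stack(tokens, dim=1)                    # (B, num_modalities, H)
        fused, _ = self.attn(seq, seq, seq)                 # adaptive modality weighting
        return fused.mean(dim=1)                            # fused utterance representation


# Usage with dummy text / audio / visual features for a batch of 4 utterances.
model = AdaptiveFusionSketch()
feats = [torch.randn(4, d) for d in (768, 100, 512)]
print(model(feats).shape)  # torch.Size([4, 256])
```

A classification head over the fused representation and the mutual-information and diversity objectives mentioned in the abstract are omitted here; the sketch only illustrates how the three stages compose.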