Abstract: In current multi-modal emotion analysis of videos, the influence of modality representation learning on modality fusion and the final classification results has not been adequately considered. To address this, this study proposes a multi-modal emotion analysis model that integrates cross-modal representation learning. First, BERT and LSTM are used to extract internal information from the text, audio, and visual modalities separately; cross-modal representation learning is then applied to obtain more information-rich unimodal features. In the modality fusion stage, a gating mechanism is incorporated into the traditional Transformer fusion mechanism to control the flow of information more precisely. Experimental results on the publicly available CMU-MOSI and CMU-MOSEI datasets show that the model improves accuracy and F1 score over traditional models, validating its effectiveness.
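The abstract does not give the exact formulation of the gated Transformer fusion, so the following PyTorch sketch illustrates one common way to combine a sigmoid gate with cross-modal attention: a target modality attends to a source modality, and the gate decides how much attended information to admit. The module and variable names (e.g. GatedCrossModalFusion) are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Hypothetical sketch: gated cross-modal Transformer fusion.

    The target modality (e.g. text) attends to a source modality
    (e.g. audio); a sigmoid gate then controls how much of the
    attended information flows into the target representation.
    This is an assumed formulation, not the paper's exact one.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)  # gate computed from [target; attended]
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, T_t, dim); source: (batch, T_s, dim)
        attended, _ = self.cross_attn(target, source, source)
        g = torch.sigmoid(self.gate(torch.cat([target, attended], dim=-1)))
        # The gate g scales the cross-modal signal element-wise.
        return self.norm(target + g * attended)

# Usage: fuse audio information into text features.
text = torch.randn(8, 50, 128)    # (batch, text steps, dim)
audio = torch.randn(8, 200, 128)  # (batch, audio steps, dim)
fused = GatedCrossModalFusion(dim=128)(text, audio)
print(fused.shape)  # torch.Size([8, 50, 128])
```

In this sketch the gate plays the role the abstract ascribes to the gating mechanism: rather than adding the attention output unconditionally, as a standard Transformer layer would, the learned gate modulates it per dimension before the residual connection.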