Abstract: To address the challenge of efficiently fusing audio and video features while accurately extracting time-dependent emotional information in audiovisual emotion recognition, this paper proposes a mutual information-based audiovisual emotion recognition model that incorporates Kolmogorov-Arnold long short-term memory (KLSTM). Mutual information-based feature selection and adaptive window processing are employed to extract emotionally salient key segments from the audio and video signals, effectively reducing information redundancy. The KLSTM network is integrated into the feature extraction stage to capture the temporal dependencies of the audiovisual signals. In the fusion stage, cross-modal consistency maximization ensures that the audio and video features remain coordinated and complementary. Experimental results demonstrate that the proposed model outperforms existing benchmark models on both the CMU-MOSI and CMU-MOSEI datasets, validating its effectiveness for multimodal emotion recognition.
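
The abstract does not spell out the selection and windowing procedure; the sketch below is one minimal reading of mutual information-based key-segment extraction, scoring candidate windows by the average mutual information between their features and emotion labels via scikit-learn's `mutual_info_classif`. The fixed window length, non-overlapping windows, and top-k rule are illustrative assumptions (the paper uses adaptive windows).

```python
# Minimal sketch of mutual information-based segment selection.
# Assumptions (not specified in the abstract): fixed-length,
# non-overlapping candidate windows, per-frame labels, top-k keep rule.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_key_segments(features, labels, window=32, top_k=4):
    """Score each window of frame-level features by the average mutual
    information between its feature dimensions and the emotion labels,
    then keep the top_k highest-scoring windows.

    features: (T, D) array of frame-level audio or video features
    labels:   (T,) array of discrete emotion labels per frame
    """
    starts = list(range(0, len(features) - window + 1, window))
    scores = []
    for s in starts:
        X = features[s:s + window]
        y = labels[s:s + window]
        if len(np.unique(y)) < 2:  # MI is degenerate for a single class
            scores.append(0.0)
            continue
        # mutual_info_classif returns one MI estimate per feature
        # dimension; average them into a single window score.
        scores.append(mutual_info_classif(X, y, random_state=0).mean())
    keep = np.argsort(scores)[-top_k:]
    return [features[s:s + window] for s in np.array(starts)[keep]]
```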
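The abstract likewise does not define the KLSTM architecture. A natural reading is a standard LSTM cell whose linear gate projections are replaced by Kolmogorov-Arnold network (KAN) layers, i.e., learnable sums of univariate basis functions. The sketch below follows that reading; the Gaussian radial basis, layer sizes, and class names (`SimpleKANLayer`, `KLSTMCell`) are assumptions, not the paper's implementation.

```python
# Minimal sketch of a KLSTM cell: an LSTM whose gate projections are
# KAN-style layers (learnable sums of univariate basis functions)
# instead of plain linear maps. Basis choice and sizes are assumptions.
import torch
import torch.nn as nn

class SimpleKANLayer(nn.Module):
    """y_i = sum_j sum_k c[i, j, k] * phi_k(x_j), where phi_k is a fixed
    Gaussian radial basis on a grid and c is learnable per edge."""
    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        self.register_buffer("centers", torch.linspace(-2.0, 2.0, n_basis))
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, n_basis))

    def forward(self, x):                        # x: (B, in_dim)
        # (B, in_dim, n_basis) radial-basis expansion of each input
        phi = torch.exp(-(x.unsqueeze(-1) - self.centers) ** 2)
        # contract over input and basis dimensions -> (B, out_dim)
        return torch.einsum("bjk,ojk->bo", phi, self.coef)

class KLSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # one KAN layer per gate, applied to the concatenated [x_t, h_{t-1}]
        self.gates = nn.ModuleDict({
            g: SimpleKANLayer(input_dim + hidden_dim, hidden_dim)
            for g in ("i", "f", "o", "c")
        })

    def forward(self, x, state):                 # x: (B, input_dim)
        h, c = state
        z = torch.cat([x, h], dim=-1)
        i = torch.sigmoid(self.gates["i"](z))    # input gate
        f = torch.sigmoid(self.gates["f"](z))    # forget gate
        o = torch.sigmoid(self.gates["o"](z))    # output gate
        g = torch.tanh(self.gates["c"](z))       # candidate cell state
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

# Usage over a (time, batch, feature) sequence:
cell = KLSTMCell(input_dim=64, hidden_dim=128)
h = c = torch.zeros(8, 128)
for x_t in torch.randn(10, 8, 64):
    h, c = cell(x_t, (h, c))
```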
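Finally, the cross-modal consistency objective is not specified in the abstract; one common instantiation of consistency maximization between paired modality embeddings is a symmetric InfoNCE loss, sketched below. The projection dimensionality and temperature are illustrative assumptions.

```python
# Minimal sketch of a cross-modal consistency objective: pull matched
# audio/video clip embeddings together, push mismatched pairs apart.
# InfoNCE is one common choice, not necessarily the paper's objective.
import torch
import torch.nn.functional as F

def consistency_loss(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (B, D) embeddings of the same B clips.
    Maximizing consistency = minimizing this symmetric InfoNCE loss,
    treating the paired clip as the positive and the rest of the batch
    as negatives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(len(a), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```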