Abstract: Multimodal sentiment analysis currently suffers from two problems: insufficient feature extraction within individual modalities and instability in data fusion. This study proposes an interpolation-based method for optimizing modal features to address these problems. First, interpolation-optimized BERT and GRU models are applied to extract features, with both models used to mine text, audio, and video information. Second, an improved attention mechanism fuses the text, audio, and video information, making modal fusion more stable. The method is evaluated on the MOSI and MOSEI datasets. The experimental results show that interpolation-based optimization of modal features improves the accuracy of multimodal sentiment analysis, verifying the effectiveness of interpolation.