Abstract: A key task in aspect-level multimodal sentiment classification is to accurately extract and fuse complementary information from the two different modalities of text and vision, so as to detect the sentiment orientation of the aspect terms mentioned in the text. Most existing methods combine only a single source of context information with image information, which makes them insensitive to the correlations among aspect, context and visual information and imprecise in localizing aspect-related regions in images. In addition, when features are fused, insufficient information in some modalities leads to a mediocre fusion effect. To address these problems, this study proposes an attention fusion network, the AF-Net model, for aspect-level multimodal sentiment classification. A spatial transformer network (STN) learns the location information of objects in images to help extract important local features. A Transformer-based interaction network models the relationships among aspects, texts and images to realize multimodal interaction. At the same time, similar information shared across the modal features is supplemented, and the multi-feature information is fused by a multi-attention mechanism to represent the multimodal information. Finally, the sentiment classification result is obtained through a Softmax layer. Experiments and comparisons on two benchmark datasets show that AF-Net achieves better performance and improves the effect of aspect-level multimodal sentiment classification.
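The following is a minimal PyTorch sketch of the pipeline outlined in the abstract (STN-based local feature extraction, Transformer-based aspect-text-image interaction, multi-attention fusion, and a Softmax classifier). All module names, dimensions, and wiring choices here (e.g. SpatialTransformer, AFNet, d_model=128, the patchify stem) are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch only; the real AF-Net design may differ in every detail.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialTransformer(nn.Module):
    """STN: predicts an affine transform and resamples the image so the
    network can focus on sentiment-relevant local regions."""
    def __init__(self, channels=3):
        super().__init__()
        self.loc_net = nn.Sequential(
            nn.Conv2d(channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(10, 6),
        )
        # Initialize the final layer to the identity transform.
        self.loc_net[-1].weight.data.zero_()
        self.loc_net[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, img):                        # img: (B, C, H, W)
        theta = self.loc_net(img).view(-1, 2, 3)   # affine parameters
        grid = F.affine_grid(theta, img.size(), align_corners=False)
        return F.grid_sample(img, grid, align_corners=False)


class AFNet(nn.Module):
    """Aspect/text/image interaction via a Transformer encoder, followed by
    multi-head attention fusion and a softmax sentiment classifier."""
    def __init__(self, d_model=128, num_classes=3, num_heads=4):
        super().__init__()
        self.stn = SpatialTransformer()
        # Patchify stem mapping the transformed image to region features.
        self.img_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        enc_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.interaction = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Fusion: aspect tokens attend over the joint text + image features.
        self.fusion_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, aspect_emb, text_emb, img):
        # aspect_emb: (B, La, D), text_emb: (B, Lt, D), img: (B, 3, H, W)
        img = self.stn(img)                                           # focus on salient regions
        img_feat = self.img_encoder(img).flatten(2).transpose(1, 2)   # (B, P, D)
        # Model aspect-text-image relations jointly.
        joint = self.interaction(torch.cat([aspect_emb, text_emb, img_feat], dim=1))
        query = joint[:, :aspect_emb.size(1), :]                      # aspect part
        context = joint[:, aspect_emb.size(1):, :]                    # text + image part
        fused, _ = self.fusion_attn(query, context, context)          # attention fusion
        logits = self.classifier(fused.mean(dim=1))                   # pool aspect tokens
        return F.softmax(logits, dim=-1)                              # sentiment distribution


# Toy usage with random inputs (embedding lookup omitted for brevity).
model = AFNet()
probs = model(torch.randn(2, 4, 128), torch.randn(2, 20, 128), torch.randn(2, 3, 224, 224))
print(probs.shape)  # torch.Size([2, 3])
```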