Abstract: In recent years, Visual Question Answering (VQA) based on the fusion of visual features from images and textual features from questions has attracted wide attention from researchers. Most existing models achieve fine-grained interaction and matching through attention mechanisms and intensive iterative operations driven by the similarity between image-region and question-word pairs, while ignoring the autocorrelation information within the image regions and within the question words themselves. This paper introduces a model based on a symmetric attention mechanism, which effectively reduces the overall semantic deviation by modeling the semantic association between images and questions, thereby improving the accuracy of answer prediction. Experiments conducted on the VQA 2.0 dataset show that the proposed model based on the symmetric attention mechanism has clear advantages over the baseline model.
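To make the idea concrete, the following is a minimal sketch of what a symmetric attention block of this kind might look like, assuming the mechanism computes a region-word affinity matrix, attends across it in both directions, and adds per-modality self-attention to capture the autocorrelation of image regions and question words that the abstract highlights. All class, variable, and dimension names here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SymmetricAttention(nn.Module):
    """Hypothetical symmetric co-attention over region and word features."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)  # projects image region features
        self.proj_q = nn.Linear(dim, dim)  # projects question word features
        self.scale = dim ** -0.5

    def forward(self, v: torch.Tensor, q: torch.Tensor):
        # v: (batch, n_regions, dim) image region features
        # q: (batch, n_words, dim) question word features
        v_p, q_p = self.proj_v(v), self.proj_q(q)

        # Affinity between every region-word pair: (batch, n_regions, n_words)
        affinity = torch.bmm(v_p, q_p.transpose(1, 2)) * self.scale

        # Cross-attention in both directions (the "symmetric" part):
        # words attend over regions, regions attend over words.
        v_att = torch.bmm(F.softmax(affinity, dim=1).transpose(1, 2), v)  # (batch, n_words, dim)
        q_att = torch.bmm(F.softmax(affinity, dim=2), q)                  # (batch, n_regions, dim)

        # Self-attention within each modality, covering the autocorrelation
        # of image regions and of question words.
        v_self = torch.bmm(
            F.softmax(torch.bmm(v_p, v_p.transpose(1, 2)) * self.scale, dim=2), v)
        q_self = torch.bmm(
            F.softmax(torch.bmm(q_p, q_p.transpose(1, 2)) * self.scale, dim=2), q)

        # Fuse cross- and self-attended features per modality.
        return v_self + q_att, q_self + v_att


if __name__ == "__main__":
    # Usage: 36 region features and a 14-word question, both 512-d
    # (typical sizes for VQA pipelines; assumed, not from the paper).
    block = SymmetricAttention(dim=512)
    v = torch.randn(2, 36, 512)
    q = torch.randn(2, 14, 512)
    v_out, q_out = block(v, q)
    print(v_out.shape, q_out.shape)  # torch.Size([2, 36, 512]) torch.Size([2, 14, 512])
```

The design choice worth noting is that a single affinity matrix is softmax-normalized along each of its two axes, so the image-to-question and question-to-image attention maps are derived symmetrically from the same pairwise similarities rather than from two independent attention computations.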