Abstract: Image captioning has attracted much attention in the fields of machine learning and computer vision. It is not only an important practical application, but also a challenging image-understanding problem in computer vision. Nevertheless, existing methods simply rely on several different visual features and model architectures; the correlation between visual features and user tags has not been fully explored. This study proposes a multi-faceted attention model based on user tags and visual features. The model can automatically attend to the more salient image features and incorporate users' semantic information. Experiments are conducted on the MSCOCO dataset, and the results show that the proposed algorithm outperforms previous methods.
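To make the idea concrete, the following minimal PyTorch sketch illustrates one way such multi-faceted attention could be realized: image region features and user-tag embeddings are projected into a shared space and attended over jointly, so the model can weight whichever facet (visual or semantic) is more informative at each decoding step. All names, dimensions, and the additive-attention scoring here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFacetedAttention(nn.Module):
    """Hypothetical sketch: jointly attend over visual features and user-tag
    embeddings; the gating-by-softmax design is an assumption for illustration."""

    def __init__(self, visual_dim: int, tag_dim: int, hidden_dim: int):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # project image regions
        self.tag_proj = nn.Linear(tag_dim, hidden_dim)        # project user tags
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)   # project decoder state
        self.score = nn.Linear(hidden_dim, 1)                 # additive attention score

    def forward(self, query, visual_feats, tag_embeds):
        # query:        (batch, hidden_dim)            decoder state at this step
        # visual_feats: (batch, n_regions, visual_dim) image region features
        # tag_embeds:   (batch, n_tags, tag_dim)       user-tag embeddings
        facets = torch.cat(
            [self.visual_proj(visual_feats), self.tag_proj(tag_embeds)], dim=1
        )  # (batch, n_regions + n_tags, hidden_dim)
        q = self.query_proj(query).unsqueeze(1)        # (batch, 1, hidden_dim)
        scores = self.score(torch.tanh(facets + q))    # (batch, n_facets, 1)
        weights = F.softmax(scores, dim=1)             # normalize across all facets
        return (weights * facets).sum(dim=1)           # fused context vector
```

Because the softmax runs over the concatenated set of visual and tag facets, the attention weights effectively decide, per decoding step, how much the caption should draw on image content versus user-supplied semantics.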