Abstract: Video scene recognition has attracted much attention in machine learning and computer vision. It is both an important practical application and a challenging image-understanding problem. Nevertheless, current approaches to video scene recognition do not yet meet the needs of production environments, and most proposed models use only video-level feature information while ignoring the associations among multi-granularity video features. In this study, we propose an attention-based architecture over multi-granularity video features that dynamically and efficiently exploits the rich semantic associations among the different dimensions of video information, thereby improving model performance. Experiments are conducted on the latest VideoNet dataset released by CCF China MM 2019. The results show that the proposed attention-based model with multi-granularity video features outperforms previous methods.