Abstract: Video understanding requires modeling temporal information, which is difficult, and traditional methods are computationally expensive. To address both problems, we propose an action recognition method built around a spatio-temporal module. Using a residual network as the backbone, the method adds a spatio-temporal module to extract spatial (image) and temporal features, adds RGB difference as a supplementary input to enrich the data, and finally aggregates all feature information with NetVLAD to classify actions. Experimental results show that the multimodal method based on the spatio-temporal module achieves better recognition accuracy.