Abstract: Violent behavior in video is easily occluded, which lowers recognition accuracy. Some existing algorithms introduce multi-view video input to mitigate occlusion, but they fuse the data from all views with equal weight, even though views differ in recognition value because of shooting distance and the occlusion itself. To address this problem, this study proposes a violence recognition method based on view confidence and attention. The input of the temporal difference module (TDM) is extended to multiple views, and a channel attention mechanism is applied along the segment dimension to strengthen TDM's cross-segment feature extraction. A background suppression method highlights the texture features of moving objects and yields an image confidence for each view. Bilinear pooling is then introduced to fuse the multi-view video features, with the local features of each view weighted according to its view confidence. The method is validated on both the public CASIA-Action dataset and a self-built dataset. Experiments show that the proposed view-confidence method outperforms the original bilinear pooling method, and its violence recognition accuracy exceeds that of existing behavior recognition methods.
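The fusion step described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the confidence measure here is a simple proxy (the fraction of pixels that survive background suppression), and `view_confidence` and `weighted_bilinear_pool` are hypothetical names introduced for illustration. The signed-square-root and L2 normalization steps are a common convention in bilinear pooling, assumed rather than taken from the paper.

```python
import numpy as np

def view_confidence(frame, background, thresh=25.0):
    # Illustrative proxy for image confidence: after background suppression,
    # the fraction of pixels whose difference from the background exceeds
    # a threshold. The paper's actual confidence formula may differ.
    diff = np.abs(frame.astype(np.float32) - background.astype(np.float32))
    return float((diff > thresh).mean())

def weighted_bilinear_pool(feat_a, feat_b, conf_a, conf_b):
    # Bilinear pooling of two views' feature vectors, with each view's
    # local features scaled by its normalized confidence weight.
    total = conf_a + conf_b
    w_a, w_b = conf_a / total, conf_b / total
    outer = np.outer(w_a * feat_a, w_b * feat_b)  # C_a x C_b interaction matrix
    v = outer.flatten()
    # Signed square-root and L2 normalization (common bilinear-pooling practice).
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / (np.linalg.norm(v) + 1e-12)

# Toy usage: a view whose frame differs more from the background gets
# higher confidence and therefore a larger share of the fused descriptor.
bg = np.zeros((8, 8))
frame = bg.copy()
frame[:4, :] = 255.0            # moving object covers the top half
conf = view_confidence(frame, bg)
fused = weighted_bilinear_pool(np.ones(4), np.ones(3), conf, 0.1)
```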