Abstract: Crowd counting is complicated by non-uniform head sizes, uneven crowd density distribution, and complex background interference. To address these challenges, we propose a multi-scale feature weighted fusion attention convolutional neural network (MSFANet), a convolutional neural network (CNN) model that focuses on crowd regions and handles multi-scale variation. The front end of the network uses an improved VGG-16 model to extract coarse-grained features from the input crowd image. A multi-scale feature extraction module in the middle of the network captures the multi-scale feature information of the image, and an attention module then weights these multi-scale features. At the back end, a sawtooth-shaped dilated convolution module enlarges the receptive field, extracts detailed image features, and generates high-quality crowd density maps. The model is evaluated on three public datasets. On the Shanghai Tech Part B dataset, the mean absolute error (MAE) is reduced to 7.8 and the mean squared error (MSE) to 12.5. On the Shanghai Tech Part A dataset, the MAE is reduced to 64.9 and the MSE to 108.4. On the UCF_CC_50 dataset, the MAE is reduced to 185.1 and the MSE to 249.8. These results indicate that the proposed model is both accurate and robust.
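As a reading aid for the reported figures, the following minimal sketch shows how counting errors of this kind are typically computed from predicted and ground-truth head counts. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `count_from_density_map` and `counting_errors` and the sample numbers are hypothetical. The abstract does not state whether its "MSE" is the plain mean of squared count errors or its square root (many crowd-counting papers report the root under the name MSE), so the sketch returns both.

```python
import math


def count_from_density_map(density_map):
    """Estimate the head count of one image by summing its density map.

    `density_map` is a 2-D list of non-negative floats (hypothetical input).
    """
    return sum(sum(row) for row in density_map)


def counting_errors(predicted_counts, true_counts):
    """Return (MAE, mean squared error, root mean squared error) over a test set."""
    if not true_counts or len(predicted_counts) != len(true_counts):
        raise ValueError("count lists must be non-empty and of equal length")

    # Per-image difference between estimated and annotated counts.
    diffs = [p - t for p, t in zip(predicted_counts, true_counts)]
    n = len(diffs)

    mae = sum(abs(d) for d in diffs) / n       # mean absolute error
    mean_sq = sum(d * d for d in diffs) / n    # mean of squared errors
    return mae, mean_sq, math.sqrt(mean_sq)    # root is often reported as "MSE"


# Made-up example values, not taken from any dataset.
predicted = [10.0, 20.0, 30.0]
annotated = [12.0, 18.0, 33.0]
mae, mean_sq, root_mean_sq = counting_errors(predicted, annotated)
print(f"MAE={mae:.3f}  mean squared={mean_sq:.3f}  root={root_mean_sq:.3f}")
```

MAE reflects the average size of the counting error, while the squared-error variants penalize large per-image errors more heavily, which is why the two are usually read together as indicators of accuracy and robustness.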