Abstract: In high-density scenes, the appearance of crowd targets varies widely in scale, which degrades recognition accuracy. To address this problem, this study proposes two multi-scale feature fusion structures: an attention-weighted fusion (AWF) module and a bottom-up fusion (BUF) module. The AWF module uses an attention branch to learn a weight for each feature map and then sums the weighted multi-scale features. The BUF module uses dilated convolution to capture additional scale information during feature processing and merges the shallow feature maps by concatenation. The resulting feature maps are more expressive, and the predicted density maps are more accurate. With ResNet50 as the backbone network for feature extraction, the proposed algorithm applies the AWF and BUF modules separately for feature fusion, and experiments are conducted on public datasets. With the AWF module, the crowd counting algorithm reduces the mean absolute error (MAE) to 45.54 (Part A) and 7.6 (Part B) and the mean square error (MSE) to 100.28 (Part A) and 11.4 (Part B) on the ShanghaiTech dataset; on the UCF_CC_50 dataset, the MAE and MSE are reduced to 212.42 and 323.06, respectively. With the BUF module, the MAE is reduced to 51.6 (Part A) and 8.0 (Part B) and the MSE to 102 (Part A) and 12.8 (Part B) on the ShanghaiTech dataset; on the UCF_CC_50 dataset, the MAE and MSE are reduced to 242.6 and 359.5, respectively. The experiments indicate that both the AWF and BUF modules effectively integrate deep and shallow feature information, thereby refining the feature maps and improving counting accuracy.
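As an illustration of the attention-weighted fusion idea described above, the following is a minimal NumPy sketch and not the implementation used in this study. It assumes that the multi-scale feature maps have already been resized to a common resolution, that the attention branch outputs one logit map per scale, and that the weights are normalized with a softmax across scales. The normalization choice, the tensor layout, and the function names `softmax` and `attention_weighted_fusion` are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention_weighted_fusion(features, attention_logits):
    """Fuse S feature maps by a per-pixel weighted sum.

    features:         array of shape (S, C, H, W), one map per scale
    attention_logits: array of shape (S, 1, H, W), assumed to be the
                      output of an attention branch (hypothetical layout)
    Returns an array of shape (C, H, W).
    """
    # Assumption: weights are normalized across the scale axis with softmax,
    # so the weights at every pixel sum to 1.
    weights = softmax(attention_logits, axis=0)
    # Broadcast each weight map over the channel axis, then sum over scales.
    return (weights * features).sum(axis=0)


# Usage with random stand-in data: 3 scales, 4 channels, 8x8 maps.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4, 8, 8))
logits = rng.standard_normal((3, 1, 8, 8))
fused = attention_weighted_fusion(feats, logits)
print(fused.shape)  # (4, 8, 8)
```

With equal logits for all scales, the fusion reduces to a plain average of the feature maps; a learned attention branch departs from this average by emphasizing the scale that best matches the local crowd density.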