Abstract: This study addresses the challenging task of indoor crowd counting. In indoor scenes, people often gather and perform similar tasks in constrained spaces. Because the behaviors of indoor crowds are consequently quite similar, it is important to acquire a global receptive field and to identify similarities among indoor crowd features. To this end, this study designs a circular convolution network that combines the advantages of convolutional neural networks and Transformers to capture both local and global correlations among crowd features. Compared with Transformer-based methods, the network adopts a much simpler and more efficient circular convolutional module. Moreover, a novel inverse-transform Bayesian loss function is proposed that suits both sparse and crowded indoor scenes with large scale variations. Finally, to alleviate the influence of annotation deviation, a label diffusion strategy is proposed that expands annotation areas by assuming that the pixels adjacent to each original annotation point may also represent head centers. Compared with the second-best method on the Class A, Class B, Canteen, and Mall datasets, this method improves MAE/RMSE by 4.1%/4.4%, 5.8%/8.0%, 3.9%/1.6%, and 3.9%/1.6%, respectively.
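The label diffusion idea can be illustrated with a minimal sketch. The abstract does not specify the neighborhood shape or any weighting, so the square neighborhood, the function name `diffuse_labels`, and the `radius` parameter below are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np

def diffuse_labels(points, height, width, radius=1):
    """Mark each annotated point and its neighbors as candidate head centers.

    points: iterable of (row, col) integer head annotations.
    radius: hypothetical half-width of the square neighborhood (assumption).
    Returns a boolean mask of shape (height, width).
    """
    mask = np.zeros((height, width), dtype=bool)
    for r, c in points:
        # Clip the neighborhood to the image bounds.
        r0, r1 = max(r - radius, 0), min(r + radius + 1, height)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, width)
        mask[r0:r1, c0:c1] = True
    return mask

# Two annotations on a 5x5 image: one in the interior, one in a corner.
mask = diffuse_labels([(2, 2), (0, 0)], height=5, width=5, radius=1)
```

With `radius=1`, each interior annotation expands from a single pixel to a 3x3 region, so a slightly misplaced annotation can still cover the true head center.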