Abstract: Transformer-based object detection algorithms often suffer from insufficient accuracy and slow convergence. Although many studies have proposed improvements that achieve some success, most overlook two key shortcomings of applying the Transformer architecture to object detection. First, the results of self-attention computation lack diversity. Second, owing to the complexity of set prediction, the models are unstable during target matching. To overcome these deficiencies, this study proposes several enhancements. First, an adaptive token pooling module is designed to increase the diversity of self-attention weights. Second, an anchor-box localization module based on rough predictions is introduced, providing positional priors for queries to stabilize bipartite matching. Finally, a group-based denoising task is designed that trains the model to distinguish positive from negative queries near each target, improving its ability to perform set prediction. Experimental results on the COCO dataset show that the proposed algorithm significantly outperforms the baseline model in both detection accuracy and convergence speed.