Abstract: This study addresses the joint detection of traffic signs and traffic signals under complex and variable traffic conditions, where harsh weather, low lighting, and background interference degrade detection accuracy. To this end, an improved RT-DETR network is proposed. Targeting resource-limited operating environments, the network adopts ResNet with PConv and efficient multi-scale attention (PE-ResNet) as its backbone to strengthen the detection of occluded and small targets. To improve feature fusion, a new cross-scale feature-fusion module (NCFM) is introduced, which better integrates the semantic and detailed information in an image and gives a more complete representation of complex scenes. In addition, the MPDIoU loss function is used to measure the positional relationship between predicted and ground-truth boxes more accurately. The improved network reduces the parameter count by approximately 14% compared with the baseline model. On the CCTSDB 2021 dataset, the S2TLD dataset, and the self-developed multi-scene traffic signs (MTST) dataset, mAP50:95 increases by 1.9%, 2.2%, and 3.7%, respectively. The experimental results show that the improved RT-DETR model raises detection accuracy in complex scenarios.
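For readers unfamiliar with MPDIoU, the following minimal sketch shows how such a loss is commonly computed for a single pair of boxes. It follows the usual definition (IoU minus the squared distances between the two top-left corners and between the two bottom-right corners, each normalized by the squared image diagonal) and is not taken from the authors' implementation. The function name `mpdiou_loss`, the `(x1, y1, x2, y2)` box format, and the example image size are assumptions made for illustration.

```python
def mpdiou_loss(pred, target, img_w, img_h):
    """Illustrative MPDIoU loss for one box pair (hypothetical helper).

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2, in pixels.
    img_w and img_h are the width and height of the input image.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    union = area_p + area_t - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distances between matching corners.
    d1_sq = (px1 - tx1) ** 2 + (py1 - ty1) ** 2  # top-left corners
    d2_sq = (px2 - tx2) ** 2 + (py2 - ty2) ** 2  # bottom-right corners

    # Normalize by the squared diagonal of the input image.
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1_sq / norm - d2_sq / norm
    return 1.0 - mpdiou


# Example with an assumed 640x640 input image.
loss_same = mpdiou_loss((10, 10, 50, 50), (10, 10, 50, 50), 640, 640)
loss_shifted = mpdiou_loss((10, 10, 50, 50), (20, 20, 60, 60), 640, 640)
loss_disjoint = mpdiou_loss((0, 0, 10, 10), (100, 100, 120, 120), 640, 640)
```

Unlike plain IoU, the corner-distance terms stay informative when the boxes do not overlap, so the loss still grows as a predicted box moves farther from its target.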