Abstract: This study proposes a scene segmentation method driven by cross-feature fusion and an RASPP module to address the edge segmentation errors and feature discontinuities caused by target diversity and scale inconsistency in scenes. The method combines the multi-scale features output by the encoder through cross-feature fusion and employs a compound convolutional attention module to fuse high-level semantic information, which avoids the feature information loss introduced by upsampling, suppresses the influence of noise, and refines the segmentation of target edges. In addition, this study proposes a depthwise separable convolution with residual connections and, on this basis, designs and implements a residual pyramid pooling module (RASPP) that processes the cross-fused features, captures contextual information at different scales, and enhances the semantic expressiveness of the features. Finally, the features processed by the RASPP module are merged to improve the segmentation results. Experimental results on the Cityscapes and CamVid datasets show that the proposed method outperforms existing methods and achieves better segmentation of target edges in the scenes.
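The building block named in the abstract, a depthwise separable convolution wrapped in a residual connection, can be sketched as below. This is a minimal NumPy illustration under stated assumptions (single feature map, "same" padding, identity shortcut); the function and parameter names are hypothetical and do not come from the paper's implementation.

```python
import numpy as np

def depthwise_separable_residual(x, dw_kernels, pw_weights):
    """Depthwise separable convolution with a residual (identity) connection.

    x          : (C, H, W) input feature map
    dw_kernels : (C, k, k) one spatial kernel per channel (depthwise step)
    pw_weights : (C, C)    1x1 pointwise matrix that mixes channels
    Returns x + pointwise(depthwise(x)), same shape as x ('same' padding).
    """
    C, H, W = x.shape
    k = dw_kernels.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))

    # Depthwise step: each channel is convolved with its own spatial kernel,
    # so no information is exchanged across channels here.
    dw = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                dw[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * dw_kernels[c])

    # Pointwise step: a 1x1 convolution mixes channels at every location.
    pw = np.einsum('oc,chw->ohw', pw_weights, dw)

    # Residual connection: add the input back, easing gradient flow and
    # preserving the original features alongside the convolved ones.
    return x + pw
```

With an identity depthwise kernel (a single 1 at the center) and an identity pointwise matrix, the block reduces to `x + x`, which is a quick way to sanity-check the shapes and the shortcut path.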