Abstract: A lightweight semantic segmentation network based on an encoder-decoder architecture with a fusion attention mechanism is proposed to address feature loss and the difficulty of effective bimodal fusion in semantic segmentation of complex indoor scenes. First, two residual networks are used as backbones to extract features from the RGB and depth images, and a polarized self-attention (PSA) module is introduced into the encoder. A bimodal fusion module is then designed to effectively fuse RGB and depth features at different stages, and a context module is introduced to capture dependencies between regions. Finally, three decoders of different sizes fuse the preceding multi-scale feature maps through skip connections to improve the segmentation accuracy of small targets. The proposed network is trained and tested on the NYUDv2 dataset and compared with other advanced RGB-D semantic segmentation networks. The experiments show that the proposed network achieves good segmentation performance.
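For illustration, the sketch below outlines the overall dual-branch encoder-decoder layout described in the abstract: two backbone branches for RGB and depth, a stage-wise fusion step, and a decoder that merges multi-scale features via skip connections. This is a minimal sketch, not the authors' implementation; the module names, channel widths, the simple gated fusion, and the lightweight stand-in backbones are all illustrative assumptions, and the PSA and context modules are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBNReLU(nn.Sequential):
    """3x3 convolution + batch norm + ReLU, used as a stand-in encoder/decoder stage."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )


class FusionBlock(nn.Module):
    """Placeholder bimodal fusion: gate the depth features and add them to RGB."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, rgb, depth):
        return rgb + self.gate(depth) * depth


class DualBranchSegNet(nn.Module):
    def __init__(self, num_classes=40, channels=(64, 128, 256)):
        super().__init__()
        # Two lightweight branches stand in for the residual backbones.
        self.rgb_stages = nn.ModuleList()
        self.depth_stages = nn.ModuleList()
        self.fusions = nn.ModuleList()
        in_rgb, in_depth = 3, 1
        for ch in channels:
            self.rgb_stages.append(ConvBNReLU(in_rgb, ch, stride=2))
            self.depth_stages.append(ConvBNReLU(in_depth, ch, stride=2))
            self.fusions.append(FusionBlock(ch))
            in_rgb = in_depth = ch
        # Decoder stages: upsample, concatenate the skip feature, then convolve.
        self.decoders = nn.ModuleList(
            ConvBNReLU(channels[i] + channels[i - 1], channels[i - 1])
            for i in range(len(channels) - 1, 0, -1)
        )
        self.classifier = nn.Conv2d(channels[0], num_classes, 1)

    def forward(self, rgb, depth):
        skips = []
        x_rgb, x_depth = rgb, depth
        for rgb_stage, depth_stage, fuse in zip(self.rgb_stages, self.depth_stages, self.fusions):
            x_rgb = rgb_stage(x_rgb)
            x_depth = depth_stage(x_depth)
            x_rgb = fuse(x_rgb, x_depth)  # fused features feed the next RGB stage
            skips.append(x_rgb)
        # Decode from the deepest fused feature map, merging skips at each scale.
        x = skips[-1]
        for dec, skip in zip(self.decoders, reversed(skips[:-1])):
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            x = dec(torch.cat([x, skip], dim=1))
        logits = self.classifier(x)
        return F.interpolate(logits, scale_factor=2, mode="bilinear", align_corners=False)


# Example forward pass with an RGB image and a single-channel depth map (NYUDv2-sized input).
model = DualBranchSegNet(num_classes=40)
out = model(torch.randn(1, 3, 480, 640), torch.randn(1, 1, 480, 640))
print(out.shape)  # torch.Size([1, 40, 480, 640])
```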