Abstract: To address challenges such as large-scale variation in target segmentation regions, mis-segmentation of lesion areas, and blurred boundaries in skin images, this study proposes a novel skin lesion segmentation method named MSANet. The approach adopts the Pyramid Vision Transformer v2 (PVT v2) as the backbone network, combining the strengths of Transformers and convolutional neural networks (CNNs). By improving the multi-layer fusion decoding strategy, the proposed method significantly enhances the accuracy of skin lesion segmentation. The decoder incorporates a split gated attention (SGA) block to capture multi-scale global and local spatial features, strengthening the model's ability to exploit contextual information, and a multi-scale contextual attention (MCA) module to extract and integrate channel and positional information, improving the network's precision in lesion localization. Experimental results on the ISIC2017 and ISIC2018 datasets show that MSANet achieves Dice scores of 90.12% and 90.91%, and mIoU scores of 85.82% and 84.27%, respectively, outperforming existing methods in segmentation performance.
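To make the gated multi-scale fusion idea concrete, the sketch below shows one common way a gate can blend a shallow (local-detail) feature map with an upsampled deep (global-context) feature map. This is a minimal illustration only, not MSANet's actual SGA block: the abstract does not specify the gating function, so the element-wise sigmoid gate and all tensor shapes here are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(shallow, deep):
    """Blend a shallow (local) and a deep (global) feature map with a
    per-element sigmoid gate -- a generic gated-fusion pattern, used
    here only to illustrate the idea; the gating in MSANet's SGA block
    is not specified in the abstract (assumption)."""
    gate = sigmoid(shallow + deep)               # values in (0, 1)
    return gate * shallow + (1.0 - gate) * deep  # convex combination

# Toy multi-scale features already brought to the same (C, H, W) resolution.
rng = np.random.default_rng(0)
shallow = rng.standard_normal((8, 16, 16))  # high-resolution, local detail
deep = rng.standard_normal((8, 16, 16))     # upsampled, global context
fused = gated_fuse(shallow, deep)
print(fused.shape)  # (8, 16, 16)
```

Because the gate lies in (0, 1), each fused value is a convex combination of the two inputs, letting the decoder favor local detail near boundaries and global context elsewhere.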