Abstract: Multi-organ medical image segmentation can facilitate clinical diagnosis. This study proposes a multi-level feature interaction Transformer model for multi-organ medical image segmentation that addresses the weak global feature extraction of CNNs, the weak local feature extraction of Transformers, and the quadratic computational complexity of the Transformer. The proposed model employs a CNN to extract local features, which a Swin Transformer then transforms into global features. Down-sampling produces multi-level local and global features, and the local and global features at each level interact with and enhance each other. After enhancement, the features at each level are cross-fused by multi-level feature fusion modules. The fused features then pass through up-sampling and segmentation heads to produce segmentation masks. The proposed model is evaluated on the Synapse and ACDC datasets, achieving an average Dice similarity coefficient (DSC) of 80.16% and an average 95th-percentile Hausdorff distance (HD95) of 19.20 mm. These results outperform representative models such as LGNet and RFE-UNet, showing that the proposed model is effective for multi-organ medical image segmentation.
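The pipeline the abstract describes (local branch, global branch, per-level interaction, multi-level fusion, segmentation head) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 3x3 average filter stands in for the CNN branch, a plain single-head self-attention stands in for the Swin Transformer's windowed attention, and all function names, shapes, and the elementwise-sum "interaction" are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_branch(x):
    """Stand-in for the CNN branch: a 3x3 average filter (toy local feature)."""
    h, w = x.shape
    padded = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + 3, j:j + 3].mean()
    return out

def global_branch(x):
    """Stand-in for the Swin Transformer branch: plain single-head
    self-attention over all pixels (windowing/shifting omitted)."""
    h, w = x.shape
    tokens = x.reshape(-1, 1)                 # (hw, 1) pixel tokens
    scores = tokens @ tokens.T                # (hw, hw) similarities
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ tokens).reshape(h, w)

def downsample(x):
    return x[::2, ::2]                        # stride-2 subsampling

def upsample(x):
    return np.kron(x, np.ones((2, 2)))        # nearest-neighbor 2x upsampling

# Toy forward pass over 3 levels on an 8x8 "image".
x = rng.standard_normal((8, 8))
levels = []
cur = x
for _ in range(3):
    enhanced = local_branch(cur) + global_branch(cur)  # toy interaction/enhancement
    levels.append(enhanced)
    cur = downsample(enhanced)

# Toy multi-level fusion: upsample deeper levels to full resolution and sum.
fused = levels[0]
for lv in levels[1:]:
    while lv.shape != fused.shape:
        lv = upsample(lv)
    fused = fused + lv

# Trivial "segmentation head": threshold at the mean to get a binary mask.
mask = (fused > fused.mean()).astype(np.uint8)
print(mask.shape)
```

The real model replaces each stand-in with learned modules (convolution blocks, windowed attention, learned fusion and decoder heads); the sketch only mirrors the data flow of down-sampling, per-level interaction, cross-level fusion, and mask prediction.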