Abstract: Accurate identification of tissues, organs, and lesion regions is one of the most important tasks in medical image analysis. Models based on the U-Net architecture dominate existing research on semantic segmentation of medical images. TransUNet combines the strengths of CNNs and Transformers, capturing long-range dependencies while also extracting local features, but it is still not accurate enough at extracting and recovering the locations of features. To address this problem, MAF-TransUNet, a medical image segmentation model with a multi-attention fusion mechanism, is proposed. The model first adds a multi-attention fusion (MAF) module before the Transformer layer to enhance the representation of location information. It then applies the MAF module again in the skip connections so that location information is passed efficiently to the decoder. Finally, a deep convolutional attention (DCA) module is used in the decoding stage to retain more spatial information. Experimental results show that, compared with TransUNet, MAF-TransUNet improves the Dice coefficient by 3.54% on the Synapse multi-organ segmentation dataset and by 0.88% on the ACDC automated cardiac diagnosis dataset.
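
For readers unfamiliar with the evaluation metric, the Dice coefficient for a single binary mask is 2|P ∩ T| / (|P| + |T|), where P is the predicted region and T is the ground truth. The following is a minimal sketch of that standard definition only. The function name `dice_coefficient` and the flat 0/1 list representation are illustrative assumptions; the abstract does not specify how scores are averaged across organs or volumes, so this should not be read as the evaluation code behind the reported numbers.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two flat binary masks given as sequences of 0/1.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels labelled foreground in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)

    # Assumption: two empty masks count as perfect agreement.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example: 3 predicted pixels, 3 true pixels, 2 of them shared.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, target)
```

In a multi-organ setting such as Synapse, a score like this would typically be computed per class and then averaged, but the exact protocol is an assumption here.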