Abstract: Object images in the real world often exhibit large intra-class variation, so describing an entire category with a single prototype leads to semantic ambiguity. To address this, a superpixel-based multi-prototype generation module is proposed: multiple prototypes represent different semantic regions of an object, and a graph neural network propagates context among the generated prototypes to correct them and keep the sub-prototypes orthogonal. To obtain a more accurate prototype representation, a Transformer-based semantic alignment module is designed to mine the semantic information contained in the query-image features and the background features of the support images. In addition, a multi-scale feature fusion structure is proposed to guide the model to focus on features that appear in both the support and query images, improving robustness to changes in object scale. The proposed model is evaluated on the PASCAL-5i dataset, improving the mean intersection over union (mIoU) by 6% over the baseline model.
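To make the multi-prototype idea concrete, the sketch below shows one plausible realization of superpixel-based prototype generation via masked average pooling. This is a minimal illustration, not the authors' implementation; the function name, tensor shapes, and the masked-pooling formulation are assumptions, and the GNN-based context correction is omitted.

```python
# Minimal sketch (assumed, not the paper's code): pool one sub-prototype per
# superpixel region that overlaps the support object's foreground mask.
import torch

def superpixel_prototypes(feat, superpixels, fg_mask, min_pixels=1):
    """feat: (C, H, W) support feature map
    superpixels: (H, W) integer superpixel labels
    fg_mask: (H, W) binary foreground mask of the support object
    Returns a (K, C) tensor with one prototype per overlapping superpixel."""
    protos = []
    for label in superpixels.unique():
        # Restrict each superpixel region to the annotated foreground.
        region = (superpixels == label) & fg_mask.bool()
        if region.sum() >= min_pixels:
            # Masked average pooling: mean feature over the region pixels.
            protos.append(feat[:, region].mean(dim=1))
    return torch.stack(protos)  # (K, C)
```

Each resulting sub-prototype describes one semantic region of the object rather than the whole category; in the paper's pipeline these would then be refined by the graph neural network before matching against query features.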