Abstract: Image segmentation of marine organisms is fundamental to intelligent ocean monitoring but remains challenging due to cross-modal semantic deviation, inefficient multi-scale fusion, and insufficient modeling of biological structures. To address these challenges, this study proposes Mseg, a CLIP-based multimodal semantic segmentation framework for the effective segmentation of unseen categories. The method integrates visual image features with textual category descriptions, employing a lightweight cross-attention (LCA) mechanism and a multi-level feature fusion strategy to guide the interaction between visual and textual representations and thereby generate semantically enriched image representations. A BalanceITV module is then introduced to dynamically weight and adaptively balance the two feature streams, namely the backbone visual features and the language-guided features. In addition, an uncertainty modeling method based on marine organism morphology perception is designed to improve segmentation precision and robustness, particularly in boundary regions and areas with complex biological structures. Experiments on multiple marine organism datasets show that Mseg consistently outperforms existing methods on zero-shot segmentation tasks, demonstrating strong adaptability and effectiveness in complex underwater environments.
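The dynamic weighting of the two feature streams described for the BalanceITV module can be illustrated with a minimal gated-fusion sketch. Note this is an assumption-laden illustration, not the paper's implementation: the function name `balance_fusion`, the sigmoid gate, and the concatenation-based gating weights are all hypothetical.

```python
import numpy as np

def balance_fusion(visual_feat, lang_feat, w_gate, b_gate):
    """Gated fusion of two feature streams (hypothetical sketch).

    A per-channel sigmoid gate, computed from the concatenated
    streams, adaptively weights the backbone visual features
    against the language-guided features.
    """
    concat = np.concatenate([visual_feat, lang_feat], axis=-1)
    # Sigmoid keeps each gate value in (0, 1), so the output is a
    # convex combination of the two streams at every position.
    gate = 1.0 / (1.0 + np.exp(-(concat @ w_gate + b_gate)))
    return gate * visual_feat + (1.0 - gate) * lang_feat

# Example: fuse 8 tokens with 16-dimensional features.
rng = np.random.default_rng(0)
v = rng.standard_normal((8, 16))   # backbone visual features
t = rng.standard_normal((8, 16))   # language-guided features
w = rng.standard_normal((32, 16)) * 0.1
b = np.zeros(16)
fused = balance_fusion(v, t, w, b)
print(fused.shape)  # (8, 16)
```

In practice such a gate would be learned end to end; the sketch only shows how a convex per-channel weighting keeps the fused representation bounded by the two input streams.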