Abstract: Pixels in hyperspectral images exhibit strong spectral correlation, which makes them prone to the "same spectrum, different objects" problem during recognition. In addition, the high dimensionality of the spectral bands makes it difficult for traditional models to effectively associate spectral and spatial features. To address these issues, the SSG-VIT model is proposed, integrating hierarchical depthwise separable convolution, graph convolution, and group separable self-attention (GSA) for multi-scale feature fusion. Specifically, hierarchical depthwise separable convolution extracts local spatial features at multiple scales using different kernel sizes, while GSA captures global spatial relationships. A graph convolution module models structured spectral features, eliminates redundant information, and enhances spectral feature representation. Finally, an adaptive feature fusion (AFF) mechanism integrates the spatial and spectral features. The proposed model is evaluated on three hyperspectral datasets: Indian Pines, Salinas, and Botswana. Across multiple experiments, the overall accuracies (OA) achieved are 99.32%, 99.67%, and 99.69%, respectively.
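The abstract does not specify implementation details, so the following is only a minimal sketch, assuming a PyTorch implementation, of how a multi-scale depthwise separable convolution branch with a simple learned fusion of scales could look. The kernel sizes, channel counts, patch size, and fusion scheme are illustrative assumptions, not the authors' actual SSG-VIT configuration.

```python
# Hypothetical sketch of a multi-scale depthwise separable convolution branch
# with a softmax-weighted fusion of scales. All hyperparameters are assumptions.
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise (per-channel) convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, channels, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv2d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class MultiScaleSpatialBranch(nn.Module):
    """Extracts local spatial features at several scales and fuses them
    with learned, softmax-normalized weights (a simple stand-in for an
    adaptive feature fusion step)."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            DepthwiseSeparableConv(channels, k) for k in kernel_sizes)
        self.fusion_weights = nn.Parameter(torch.zeros(len(kernel_sizes)))

    def forward(self, x):
        w = torch.softmax(self.fusion_weights, dim=0)
        feats = [branch(x) for branch in self.branches]
        return sum(wi * f for wi, f in zip(w, feats))


# Example: a batch of 4 spatial patches with 30 spectral channels (e.g. after
# dimensionality reduction); output keeps the same shape.
x = torch.randn(4, 30, 11, 11)
out = MultiScaleSpatialBranch(channels=30)(x)
print(out.shape)  # torch.Size([4, 30, 11, 11])
```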