Abstract: To address the challenges in skeleton-based action recognition posed by complex actions and ambiguous samples, this study proposes a co-optimization framework that combines adaptive graph topology refinement (AGTR) and cross-sequence contrastive learning (CSCL). AGTR leverages multi-head attention to dynamically construct joint connectivity graphs, overcoming the limitations of fixed skeleton structures and enabling the decoupling of multi-view features. CSCL integrates segment-level, instance-level, and prototype-level contrastive losses, coupled with dynamic hard sample mining, to improve the modeling of temporal semantic consistency and long-tailed class distributions. Extensive experiments on the NTU RGB+D 120 dataset show that the proposed method achieves an accuracy of 89.8%, surpassing the hypergraph- and Transformer-based Hyperformer (86.9%) by 2.9 percentage points. It also improves robustness under noise and occlusion by 18.8%, while remaining efficient (3.1 GFLOPs, 25 FPS). This study offers a high-accuracy, interpretable, and deployable solution for complex action recognition, with significant potential in intelligent healthcare and industrial human-robot interaction.
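The abstract does not specify AGTR's exact formulation, but its core idea of using multi-head attention to construct a dynamic joint connectivity graph can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function and weight names are hypothetical, the 25-joint count follows the NTU RGB+D skeleton, and the learned adjacency is simply the row-normalised attention map per head.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_adjacency(X, Wq, Wk):
    """Sketch of an attention-derived adjacency (illustrative, not the paper's code).

    X  : (J, d)  per-joint features for one frame/window
    Wq : (H, d, d_h) query projections, one per attention head
    Wk : (H, d, d_h) key projections, one per attention head
    Returns (H, J, J): one data-dependent adjacency matrix per head,
    replacing a fixed skeleton graph with a learned topology.
    """
    heads = []
    for Wq_h, Wk_h in zip(Wq, Wk):
        Q = X @ Wq_h                                # (J, d_h)
        K = X @ Wk_h                                # (J, d_h)
        scores = Q @ K.T / np.sqrt(Q.shape[-1])     # scaled dot-product, (J, J)
        heads.append(softmax(scores, axis=-1))      # rows sum to 1
    return np.stack(heads)

# Toy example: 25 joints (NTU skeleton), 4 heads, random features/weights.
rng = np.random.default_rng(0)
J, d, d_h, H = 25, 64, 16, 4
X = rng.standard_normal((J, d))
Wq = rng.standard_normal((H, d, d_h))
Wk = rng.standard_normal((H, d, d_h))
A = attention_adjacency(X, Wq, Wk)
print(A.shape)  # (4, 25, 25)
```

Each head yields a distinct joint-to-joint graph, which is what allows the model to capture connectivity patterns (e.g. hand-to-head links during "drinking") that a fixed anatomical skeleton cannot express.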