Abstract: Convolutional neural networks (CNNs) with local attention mechanisms have achieved sound results in feature extraction for kinship recognition. However, improvements to CNN-based backbone models have been limited, and few researchers have employed self-attention mechanisms, which are able to capture global information. Therefore, an S-ViT model built on a convolution-free backbone feature extraction network is proposed, adopting the Vision Transformer, with its global self-attention mechanism, as the basic backbone feature extractor. By constructing a Siamese (twin) network and combining it with a CNN with a local attention mechanism, the traditional classification network is extended to address kinship recognition tasks. Experimental results show that, compared with the leading methods of the RFIW2020 Challenge, the proposed method performs well on the three kinship recognition tasks: it ranks second on the first task with a verification accuracy of 76.8%, and third on the second and third tasks. These results demonstrate the feasibility and effectiveness of the method and offer a new solution to kinship recognition.
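As a rough illustration of the architecture summarized above, the sketch below pairs a Vision Transformer backbone with a Siamese comparison head for pairwise kinship verification. This is a minimal sketch under stated assumptions, not the authors' implementation: the choice of timm's `vit_base_patch16_224` as the convolution-free backbone, the absolute-difference and product feature fusion, and the head sizes are all illustrative assumptions not taken from the paper.

```python
# Hypothetical sketch of a Siamese kinship-verification model with a ViT backbone.
# Assumptions (not from the paper): timm's "vit_base_patch16_224" backbone,
# |a-b| + a*b feature fusion, and a small MLP head producing a kin / non-kin score.

import torch
import torch.nn as nn
import timm


class SiameseViT(nn.Module):
    def __init__(self, backbone_name: str = "vit_base_patch16_224", embed_dim: int = 768):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of class logits.
        self.backbone = timm.create_model(backbone_name, pretrained=False, num_classes=0)
        # Fusion + classification head over the paired embeddings.
        self.head = nn.Sequential(
            nn.Linear(embed_dim * 2, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 1),  # kinship score; train with BCEWithLogitsLoss
        )

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
        # Shared weights: both face images pass through the same backbone (twin network).
        feat_a = self.backbone(img_a)
        feat_b = self.backbone(img_b)
        # Symmetric fusion so the score does not depend on input order.
        fused = torch.cat([(feat_a - feat_b).abs(), feat_a * feat_b], dim=-1)
        return self.head(fused)


if __name__ == "__main__":
    model = SiameseViT()
    a = torch.randn(2, 3, 224, 224)  # batch of parent face crops
    b = torch.randn(2, 3, 224, 224)  # batch of child face crops
    scores = model(a, b)
    print(scores.shape)  # torch.Size([2, 1])
```

The twin structure mirrors the pairwise nature of kinship verification: both face images pass through the same ViT feature extractor, and only the fused pair embedding is classified as kin or non-kin.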