Abstract: This study investigates isolated gesture recognition on the SKIG RGB-D multimodal video dataset. The RGB and depth videos are first decoded into image frames. From each video, 32 frames are sampled and fed to a densely connected 3D CNN component that learns short-term spatiotemporal features; these features are then passed to a convolutional GRU that learns long-term spatiotemporal features. Finally, the networks trained on each single modality are combined through multimodal fusion to improve recognition accuracy. The approach reaches 99.07% recognition accuracy on the SKIG dataset, which supports the validity of the proposed network model.
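The two non-learned steps of this pipeline, picking 32 frames per video and fusing the single-modality networks, can be illustrated with a minimal sketch. The abstract does not state the sampling strategy or the fusion rule, so uniform sampling and a weighted average of per-class probabilities are assumptions made here for illustration, and the names `sample_frame_indices`, `fuse_scores`, and `predict` are hypothetical.

```python
from typing import List, Sequence


def sample_frame_indices(num_frames: int, clip_len: int = 32) -> List[int]:
    """Pick clip_len frame indices spread evenly over a video.

    Assumption: uniform sampling. Videos shorter than clip_len
    simply repeat some frames.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    # Take the centre of each of clip_len equal-width segments.
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / clip_len))
        for i in range(clip_len)
    ]


def fuse_scores(
    rgb_probs: Sequence[float],
    depth_probs: Sequence[float],
    rgb_weight: float = 0.5,
) -> List[float]:
    """Late fusion: weighted average of the two modalities' class probabilities.

    Assumption: the fusion rule is a simple weighted average; the
    abstract only says that the single-modality networks are fused.
    """
    if len(rgb_probs) != len(depth_probs):
        raise ValueError("both modalities must score the same classes")
    return [
        rgb_weight * r + (1.0 - rgb_weight) * d
        for r, d in zip(rgb_probs, depth_probs)
    ]


def predict(fused_probs: Sequence[float]) -> int:
    """Return the index of the highest-scoring class."""
    return max(range(len(fused_probs)), key=lambda k: fused_probs[k])
```

For example, `sample_frame_indices(100)` yields 32 indices covering the whole video, and `predict(fuse_scores(rgb, depth))` gives the fused class decision for one gesture clip, where `rgb` and `depth` stand for the probability vectors produced by the two trained networks.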