Abstract: Action recognition aims to enable computers to understand human actions by processing and analyzing video data. Because data of different modalities have different strengths with respect to key characteristics such as appearance, gesture, geometric shape, illumination, and viewpoint, action recognition based on the multi-modality fusion of these features can achieve better performance than recognition based on a single modality. This study presents a comprehensive survey of multi-modality fusion methods for action recognition and compares their characteristics and performance improvements. The methods are divided into late fusion methods and early fusion methods: the former include prediction score fusion, attention mechanisms, and knowledge distillation, and the latter include feature map fusion, convolution, fusion architecture search, and attention mechanisms. Based on this analysis and comparison, future research directions are discussed.