Abstract: As a prominent topic in human-computer interaction, emotion recognition has been applied in many fields, such as medicine, education, safe driving, and e-commerce. Emotions are expressed mainly through facial expressions, voice, and discourse, and cues such as facial muscle movements, tone, and intonation vary with the emotion being expressed. Consequently, emotion recognition based on a single modal feature is often inaccurate. Since expressed emotions are perceived chiefly through vision and hearing, this study proposes a multimodal expression recognition algorithm based on an audiovisual perception system. Specifically, emotion features are first extracted from the speech and image modalities, and multiple classifiers are designed for emotion classification experiments on each individual feature, yielding several single-feature expression recognition models. For the multimodal experiments on speech and images, a late fusion strategy is adopted to combine the modalities. Because the individual models are only weakly dependent, weighted voting is used for model fusion, producing an integrated expression recognition model built from the multiple single-feature models. Experiments are conducted on the AFEW facial expression dataset. Comparing the recognition results of the integrated model with those of the single-feature models verifies that multimodal emotion recognition based on the audiovisual perception system outperforms single-modal emotion recognition.
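
To make the decision-level fusion concrete, the sketch below illustrates the kind of weighted voting the abstract describes. It is a minimal illustration, not the authors' implementation: it assumes each single-feature model outputs a probability distribution over the emotion classes and that each model's weight is derived from its validation accuracy; the function name, weights, and probability values are all hypothetical.

    import numpy as np

    def weighted_vote(prob_outputs, weights):
        # Fuse per-model class probabilities by weighted voting.
        # prob_outputs: one array of shape (n_classes,) per
        #   single-feature model (e.g. speech, image).
        # weights: one non-negative weight per model, assumed here
        #   to come from each model's validation accuracy.
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()              # normalize weights
        fused = sum(w * p for w, p in zip(weights, prob_outputs))
        return int(np.argmax(fused))                   # predicted class index

    # Hypothetical scores over the seven AFEW emotion classes
    # (angry, disgust, fear, happy, neutral, sad, surprise).
    speech_probs = np.array([0.10, 0.05, 0.40, 0.15, 0.10, 0.10, 0.10])
    image_probs  = np.array([0.05, 0.10, 0.55, 0.10, 0.05, 0.05, 0.10])

    label = weighted_vote([speech_probs, image_probs], weights=[0.38, 0.45])
    print(label)  # -> 2 (fear, in the ordering above)

Because the fusion operates on model outputs rather than raw features, each single-feature model can be trained and tuned independently, which is what makes this a late (decision-level) strategy rather than feature-level fusion.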