Abstract: Accurate recognition of emotion in speech can greatly improve the efficiency of human-computer interaction. Current speech emotion recognition systems mainly consist of two steps: speech feature extraction and speech feature classification. To improve recognition accuracy, this work uses the spectrogram as the model input instead of traditional acoustic features, and adopts an attention-based CGRU network to extract the frequency-domain and time-domain information in the spectrogram. The experimental results show that introducing the attention mechanism helps reduce the interference of redundant information, and that, compared with a model based on the LSTM network, the model using the GRU network converges faster during training and achieves higher prediction accuracy. In addition, the training time of the GRU-based model is only 60% of that of the LSTM-based baseline model.
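To illustrate how an attention mechanism can down-weight redundant frames, the following is a minimal sketch of attention-weighted pooling over per-frame features, written in plain Python. It is not the model described here: the dot-product scoring, the `query` vector, the function names, and the toy values are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse T per-frame feature vectors into a single vector.

    frames: list of T vectors (e.g. recurrent-layer outputs, one per time step)
    query:  hypothetical learned vector of the same dimension (an assumption)
    """
    # Score each frame by its dot product with the query vector.
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted sum over time: frames with low weight contribute little.
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Toy input: three frames with two features each (made-up values).
frames = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
query = [1.0, 1.0]
pooled, weights = attention_pool(frames, query)
```

In this toy example the second frame receives the largest weight, so the pooled vector is dominated by it, while the two near-silent frames are largely suppressed.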