Abstract: To address the problem of imbalanced sample distribution in Speech Emotion Recognition (SER) datasets, this study proposes an SER method that combines a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) units with data balancing and an attention mechanism. The method first extracts the log-Mel spectrogram from each sample in a speech emotion dataset and divides the samples into segments according to the class distribution to balance the data. It then fine-tunes a pre-trained CNN model on the segmented Mel-spectrogram dataset to learn high-level segment features. Next, because different segments of an utterance contribute differently to emotion recognition, the learned segment-level CNN features are fed into an LSTM with an attention mechanism to learn discriminative features, and speech emotions are classified with the LSTM and Softmax layers. Experimental results on the BAUM-1s and CHEAVD 2.0 datasets show that the proposed method performs substantially better than conventional methods.
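The log-Mel extraction step summarized above can be sketched in plain NumPy; the function name, sampling rate, frame length, hop size, and number of Mel filters below are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Illustrative log-Mel spectrogram of a 1-D waveform (frames x mel bands)."""
    # Frame the signal and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular Mel filter bank spanning 0 Hz to the Nyquist frequency.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Log compression; a small epsilon avoids log(0).
    return np.log(power @ fbank.T + 1e-10)
```

In the described pipeline, each utterance would be cut into fixed-length segments of such a spectrogram before being fed to the fine-tuned CNN; the segmentation scheme itself depends on the dataset's class distribution.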