In this work, we design a neural network for recognizing emotions in speech, using the IEMOCAP dataset. Following the latest advances in audio analysis, we use an architecture involving both convolutional layers, for extracting high-level features from raw spectrograms, and recurrent ones, for aggregating long-term dependencies. We examine the techniques of data augmentation with vocal tract length perturbation, layer-wise optimizer adjustment, and batch normalization of recurrent layers, and obtain highly competitive results of 64.5% weighted accuracy and 61.7% unweighted accuracy on four emotions.
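To make the description concrete, the following is a minimal PyTorch sketch of the general CNN+LSTM idea: convolutional layers extract local time-frequency features from a spectrogram, a bidirectional LSTM aggregates them over time, and a linear layer maps the pooled sequence to four emotion classes. The vtlp_warp helper is only a crude frequency-axis approximation of vocal tract length perturbation, and all layer counts, channel sizes, and pooling choices are illustrative assumptions rather than the paper's exact configuration (in particular, batch normalization is applied here to the convolutional stack, not the recurrent layers).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def vtlp_warp(spec, alpha):
        # spec: (batch, 1, freq, time). Stretch the frequency axis by a factor
        # alpha (e.g. drawn uniformly from [0.9, 1.1]) and crop/zero-pad back
        # to the original number of bins -- a crude stand-in for vocal tract
        # length perturbation; the paper's exact warping function is not given here.
        b, c, f, t = spec.shape
        stretched = F.interpolate(spec, size=(max(1, int(f * alpha)), t),
                                  mode="bilinear", align_corners=False)
        out = spec.new_zeros(b, c, f, t)
        n = min(f, stretched.shape[2])
        out[:, :, :n] = stretched[:, :, :n]
        return out

    class CnnLstmSer(nn.Module):
        def __init__(self, n_mels=128, n_classes=4, hidden=128):
            super().__init__()
            # Convolutional front end over (batch, 1, freq, time) spectrograms.
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1),
                nn.BatchNorm2d(32),
                nn.ReLU(),
                nn.MaxPool2d(2),                  # halve freq and time
                nn.Conv2d(32, 64, kernel_size=3, padding=1),
                nn.BatchNorm2d(64),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            feat_dim = 64 * (n_mels // 4)         # channels * remaining freq bins
            self.lstm = nn.LSTM(feat_dim, hidden,
                                batch_first=True, bidirectional=True)
            self.fc = nn.Linear(2 * hidden, n_classes)

        def forward(self, spec):                  # spec: (batch, 1, freq, time)
            x = self.conv(spec)                   # (batch, 64, freq/4, time/4)
            b, c, f, t = x.shape
            x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # to (batch, time, feat)
            out, _ = self.lstm(x)                 # (batch, time, 2*hidden)
            return self.fc(out.mean(dim=1))       # pool over time, then classify

    # Example: a batch of 8 utterances, 128 mel bins, 400 frames, with augmentation.
    spec = torch.randn(8, 1, 128, 400)
    spec = vtlp_warp(spec, alpha=1.05)
    logits = CnnLstmSer()(spec)                   # -> shape (8, 4)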
Cite as: Etienne, C., Fidanza, G., Petrovskii, A., Devillers, L., Schmauch, B. (2018) CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation. Proc. Workshop on Speech, Music and Mind 2018, 21-25, DOI: 10.21437/SMM.2018-5.
@inproceedings{Etienne2018,
  author    = {Caroline Etienne and Guillaume Fidanza and Andrei Petrovskii and Laurence Devillers and Benoit Schmauch},
  title     = {CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation},
  year      = {2018},
  booktitle = {Proc. Workshop on Speech, Music and Mind 2018},
  pages     = {21--25},
  doi       = {10.21437/SMM.2018-5},
  url       = {http://dx.doi.org/10.21437/SMM.2018-5}
}