CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation

Caroline Etienne, Guillaume Fidanza, Andrei Petrovskii, Laurence Devillers, Benoit Schmauch


In this work, we design a neural network for recognizing emotions in speech, using the IEMOCAP dataset. Following the latest advances in audio analysis, we use an architecture involving both convolutional layers, for extracting high-level features from raw spectrograms, and recurrent ones for aggregating long-term dependencies. We examine the techniques of data augmentation with vocal tract length perturbation, layer-wise optimizer adjustment, and batch normalization of recurrent layers, and obtain highly competitive results of 64.5% weighted accuracy and 61.7% unweighted accuracy on four emotions.
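The vocal tract length perturbation (VTLP) mentioned above augments training data by warping the frequency axis of each spectrogram by a random factor. The following is a minimal sketch of the idea using a simple linear frequency warp; the exact warp function, the warp-factor range, and the function name `vtlp` are assumptions, not details from the paper.

```python
import numpy as np

def vtlp(spectrogram, alpha):
    """Warp the frequency axis of a (freq_bins, time_frames) spectrogram
    by factor alpha via linear interpolation.

    Note: a simplified linear warp; the paper's exact warping scheme
    may differ (classic VTLP uses a piecewise-linear warp).
    """
    n_freq, _ = spectrogram.shape
    src = np.arange(n_freq, dtype=float)
    # For each output bin, the source position to sample from:
    # alpha > 1 stretches the spectrum upward, alpha < 1 compresses it.
    sample_at = np.clip(src / alpha, 0.0, n_freq - 1.0)
    out = np.empty_like(spectrogram, dtype=float)
    for t in range(spectrogram.shape[1]):
        out[:, t] = np.interp(sample_at, src, spectrogram[:, t])
    return out

# Typical augmentation: draw alpha uniformly near 1 (range is an assumption)
rng = np.random.default_rng(0)
alpha = rng.uniform(0.9, 1.1)
```

In practice, a fresh warp factor would be drawn per utterance at each training epoch, so the network never sees exactly the same spectrogram twice.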


 DOI: 10.21437/SMM.2018-5

Cite as: Etienne, C., Fidanza, G., Petrovskii, A., Devillers, L., Schmauch, B. (2018) CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation. Proc. Workshop on Speech, Music and Mind 2018, 21-25, DOI: 10.21437/SMM.2018-5.


@inproceedings{Etienne2018,
  author={Caroline Etienne and Guillaume Fidanza and Andrei Petrovskii and Laurence Devillers and Benoit Schmauch},
  title={CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation},
  year=2018,
  booktitle={Proc. Workshop on Speech, Music and Mind 2018},
  pages={21--25},
  doi={10.21437/SMM.2018-5},
  url={http://dx.doi.org/10.21437/SMM.2018-5}
}