ISCA Archive Interspeech 2021

Emotion Recognition from Speech Using wav2vec 2.0 Embeddings

Leonardo Pepino, Pablo Riera, Luciana Ferrer

Emotion recognition datasets are relatively small, making the use of deep learning techniques challenging. In this work, we propose a transfer learning method for speech emotion recognition (SER) where features extracted from pre-trained wav2vec 2.0 models are used as input to shallow neural networks to recognize emotions from speech. We propose a way to combine the output of several layers from the pre-trained model, producing richer speech representations than the model’s output alone. We evaluate the proposed approaches on two standard emotion databases, IEMOCAP and RAVDESS, and compare different feature extraction techniques using two wav2vec 2.0 models: a generic one, and one fine-tuned for speech recognition. We also experiment with different shallow architectures for our speech emotion recognition model, and report baseline results using traditional features. Finally, we show that our best-performing models have better average recall than previous approaches that use deep neural networks trained on spectrograms and waveforms or shallow neural networks trained on features extracted from wav2vec 1.0.
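
The abstract does not spell out how the layer outputs are combined, so the following is only an illustrative sketch of one plausible scheme: a softmax-normalized weighted sum over layers, followed by mean pooling over time to obtain a single utterance vector for a shallow classifier. The function names (`softmax`, `combine_layers`, `mean_pool`), the mean-pooling step, and the toy numbers are assumptions made for illustration, not details taken from the paper. In practice the layer weights would be learned jointly with the downstream model.

```python
import math


def softmax(weights):
    """Normalize raw layer weights so they are positive and sum to 1."""
    m = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_outputs, layer_weights):
    """Weighted sum of per-layer features (assumed combination scheme).

    layer_outputs: list of L layers, each a list of T frames,
                   each frame a list of D floats.
    layer_weights: list of L raw (unnormalized) weights.
    Returns a list of T frames with D values each.
    """
    norm = softmax(layer_weights)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    combined = [[0.0] * dim for _ in range(num_frames)]
    for alpha, layer in zip(norm, layer_outputs):
        for t, frame in enumerate(layer):
            for d, value in enumerate(frame):
                combined[t][d] += alpha * value
    return combined


def mean_pool(frames):
    """Average over time to get one utterance-level vector (an assumption)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


# Toy input: 2 layers, 2 frames, 2 feature dimensions.
layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
weights = [0.0, 0.0]  # equal raw weights, so each layer contributes 0.5

combined = combine_layers(layers, weights)
utterance_vector = mean_pool(combined)
```

With equal weights the combination reduces to a plain average of the layers; unequal weights let the downstream model emphasize whichever layers carry the most emotion-relevant information.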


doi: 10.21437/Interspeech.2021-703

Cite as: Pepino, L., Riera, P., Ferrer, L. (2021) Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. Proc. Interspeech 2021, 3400-3404, doi: 10.21437/Interspeech.2021-703

@inproceedings{pepino21_interspeech,
  author={Leonardo Pepino and Pablo Riera and Luciana Ferrer},
  title={{Emotion Recognition from Speech Using wav2vec 2.0 Embeddings}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3400--3404},
  doi={10.21437/Interspeech.2021-703}
}