We present a new implementation of emotion recognition from the paralinguistic
information in speech, based on a deep neural network applied
directly to spectrograms. This method achieves higher recognition
accuracy than previously published results, while also limiting
latency. It processes the speech input in short segments of up to
3 seconds, and splits longer inputs into non-overlapping parts
to reduce the prediction latency.
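As a minimal illustration of this segmentation step (not the authors' code; the segment length, sampling rate, and helper name are assumptions), a waveform can be split into non-overlapping parts of at most 3 seconds as follows:

```python
import numpy as np

def split_into_segments(waveform, sample_rate=16000, max_seconds=3.0):
    """Split a 1-D waveform into non-overlapping segments of at most max_seconds.

    Hypothetical helper sketching the segmentation described above;
    the paper does not specify this exact interface.
    """
    seg_len = int(max_seconds * sample_rate)
    return [waveform[i:i + seg_len] for i in range(0, len(waveform), seg_len)]

# Example: a 10-second input yields four parts (3 s, 3 s, 3 s, 1 s).
audio = np.random.randn(10 * 16000)
print([len(s) / 16000 for s in split_into_segments(audio)])  # [3.0, 3.0, 3.0, 1.0]
```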
The deep network combines common neural network building blocks, namely
convolutional and recurrent layers, which are shown to effectively learn
emotion-related information directly from spectrograms. A convolution-only,
lower-complexity network achieves a prediction accuracy of 66% over four emotions
(tested on IEMOCAP, a common evaluation corpus), while a combined
convolution-LSTM, higher-complexity model achieves 68%.
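As a rough sketch of how convolutional and LSTM layers can be combined over spectrogram segments (layer counts, filter sizes, and input dimensions here are assumptions for illustration; the paper's actual architecture is given in the full text), one possible PyTorch formulation is:

```python
import torch
import torch.nn as nn

class SpectrogramEmotionNet(nn.Module):
    """Illustrative convolution + LSTM classifier over spectrogram segments.

    Layer sizes and shapes are assumptions, not the architecture from the paper.
    """
    def __init__(self, n_freq=128, n_emotions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        self.lstm = nn.LSTM(input_size=64 * (n_freq // 4), hidden_size=128,
                            batch_first=True)
        self.classifier = nn.Linear(128, n_emotions)

    def forward(self, spec):            # spec: (batch, 1, n_freq, time)
        x = self.conv(spec)             # (batch, 64, n_freq/4, time/4)
        x = x.permute(0, 3, 1, 2)       # (batch, time/4, 64, n_freq/4)
        x = x.flatten(2)                # (batch, time/4, 64 * n_freq/4)
        _, (h, _) = self.lstm(x)        # last hidden state summarizes the segment
        return self.classifier(h[-1])   # (batch, n_emotions)

model = SpectrogramEmotionNet()
logits = model(torch.randn(2, 1, 128, 300))  # two spectrogram segments
print(logits.shape)  # torch.Size([2, 4])
```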
Using spectrograms as the speech-representing features enables effective handling
of non-speech background signals such as music (excluding singing) and
crowd noise, even at noise levels comparable to the speech signal
level. By using harmonic modeling to remove non-speech components from
the spectrogram, we demonstrate a significant improvement in emotion
recognition accuracy in the presence of unknown background non-speech
signals.
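The harmonic-modeling step is not detailed in this abstract; as a loose illustration of cleaning a spectrogram by retaining harmonic (speech-like) energy before classification, the sketch below substitutes librosa's harmonic-percussive separation, which is a different technique from the paper's harmonic model:

```python
import numpy as np
import librosa

def clean_spectrogram(waveform, n_fft=512, hop_length=256):
    """Compute a magnitude spectrogram and keep only its harmonic component.

    Illustration only: librosa's HPSS is used here as a simple stand-in for
    the harmonic modeling described in the paper, which works differently.
    """
    stft = librosa.stft(waveform, n_fft=n_fft, hop_length=hop_length)
    magnitude = np.abs(stft)
    harmonic, _percussive = librosa.decompose.hpss(magnitude)
    return harmonic  # (n_fft // 2 + 1, n_frames) matrix fed to the network

# Example on a synthetic 3-second, 16 kHz tone.
t = np.arange(16000 * 3) / 16000
y = 0.1 * np.sin(2 * np.pi * 220 * t).astype(np.float32)
print(clean_spectrogram(y).shape)  # (257, n_frames)
```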
Cite as: Satt, A., Rozenberg, S., Hoory, R. (2017) Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms. Proc. Interspeech 2017, 1089-1093, doi: 10.21437/Interspeech.2017-200
@inproceedings{satt17_interspeech,
  author={Aharon Satt and Shai Rozenberg and Ron Hoory},
  title={{Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1089--1093},
  doi={10.21437/Interspeech.2017-200}
}