This paper analyzes a) how often listeners interpret the emotional content of an utterance incorrectly when listening to vocoded or natural speech in adverse conditions; b) which noise conditions cause the most misperceptions; and c) which group of listeners misinterprets emotions the most. The long-term goal is to construct new emotional speech synthesizers that adapt to the environment and to the listener. We performed a large-scale listening test in which over 400 listeners between the ages of 21 and 72 assessed natural and vocoded acted emotional speech stimuli. The stimuli had been artificially degraded using a room impulse response recorded in a car and various in-car noise types recorded in a real car. Experimental results show that recognition rates for emotions and perceived emotional strength degrade as the signal-to-noise ratio decreases. Interestingly, misperceptions seem to be more pronounced for negative or low-arousal emotions such as anger and calmness, while positive emotions such as happiness appear to be more robust to noise. An analysis of variance (ANOVA) of listener metadata further revealed that gender and age also influenced the results, with elderly male listeners being the most likely to misidentify emotions.
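The abstract describes degrading stimuli by convolving speech with an in-car room impulse response and then adding recorded car noise at a controlled signal-to-noise ratio. The paper's exact pipeline (calibration, weighting, noise segment selection) is not specified in the abstract, so the following is only a minimal illustrative sketch of that general technique; the function name and parameters are hypothetical.

```python
import numpy as np
from scipy.signal import fftconvolve


def degrade(speech, rir, noise, snr_db):
    """Hypothetical sketch: convolve clean speech with a (car) room
    impulse response, then add noise scaled to a target SNR in dB."""
    # Reverberant speech: convolution with the impulse response,
    # trimmed back to the original utterance length.
    reverberant = fftconvolve(speech, rir)[: len(speech)]

    # Loop or trim the noise recording to match the speech length.
    noise = np.resize(noise, len(speech))

    # Choose a gain so that 10 * log10(P_signal / P_noise) == snr_db.
    p_signal = np.mean(reverberant ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))

    return reverberant + gain * noise
```

Lowering `snr_db` in such a setup increases the relative noise power, which matches the abstract's finding that emotion recognition degrades as the signal-to-noise ratio decreases.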
Cite as: Lorenzo-Trueba, J., Valentini-Botinhao, C., Henter, G.E., Yamagishi, J. (2017) Misperceptions of the Emotional Content of Natural and Vocoded Speech in a Car. Proc. Interspeech 2017, 606-610, doi: 10.21437/Interspeech.2017-532
@inproceedings{lorenzotrueba17_interspeech,
  author={Jaime Lorenzo-Trueba and Cassia Valentini-Botinhao and Gustav Eje Henter and Junichi Yamagishi},
  title={{Misperceptions of the Emotional Content of Natural and Vocoded Speech in a Car}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={606--610},
  doi={10.21437/Interspeech.2017-532}
}