Researchers have recently started to study how the emotional speech heard by young infants can affect their developmental outcomes. As a part of this research, hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia in the context of the so-called APPLE study. In order to analyze the emotional content of speech in such a massive dataset, an automatic speech emotion recognition (SER) system is required. However, there are no emotion labels or existing in-domain SER systems that could be used for this purpose. In this paper, we introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional SER system for the Finnish subset of the data. We explore the effectiveness of alternative state-of-the-art techniques for deploying an SER system in a new domain, comparing cross-corpus generalization, WGAN-based domain adaptation, and active learning on the task. As a result, we show that the best-performing models achieve a binary classification performance of 73.4% unweighted average recall (UAR) for valence and 73.2% UAR for arousal. The results also show that active learning achieves the most consistent performance of the three alternatives.
Cite as: Vaaras, E., Ahlqvist-Björkroth, S., Drossos, K., Räsänen, O. (2021) Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit. Proc. Interspeech 2021, 3380-3384, doi: 10.21437/Interspeech.2021-303
@inproceedings{vaaras21_interspeech,
  author={Einari Vaaras and Sari Ahlqvist-Björkroth and Konstantinos Drossos and Okko Räsänen},
  title={{Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={3380--3384},
  doi={10.21437/Interspeech.2021-303}
}