Social signal detection, that is, the task of identifying vocalizations such as laughter and filler events, is a popular task within computational paralinguistics. Recent studies have shown that, besides applying state-of-the-art machine learning methods, it is worth making use of contextual information and adjusting the frame-level scores based on the local neighbourhood. In this study we apply a weighted average time series smoothing filter for laughter and filler event identification, and set the weights using a state-of-the-art optimization method, namely the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Our results indicate that this is a viable way of improving the Area Under the Curve (AUC) scores: our resulting scores are much better than those of the raw likelihoods produced by Deep Neural Networks trained on three different feature sets, and we also significantly outperform standard time series filters as well as DNNs used for smoothing. Our score achieved on the test set of a public English database containing spontaneous mobile phone conversations is the highest one published so far that was realized by feed-forward techniques.
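The idea of smoothing frame-level scores with a weighted average filter and tuning the filter weights for AUC can be illustrated with a minimal sketch. This is not the author's implementation. It assumes a symmetric window centred on each frame, edge handling by repeating the first and last score, and strictly positive weights; none of these details are given in the abstract. A simple (1+1) evolution strategy stands in for CMA-ES to keep the example dependency-free, and the names `smooth`, `auc`, `optimise_weights` and `demo` are hypothetical.

```python
import random


def smooth(scores, weights):
    """Weighted moving average centred on each frame.

    Assumption: edges are handled by repeating the first/last score.
    """
    half = len(weights) // 2
    total = sum(weights)
    n = len(scores)
    out = []
    for t in range(n):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = min(max(t + k - half, 0), n - 1)  # clamp index at the edges
            acc += w * scores[idx]
        out.append(acc / total)
    return out


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def optimise_weights(scores, labels, width=5, iters=100, sigma=0.2, seed=0):
    """(1+1) evolution strategy, used here as a simple stand-in for CMA-ES.

    Starts from the unweighted moving average and keeps a mutated weight
    vector whenever it does not lower the AUC of the smoothed scores.
    """
    rng = random.Random(seed)
    best = [1.0] * width
    best_auc = auc(smooth(scores, best), labels)
    for _ in range(iters):
        # Assumption: weights are kept strictly positive by clamping.
        cand = [max(1e-6, w + rng.gauss(0.0, sigma)) for w in best]
        cand_auc = auc(smooth(scores, cand), labels)
        if cand_auc >= best_auc:
            best, best_auc = cand, cand_auc
    return best, best_auc


def demo():
    """Synthetic example: one 40-frame event inside 200 noisy frames."""
    rng = random.Random(1)
    labels = [1 if 80 <= t < 120 else 0 for t in range(200)]
    scores = [0.6 * y + 0.2 + rng.gauss(0.0, 0.3) for y in labels]
    raw = auc(scores, labels)
    weights, tuned = optimise_weights(scores, labels)
    print(f"raw AUC {raw:.3f}, smoothed AUC {tuned:.3f}")


if __name__ == "__main__":
    demo()
```

In practice the weights would be tuned on a development set and the AUC reported on a held-out test set; the sketch tunes and evaluates on the same synthetic sequence only to stay short.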
Cite as: Gosztolya, G. (2017) Optimized Time Series Filters for Detecting Laughter and Filler Events. Proc. Interspeech 2017, 2376-2380, doi: 10.21437/Interspeech.2017-932
@inproceedings{gosztolya17_interspeech,
  author={Gábor Gosztolya},
  title={{Optimized Time Series Filters for Detecting Laughter and Filler Events}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2376--2380},
  doi={10.21437/Interspeech.2017-932}
}