We address the challenge of improving live end-of-turn detection for situated spoken dialogue systems. While silence thresholds have traditionally been used to detect the user's end-of-turn, such an approach limits the system's potential fluidity in interaction, restricting it to a purely reactive paradigm. By contrast, here we present a system which takes a predictive approach: the user's end-of-turn is predicted live as acoustic features and words are consumed by the system. We compare the benefits of live lexical and acoustic information through feature analysis and by testing equivalent models with different feature sets within a common deep learning architecture, a Long Short-Term Memory (LSTM) network. We show the usefulness of incremental enriched language model features in particular. Training and testing on Wizard-of-Oz data collected to train an agent in a simple virtual world, we successfully improve over a reactive baseline in terms of reducing latency whilst minimising the cut-in rate.
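To illustrate the kind of model described above, the following is a minimal sketch (not the authors' implementation) of a frame-level LSTM end-of-turn classifier in PyTorch. The feature dimensionality, hidden size, frame rate, and decision threshold are placeholder assumptions, not values from the paper.

import torch
import torch.nn as nn

class EOTPredictor(nn.Module):
    """Frame-level end-of-turn classifier: an LSTM consumes a stream of
    per-frame feature vectors (e.g. acoustic features, optionally
    concatenated with lexical/language-model features) and emits, for
    every frame, the probability that the user's turn ends there.
    Dimensions are illustrative assumptions."""
    def __init__(self, feat_dim=40, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, frames, state=None):
        # frames: (batch, time, feat_dim); state carries the LSTM memory
        # across successive chunks so the model can be run incrementally.
        hidden, state = self.lstm(frames, state)
        probs = torch.sigmoid(self.out(hidden)).squeeze(-1)  # (batch, time)
        return probs, state

# Incremental use: feed each new feature frame as it arrives and flag an
# end-of-turn as soon as the predicted probability crosses a threshold,
# rather than waiting for a fixed silence duration.
model = EOTPredictor()
state = None
threshold = 0.5  # placeholder; a deployed system would tune this on held-out data
frame = torch.randn(1, 1, 40)  # one new feature frame (batch=1, time=1)
prob, state = model(frame, state)
if prob[0, -1].item() > threshold:
    print("predicted end of turn")

Run incrementally in this way, the classifier can anticipate the end of turn before a silence-threshold baseline would fire, which is the latency/cut-in trade-off the abstract refers to.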
Cite as: Maier, A., Hough, J., Schlangen, D. (2017) Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems. Proc. Interspeech 2017, 1676-1680, doi: 10.21437/Interspeech.2017-1593
@inproceedings{maier17_interspeech,
  author={Angelika Maier and Julian Hough and David Schlangen},
  title={{Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1676--1680},
  doi={10.21437/Interspeech.2017-1593}
}