This paper presents a novel modeling approach called stacked time-asynchronous sequential networks (STASNs) for online end-of-turn detection. Online end-of-turn detection, which determines turn-taking points in real time, is an essential component of human-computer interaction systems. In this study, we use long-range sequential information from multiple time-asynchronous sequential features, such as prosodic, phonetic, and lexical sequential features, to enhance online end-of-turn detection performance. Our key idea is to embed each sequential feature in a fixed-length continuous representation using a sequential network. This enables us to handle multiple time-asynchronous sequential features simultaneously for end-of-turn detection. By stacking multiple sequential networks, STASNs can embed all of the sequential information between the start of a conversation and the current end-of-utterance in a fixed-length continuous representation that can be directly used for classification. Experiments show that STASNs outperform conventional models that use only limited sequential information. Furthermore, STASNs with senone bottleneck features, extracted by senone-based deep neural networks, achieve superior performance without requiring lexical features decoded by an automatic speech recognition process.
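The following is a minimal sketch of the stacked architecture described above, assuming PyTorch and LSTM encoders: each time-asynchronous stream (prosodic, phonetic, lexical) is encoded into a fixed-length embedding, the embeddings are concatenated per utterance, and a higher-level sequential network runs over the dialog history up to the current end-of-utterance. All layer sizes, feature dimensions, and the choice of final hidden states as embeddings are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class UtteranceEncoder(nn.Module):
    """Embeds one variable-length feature sequence (e.g. prosodic frames)
    into a fixed-length vector via the LSTM's final hidden state."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, time, input_dim); lengths may differ across streams.
        _, (h_n, _) = self.lstm(seq)
        return h_n[-1]  # (batch, hidden_dim)


class STASN(nn.Module):
    """Stacks a dialog-level LSTM over the concatenated fixed-length embeddings
    of multiple time-asynchronous streams, then classifies end-of-turn vs. not."""

    def __init__(self, stream_dims, hidden_dim=64, dialog_dim=64):
        super().__init__()
        self.encoders = nn.ModuleList(
            [UtteranceEncoder(d, hidden_dim) for d in stream_dims]
        )
        self.dialog_lstm = nn.LSTM(hidden_dim * len(stream_dims), dialog_dim,
                                   batch_first=True)
        self.classifier = nn.Linear(dialog_dim, 2)  # end-of-turn vs. continuation

    def forward(self, utterances):
        # utterances: dialog history from start-of-conversation to the current
        # end-of-utterance; each item is a list of per-stream tensors, e.g.
        # [prosodic (1, T_p, D_p), phonetic (1, T_f, D_f), lexical (1, T_l, D_l)].
        embeddings = []
        for streams in utterances:
            embeddings.append(torch.cat(
                [enc(s) for enc, s in zip(self.encoders, streams)], dim=-1))
        history = torch.stack(embeddings, dim=1)  # (1, num_utterances, concat_dim)
        _, (h_n, _) = self.dialog_lstm(history)
        return self.classifier(h_n[-1])  # logits at the current end-of-utterance


if __name__ == "__main__":
    # Hypothetical per-stream dimensions for prosodic, phonetic, and lexical features.
    model = STASN(stream_dims=[4, 40, 128])
    dialog = [
        [torch.randn(1, 50, 4), torch.randn(1, 30, 40), torch.randn(1, 8, 128)],
        [torch.randn(1, 70, 4), torch.randn(1, 45, 40), torch.randn(1, 12, 128)],
    ]
    print(model(dialog).shape)  # torch.Size([1, 2])
```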
Cite as: Masumura, R., Asami, T., Masataki, H., Ishii, R., Higashinaka, R. (2017) Online End-of-Turn Detection from Speech Based on Stacked Time-Asynchronous Sequential Networks. Proc. Interspeech 2017, 1661-1665, doi: 10.21437/Interspeech.2017-651
@inproceedings{masumura17_interspeech,
  author={Ryo Masumura and Taichi Asami and Hirokazu Masataki and Ryo Ishii and Ryuichiro Higashinaka},
  title={{Online End-of-Turn Detection from Speech Based on Stacked Time-Asynchronous Sequential Networks}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={1661--1665},
  doi={10.21437/Interspeech.2017-651}
}