Automatically detecting the end of a device-directed user request is particularly challenging when users alternate between short commands and long free-form utterances. While low-latency end-pointing configurations typically lead to a good user experience for short requests, such as “play music”, they can be too aggressive in domains with longer free-form queries, where users tend to pause noticeably between words and are hence easily cut off prematurely. We previously proposed an approach for accurate end-pointing by continuously estimating pause duration features over all active recognition hypotheses. In this paper, we study the behavior of these pause duration features and infer domain-dependent parametrizations. We furthermore propose to adapt the end-pointer aggressiveness on the fly by comparing the Viterbi scores of active short command vs. long free-form decoding hypotheses. The experimental evaluation shows an 18% relative reduction in word error rate on free-form requests while maintaining low latency on short queries.
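The adaptive idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Hypothesis` fields, the domain labels, the score-weighted pause estimate, and the two threshold values are all assumptions chosen for illustration. It only shows how a pause threshold could be switched depending on whether the best-scoring active hypothesis is a short command or a free-form one.

```python
import math
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One active decoding hypothesis (hypothetical structure)."""
    domain: str               # "command" or "freeform" (assumed labels)
    log_score: float          # Viterbi log-score of the partial path
    trailing_pause_ms: float  # silence observed at the end of this path


def expected_pause_ms(hyps):
    """Score-weighted average trailing pause over all active hypotheses."""
    best = max(h.log_score for h in hyps)
    # Subtract the best score before exponentiating for numerical stability.
    weights = [math.exp(h.log_score - best) for h in hyps]
    total = sum(weights)
    return sum(w * h.trailing_pause_ms for w, h in zip(weights, hyps)) / total


def pause_threshold_ms(hyps, short_ms=300.0, long_ms=900.0):
    """Pick an aggressive or a patient threshold (values are placeholders)."""
    neg_inf = float("-inf")
    best_cmd = max((h.log_score for h in hyps if h.domain == "command"),
                   default=neg_inf)
    best_free = max((h.log_score for h in hyps if h.domain == "freeform"),
                    default=neg_inf)
    # Command hypotheses winning suggests a short request: end-point sooner.
    return short_ms if best_cmd >= best_free else long_ms


def should_endpoint(hyps):
    """Declare end of utterance once the expected pause exceeds the threshold."""
    if not hyps:
        return False
    return expected_pause_ms(hyps) >= pause_threshold_ms(hyps)
```

With the same 400 ms of trailing silence, this sketch end-points when a command hypothesis has the better score and keeps listening when a free-form hypothesis does, which is the trade-off between latency and premature cut-offs that the abstract describes.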
Cite as: Maas, R., Rastrow, A., Goehner, K., Tiwari, G., Joseph, S., Hoffmeister, B. (2017) Domain-Specific Utterance End-Point Detection for Speech Recognition. Proc. Interspeech 2017, 1943-1947, doi: 10.21437/Interspeech.2017-1673
@inproceedings{maas17_interspeech,
  author={Roland Maas and Ariya Rastrow and Kyle Goehner and Gautam Tiwari and Shaun Joseph and Björn Hoffmeister},
  title={{Domain-Specific Utterance End-Point Detection for Speech Recognition}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1943--1947},
  doi={10.21437/Interspeech.2017-1673}
}