Domain-Specific Utterance End-Point Detection for Speech Recognition

Roland Maas, Ariya Rastrow, Kyle Goehner, Gautam Tiwari, Shaun Joseph, Björn Hoffmeister


The task of automatically detecting the end of a device-directed user request is particularly challenging when users switch between short commands and long free-form utterances. While low-latency end-pointing configurations typically lead to a good user experience for short requests, such as “play music”, they can be too aggressive in domains with longer free-form queries, where users tend to pause noticeably between words and hence are easily cut off prematurely. We previously proposed an approach for accurate end-pointing by continuously estimating pause duration features over all active recognition hypotheses. In this paper, we study the behavior of these pause duration features and infer domain-dependent parametrizations. We furthermore propose to adapt the end-pointer aggressiveness on-the-fly by comparing the Viterbi scores of active short command vs. long free-form decoding hypotheses. The experimental evaluation shows an 18% relative reduction in word error rate on free-form requests while maintaining low latency on short queries.
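The idea of switching end-pointer aggressiveness by comparing hypothesis scores can be illustrated with a minimal sketch. It is not the authors' implementation: the `Hypothesis` fields, the domain labels, the threshold values in `PAUSE_THRESHOLD`, and the score-weighted pause estimate are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class Hypothesis:
    domain: str           # hypothetical label: "command" or "free_form"
    viterbi_score: float  # assumed log-domain score, higher is better
    pause_frames: int     # assumed trailing non-speech frames on this path


# Hypothetical domain-dependent pause thresholds, in frames.
PAUSE_THRESHOLD = {"command": 30, "free_form": 80}


def expected_pause(hyps):
    """Score-weighted average pause duration over all active hypotheses."""
    best = max(h.viterbi_score for h in hyps)
    # Subtract the best score before exponentiating for numerical stability.
    weights = [math.exp(h.viterbi_score - best) for h in hyps]
    total = sum(weights)
    return sum(w * h.pause_frames for w, h in zip(weights, hyps)) / total


def dominant_domain(hyps):
    """Return the domain whose best active hypothesis has the highest score."""
    best = {}
    for h in hyps:
        best[h.domain] = max(best.get(h.domain, -math.inf), h.viterbi_score)
    return max(best, key=best.get)


def should_end_point(hyps):
    """End-point once the expected pause reaches the dominant domain's threshold."""
    if not hyps:
        return False
    return expected_pause(hyps) >= PAUSE_THRESHOLD[dominant_domain(hyps)]
```

Under these assumed thresholds, a 40-frame pause triggers the end-point when the best-scoring hypothesis is a short command, but not when a free-form hypothesis scores highest, which is the behavior the abstract describes qualitatively.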


DOI: 10.21437/Interspeech.2017-1673

Cite as: Maas, R., Rastrow, A., Goehner, K., Tiwari, G., Joseph, S., Hoffmeister, B. (2017) Domain-Specific Utterance End-Point Detection for Speech Recognition. Proc. Interspeech 2017, 1943-1947, DOI: 10.21437/Interspeech.2017-1673.


@inproceedings{Maas2017,
  author={Roland Maas and Ariya Rastrow and Kyle Goehner and Gautam Tiwari and Shaun Joseph and Björn Hoffmeister},
  title={Domain-Specific Utterance End-Point Detection for Speech Recognition},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1943--1947},
  doi={10.21437/Interspeech.2017-1673},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1673}
}