Improved End-of-Query Detection for Streaming Speech Recognition

Matt Shannon, Gabor Simko, Shuo-Yiin Chang, Carolina Parada


In many streaming speech recognition applications such as voice search, it is important to determine quickly and accurately when the user has finished speaking their query. A conventional approach to this task is to declare end-of-query whenever a fixed interval of silence is detected by a voice activity detector (VAD) trained to classify each frame as speech or silence. However, silence detection and end-of-query detection are fundamentally different tasks, and the criterion used during VAD training may not be optimal. In particular, the conventional approach ignores potential acoustic cues such as filler sounds and past speaking rate which may indicate whether a given pause is temporary or query-final. In this paper we present a simple modification to make the conventional VAD training criterion more closely related to end-of-query detection. A unidirectional long short-term memory architecture allows the system to remember past acoustic events, and the training criterion incentivizes the system to learn to use any acoustic cues relevant to predicting future user intent. We show experimentally that this approach improves latency at a given accuracy by around 100 ms for end-of-query detection for voice search.
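The conventional baseline the abstract describes — declaring end-of-query once a VAD has emitted a fixed run of consecutive silence frames — can be sketched as follows. This is an illustrative reconstruction, not code from the paper; the function name, frame rate, and threshold value are assumptions.

```python
def endpoint_frame(is_speech, silence_frames_required=60):
    """Return the index of the frame at which end-of-query is declared
    under the fixed-silence-interval rule, or None if the threshold
    is never reached.

    is_speech: per-frame booleans from a VAD (True = speech frame).
    silence_frames_required: illustrative threshold, e.g. 60 frames
        of 10 ms each, i.e. a 600 ms silence interval.
    """
    silence_run = 0
    for i, speech in enumerate(is_speech):
        # Any speech frame resets the silence counter; a pause only
        # triggers endpointing once it grows past the fixed interval.
        silence_run = 0 if speech else silence_run + 1
        if silence_run >= silence_frames_required:
            return i
    return None
```

Note that this rule treats every pause identically: a 300 ms hesitation mid-query and a 300 ms pause after the final word look the same to it, which is exactly the limitation the paper's modified training criterion addresses.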


DOI: 10.21437/Interspeech.2017-496

Cite as: Shannon, M., Simko, G., Chang, S.-Y., Parada, C. (2017) Improved End-of-Query Detection for Streaming Speech Recognition. Proc. Interspeech 2017, 1909-1913, DOI: 10.21437/Interspeech.2017-496.


@inproceedings{Shannon2017,
  author={Matt Shannon and Gabor Simko and Shuo-Yiin Chang and Carolina Parada},
  title={Improved End-of-Query Detection for Streaming Speech Recognition},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1909--1913},
  doi={10.21437/Interspeech.2017-496},
  url={http://dx.doi.org/10.21437/Interspeech.2017-496}
}