In an online automatic speech recognition system, the role of the endpoint detector is to infer when a user has finished speaking a query. Accurate and low-latency endpoint detection is crucial for natural voice interaction. Classic approaches based on a voice activity detector (VAD) monitor the incoming audio and trigger when a sufficiently long pause is detected. Such approaches are typically limited by their inability to distinguish within-sentence pauses from end-of-sentence pauses. In this paper, we propose an endpoint detection algorithm that is integrated with the speech recognition process and leverages acoustic and language model information to make this distinction. Unlike other integrated approaches, which are based on the highest-scoring active recognition hypothesis, the proposed algorithm computes the expected pause duration over all active hypotheses, which leads to a more reliable pause duration prediction. We show that our method achieves significantly higher accuracy and lower latency than standard approaches to endpoint detection.
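The idea of an expected pause duration over all active hypotheses can be illustrated with a minimal sketch. The abstract does not give the paper's formulation, so everything below is an assumption made for illustration: the `Hypothesis` fields, the softmax normalization of log scores into weights, and the threshold value are hypothetical and are not taken from the authors' method.

```python
import math
from dataclasses import dataclass


@dataclass
class Hypothesis:
    # Hypothetical fields: a combined acoustic + language model log score,
    # and the number of consecutive trailing non-speech frames on this path.
    log_score: float
    pause_frames: int


def expected_pause_frames(hyps):
    """Score-weighted average of trailing pause length over all hypotheses."""
    if not hyps:
        return 0.0
    # Subtract the maximum log score before exponentiating for stability.
    max_log = max(h.log_score for h in hyps)
    weights = [math.exp(h.log_score - max_log) for h in hyps]
    total = sum(weights)
    return sum(w * h.pause_frames for w, h in zip(weights, hyps)) / total


def best_path_pause_frames(hyps):
    """Trailing pause length of the single highest-scoring hypothesis."""
    if not hyps:
        return 0
    return max(hyps, key=lambda h: h.log_score).pause_frames


def should_endpoint(hyps, threshold_frames=50.0):
    """Trigger once the expected pause reaches an assumed threshold."""
    return expected_pause_frames(hyps) >= threshold_frames


# Two equally likely hypotheses that disagree about whether the user paused.
hyps = [Hypothesis(log_score=0.0, pause_frames=60),
        Hypothesis(log_score=0.0, pause_frames=0)]
print(best_path_pause_frames(hyps))  # pause according to the best path only
print(expected_pause_frames(hyps))   # pause averaged over both hypotheses
print(should_endpoint(hyps))
```

In this toy case a decision based on the best path alone would see a long pause, while the weighted average is pulled down by the competing hypothesis in which the user is still speaking. That is one plausible reading of why averaging over hypotheses could give a more reliable estimate.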
Bibliographic reference. Liu, Baiyang / Hoffmeister, Bjorn / Rastrow, Ariya (2015): "Accurate endpointing with expected pause duration", In INTERSPEECH-2015, 2912-2916.