Appropriate turn-taking is important in spoken dialogue systems as well as generating correct responses. We have developed a method that performs a posteriori restoration of incorrectly segmented utterances caused by erroneous voice activity detection (VAD), which result in automatic speech recognition (ASR) errors and inappropriate turn-taking. A crucial part of the method is to classify whether the restoration is required or not. We cast it as a binary classification problem detecting originally single utterances from pairs of utterance fragments. Various features are used representing timing, prosody, and ASR result information to improve its accuracy. Furthermore, two kinds of feature selection are performed to obtain effective and domain-independent features. The experimental results showed that the proposed method outperformed a baseline with manually-selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant and domain-independent features were utterance intervals and results from the Gaussian mixture model (GMM).
Bibliographic reference. Hotta, Naoki / Komatani, Kazunori / Sato, Satoshi / Nakano, Mikio (2014): "Detecting incorrectly-segmented utterances for posteriori restoration of turn-taking and ASR results", In INTERSPEECH-2014, 313-317.