15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Detecting Incorrectly-Segmented Utterances for Posteriori Restoration of Turn-Taking and ASR Results

Naoki Hotta (1), Kazunori Komatani (1), Satoshi Sato (1), Mikio Nakano (2)

(1) Nagoya University, Japan
(2) Honda Research Institute Japan, Japan

Appropriate turn-taking is as important in spoken dialogue systems as generating correct responses. We have developed a method that performs a posteriori restoration of utterances that were incorrectly segmented by erroneous voice activity detection (VAD); such segmentation errors result in automatic speech recognition (ASR) errors and inappropriate turn-taking. A crucial part of the method is deciding whether restoration is required. We cast this as a binary classification problem: detecting, from pairs of utterance fragments, those that were originally a single utterance. To improve classification accuracy, we use various features representing timing, prosody, and ASR result information. Furthermore, two kinds of feature selection are performed to obtain effective and domain-independent features. Experimental results showed that the proposed method outperformed a baseline with manually selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant, domain-independent features were utterance intervals and results from the Gaussian mixture model (GMM).
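
The classification step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the classifier or give the exact feature definitions, so everything below is an assumption made for illustration: the `Fragment` fields, the four features in `pair_features`, the toy data, and the use of a plain logistic regression trained by stochastic gradient descent. Prosodic and GMM-based features are omitted. It is not the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Fragment:
    """One VAD-segmented utterance fragment (all fields are hypothetical)."""
    start: float           # start time in seconds
    end: float             # end time in seconds
    asr_confidence: float  # assumed ASR confidence score in [0, 1]


def pair_features(first, second):
    """Build an illustrative feature vector for two consecutive fragments."""
    return [
        second.start - first.end,    # interval between the fragments (timing)
        first.end - first.start,     # duration of the first fragment
        second.end - second.start,   # duration of the second fragment
        min(first.asr_confidence, second.asr_confidence),  # ASR-result feature
    ]


def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - target  # gradient of the log loss with respect to z
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def is_single_utterance(w, b, first, second):
    """Return True if the pair is classified as one originally single utterance."""
    x = pair_features(first, second)
    return sum(wi * xi for wi, xi in zip(w, x)) + b >= 0.0


# Toy training pairs (invented): label 1 = originally a single utterance.
TOY_PAIRS = [
    (Fragment(0.0, 0.8, 0.9), Fragment(0.9, 1.6, 0.8), 1),
    (Fragment(2.0, 2.5, 0.7), Fragment(2.7, 3.5, 0.9), 1),
    (Fragment(5.0, 5.9, 0.6), Fragment(6.05, 6.6, 0.7), 1),
    (Fragment(0.0, 0.8, 0.9), Fragment(1.8, 2.5, 0.8), 0),
    (Fragment(2.0, 2.5, 0.7), Fragment(4.0, 4.8, 0.9), 0),
    (Fragment(5.0, 5.9, 0.6), Fragment(7.9, 8.45, 0.7), 0),
]

X = [pair_features(f, s) for f, s, _ in TOY_PAIRS]
y = [label for _, _, label in TOY_PAIRS]
w, b = train_logistic(X, y)

# Classify an unseen pair separated by a short pause.
print(is_single_utterance(w, b, Fragment(0.0, 0.7, 0.8), Fragment(0.82, 1.42, 0.75)))
```

In this toy data only the interval separates the two classes, which loosely mirrors the abstract's finding that utterance intervals were among the dominant features; with real data the other feature groups would carry information as well.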

Bibliographic reference.  Hotta, Naoki / Komatani, Kazunori / Sato, Satoshi / Nakano, Mikio (2014): "Detecting incorrectly-segmented utterances for posteriori restoration of turn-taking and ASR results", In INTERSPEECH-2014, 313-317.