EUROSPEECH 2003 - INTERSPEECH 2003
In this paper, we propose a novel method of normalizing voice quality within an utterance, for both clean speech and speech contaminated by noise. The normalization is applied to the N-best hypotheses from an HMM-based classifier; an SM (Sub-space Method)-based verifier then tests the hypotheses after normalizing the monophone scores together with the HMM-based likelihood score. The HMM-SM-based speech recognition system was proposed previously [1, 2] and successfully applied to a speaker-independent word recognition task and an OOV word rejection task. We extend the system to a connected digit string recognition task by exploring the effect of voice quality normalization within an utterance for robust ASR, and compare it with HMM-based recognition systems using utterance-level, word-level, monophone-level, and state-level normalization. Experiments on connected 4-digit strings showed that word accuracy improved significantly: from 95.7% with the typical HMM-based system using utterance-level normalization to 98.2% with the HMM-SM-based system for clean speech, from 88.1% to 91.5% for noise-added speech at SNR = 10 dB, and from 72.4% to 76.4% at SNR = 5 dB, while the other HMM-based normalization variants also performed worse.
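The abstract describes rescoring N-best hypotheses by combining an HMM likelihood with a normalized verification score. The sketch below is only an illustration of that general idea, not the authors' method: the z-score normalization, the interpolation weight `alpha`, and the function names are all assumptions introduced for the example.

```python
def z_normalize(scores):
    """Normalize a list of scores to zero mean, unit variance (illustrative
    stand-in for the paper's utterance-level score normalization)."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = var ** 0.5
    if std == 0:
        std = 1.0  # avoid division by zero when all scores are equal
    return [(s - mean) / std for s in scores]

def rescore_nbest(hypotheses, hmm_scores, verifier_scores, alpha=0.5):
    """Re-rank N-best hypotheses by a weighted sum of the normalized HMM
    likelihood and the normalized verifier score. `alpha` is a hypothetical
    interpolation weight, not a value from the paper."""
    h_norm = z_normalize(hmm_scores)
    v_norm = z_normalize(verifier_scores)
    combined = [alpha * h + (1 - alpha) * v for h, v in zip(h_norm, v_norm)]
    order = sorted(range(len(hypotheses)),
                   key=lambda i: combined[i], reverse=True)
    return [hypotheses[i] for i in order]
```

For example, `rescore_nbest(["1234", "1235", "1294"], [10.0, 5.0, 1.0], [0.9, 0.5, 0.1])` ranks `"1234"` first, since both score streams agree on it.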
Bibliographic reference. Ghulam, Muhammad / Fukuda, Takashi / Nitta, Tsuneo (2003): "Voice quality normalization in an utterance for robust ASR", In EUROSPEECH-2003, 2173-2176.