8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Automatic Recognition of Connected Vowels Only Using Speaker-Invariant Representation of Speech Dynamics

Satoshi Asakawa, Nobuaki Minematsu, Keikichi Hirose

University of Tokyo, Japan

Speech acoustics vary with gender, age, microphone, room, lines, and a variety of other factors. To deal with these inevitable non-linguistic variations, speech recognition research has collected speech from thousands of speakers in different acoustic conditions to train acoustic models of individual phonemes. Recently, a novel representation of speech dynamics was proposed [1, 2], in which the above non-linguistic factors are effectively removed from speech, much as pitch information is removed from a spectrum by smoothing it. This representation captures only speaker- and microphone-invariant speech dynamics and uses no absolute or static acoustic properties such as spectra. If such properties were used, speaker identity would inevitably remain in the speech representation. In our previous study, the new representation was applied to recognizing a sequence of isolated vowels [3]. With only a single training speaker, the proposed method outperformed conventional HMMs trained with more than four thousand speakers, even in the case of noisy speech. The current paper shows the initial results of applying the dynamic representation to recognizing continuous speech, specifically connected vowels.


Bibliographic reference.  Asakawa, Satoshi / Minematsu, Nobuaki / Hirose, Keikichi (2007): "Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics", In INTERSPEECH-2007, 890-893.