ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition

April 13-16, 2003
Tokyo Institute of Technology, Tokyo, Japan

Effects of Acoustic and Language Knowledge of Human and Automatic Speech Recognizer on Spontaneous Speech Perception/Recognition

Norihide Kitaoka, Masahisa Shingu, Seiichi Nakagawa

Department of Information and Computer Sciences, Toyohashi University of Technology, Aichi, Japan

An automatic speech recognizer uses both acoustic and linguistic knowledge in large vocabulary speech recognition: acoustic knowledge is modeled by hidden Markov models (HMMs), linguistic knowledge is modeled by an N-gram (typically a bigram or trigram), and the two models are stochastically integrated. It is thought that humans also integrate acoustic and linguistic knowledge when perceiving continuous speech, so automatic speech recognition with HMMs and N-grams can be regarded as a rough model of the human perception process.
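
As a rough illustration of this integration (a minimal sketch, not the authors' implementation), the snippet below combines an HMM acoustic log-likelihood with an N-gram language model log-probability in the usual log-linear way; the score values, weight, and function names are hypothetical.

    # Hypothetical per-hypothesis scores in the log domain, as produced by an
    # HMM acoustic model and an N-gram language model (values are made up).
    def combined_score(acoustic_logprob, lm_logprob, lm_weight=10.0, word_penalty=0.0):
        """Log-linear integration commonly used in HMM/N-gram decoders:
        total = log P_ac(X|W) + lm_weight * log P_lm(W) + word_penalty.
        lm_weight and word_penalty are tuning constants, not values from the paper."""
        return acoustic_logprob + lm_weight * lm_logprob + word_penalty

    # Example: pick the better of two competing word-sequence hypotheses.
    hypotheses = {
        "hyp_a": {"acoustic": -120.4, "lm": -8.1},
        "hyp_b": {"acoustic": -118.9, "lm": -11.7},
    }
    best = max(hypotheses,
               key=lambda h: combined_score(hypotheses[h]["acoustic"],
                                            hypotheses[h]["lm"]))
    print(best)  # -> "hyp_a" under these toy scores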

Although these models have drastically improved the performance of automatic recognition of well-formed read speech, they cannot deliver sufficient performance on spontaneous speech recognition tasks because of various phenomena particular to spontaneous speech.

In this paper, we conducted simulation experiments of N-gram language models by combining human acoustic knowledge with instructions about local context, and confirmed that the two words neighboring the target word were enough to improve recognition performance when only local information was available as linguistic knowledge. We also confirmed that coarticulation affected the perception of syllables.
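
As a toy illustration of the two-neighboring-words condition (the words and probability tables below are invented, not taken from the experiments), a masked target word can be scored using only bigram probabilities with its left and right neighbors.

    # Toy bigram table; in practice these would come from a trained N-gram model.
    bigram = {
        ("I", "saw"): 0.02, ("I", "sat"): 0.01,
        ("saw", "her"): 0.03, ("sat", "her"): 0.001,
    }

    def local_context_score(left, candidate, right):
        # Approximate P(candidate | left) * P(right | candidate) from the bigram table,
        # i.e. only the two neighboring words constrain the target.
        return bigram.get((left, candidate), 1e-6) * bigram.get((candidate, right), 1e-6)

    candidates = ["saw", "sat"]
    best = max(candidates, key=lambda w: local_context_score("I", w, "her"))
    print(best)  # -> "saw" under these toy probabilities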

We then compared several language models on a speech recognizer. Acoustic scores were calculated with HMMs, and linguistic scores calculated from a language model were then added. We obtained a 37.5% recognition rate with the acoustic model alone and 51.0% with both the acoustic and language models, a relative improvement of 36%. With the language model alone we obtained a 16.5% recognition rate, so adding the acoustic model gave a relative improvement of 209%. Thus, improving the acoustic model is more effective than improving the language model.
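
The relative-improvement figures follow directly from the quoted recognition rates; the short calculation below reproduces them.

    # Relative-improvement arithmetic behind the figures quoted above.
    acoustic_only = 37.5   # recognition rate (%) with the acoustic model alone
    lm_only = 16.5         # recognition rate (%) with the language model alone
    combined = 51.0        # recognition rate (%) with both models

    rel_gain_from_lm = (combined - acoustic_only) / acoustic_only  # ~0.36
    rel_gain_from_am = (combined - lm_only) / lm_only              # ~2.09
    print(f"{rel_gain_from_lm:.0%}, {rel_gain_from_am:.0%}")       # 36%, 209%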



Bibliographic reference.  Kitaoka, Norihide / Shingu, Masahisa / Nakagawa, Seiichi (2003): "Effects of acoustic and language knowledge of human and automatic speech recognizer on spontaneous speech perception/recognition", in SSPR-2003, paper MAP14.