8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003


Comparison of Effects of Acoustic and Language Knowledge on Spontaneous Speech Perception/Recognition Between Human and Automatic Speech Recognizer

Norihide Kitaoka, Masahisa Shingu, Seiichi Nakagawa

Toyohashi University of Technology, Japan

An automatic speech recognizer uses both acoustic and linguistic knowledge. In large-vocabulary speech recognition, acoustic knowledge is modeled by hidden Markov models (HMMs), linguistic knowledge is modeled by N-gram models (typically bigrams or trigrams), and the two are integrated stochastically. Humans are also thought to integrate acoustic and linguistic knowledge when perceiving continuous speech. Automatic speech recognition with HMMs and N-grams is thought to roughly model this process of human perception.
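The stochastic integration described above is commonly realized by adding a weighted language-model log-probability to the acoustic log-likelihood of each candidate word sequence and choosing the highest-scoring candidate. The following is a minimal sketch of that idea, not the authors' system: the candidate sentences, all score values, and the names `ACOUSTIC_LOGP`, `LM_LOGP`, and `best_hypothesis` are hypothetical and chosen only for illustration.

```python
# Toy illustration of combining acoustic and language-model scores.
# All numbers are invented; a real recognizer would obtain the acoustic
# log-likelihoods from HMMs and the linguistic log-probabilities from an
# N-gram model.

# Hypothetical acoustic log-likelihoods, log P(X | W), per candidate W.
ACOUSTIC_LOGP = {
    "recognize speech": -12.0,
    "wreck a nice beach": -11.5,
}

# Hypothetical language-model log-probabilities, log P(W), per candidate W.
LM_LOGP = {
    "recognize speech": -4.0,
    "wreck a nice beach": -9.0,
}


def combined_score(hypothesis: str, lm_weight: float) -> float:
    """Acoustic log-likelihood plus weighted language-model log-probability."""
    return ACOUSTIC_LOGP[hypothesis] + lm_weight * LM_LOGP[hypothesis]


def best_hypothesis(lm_weight: float) -> str:
    """Return the candidate with the highest combined score."""
    return max(ACOUSTIC_LOGP, key=lambda w: combined_score(w, lm_weight))


# With lm_weight = 0.0 only the acoustic score is used.
acoustic_only_choice = best_hypothesis(0.0)
# With lm_weight = 1.0 the linguistic score is added to the acoustic score.
combined_choice = best_hypothesis(1.0)

print(acoustic_only_choice)
print(combined_choice)
```

In this toy setting the acoustic score alone slightly prefers the implausible word sequence, and adding the linguistic score reverses the decision, which is the kind of effect the two knowledge sources are expected to have on each other.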

Although these models have drastically improved the recognition of well-formed read speech, they do not yet deliver sufficient performance on spontaneous speech recognition tasks because of various phenomena particular to spontaneous speech.

In this paper, we conducted experiments that simulate N-gram language models by combining human acoustic knowledge with explicitly provided local context. We confirmed that, when only local information is available as linguistic knowledge, the two words neighboring the target word are enough to improve recognition performance. We also confirmed that coarticulation affects the perception of short words.

We then compared several language models on an automatic speech recognizer. Acoustic scores were calculated with HMMs, and linguistic scores calculated from a language model were then added. We obtained a recognition rate of 37.5% with the acoustic model alone and 51.0% with both the acoustic and language models, a relative improvement of 36%. With the language model alone, the recognition rate was 16.5%, so adding the acoustic model improved performance by 209% in relative terms. The performance of the language model on spontaneous speech is almost equal to that on read speech; thus, improving the acoustic models is more effective than improving the language model.
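The relative improvements quoted above follow directly from the three recognition rates. The short sketch below reproduces the arithmetic; the function name `relative_improvement` and the variable names are illustrative and do not come from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Recognition rates (%) reported in the abstract.
acoustic_only = 37.5
language_only = 16.5
combined = 51.0

# Gain from adding the language model to the acoustic model.
gain_from_language = relative_improvement(acoustic_only, combined)
# Gain from adding the acoustic model to the language model.
gain_from_acoustic = relative_improvement(language_only, combined)

print(f"{gain_from_language:.0f}%")  # 36%
print(f"{gain_from_acoustic:.0f}%")  # 209%
```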

Bibliographic reference. Kitaoka, Norihide / Shingu, Masahisa / Nakagawa, Seiichi (2003): "Comparison of effects of acoustic and language knowledge on spontaneous speech perception/recognition between human and automatic speech recognizer", in EUROSPEECH-2003, 2725-2728.