Speech and Language Technology in Education (SLaTE2007)
The Summit Inn, Farmington, PA, USA
Current speech recognition technology consists of clearly separated modules: acoustic models, language models, a pronunciation dictionary, and a decoder. CALL systems often use the acoustic matching module to compare a learner's utterance to templates stored in the system. The acoustic template of a phrase is usually built by collecting utterances of that phrase spoken by native speakers and estimating their averaged distribution. If phoneme-based comparison is required, phoneme-based templates must be prepared, and Hidden Markov Models are often adopted to train them. In this framework, a learner's utterance is compared acoustically and directly to the averaged distributions, and the notorious mismatch problem then arises more or less inevitably. I wonder whether this framework is pedagogically sound. No child acquires language by acoustically imitating its parents' voices. Male learners do not have to produce female voices even when a female teacher asks them to repeat after her. What in a learner's utterance should be acoustically matched with what in a teacher's utterance? I argue that current speech technology has no good answer to this question, and this paper proposes a promising candidate answer by regarding speech as music.
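The mismatch problem described above can be illustrated with a minimal sketch. The code below is not the paper's method; it is a hypothetical toy version of template-based acoustic scoring, in which native speakers' feature frames are averaged into a single diagonal-Gaussian template and a learner's frames are scored by their log-likelihood under it. All function names and the synthetic feature data are assumptions for illustration; the point is that a voice whose features are merely shifted (e.g. a different vocal tract) scores poorly even if its pronunciation is otherwise fine.

```python
import numpy as np

def train_template(native_utterances):
    """Average native speakers' feature frames into one Gaussian template
    (mean and diagonal variance), as in template-based CALL scoring."""
    frames = np.vstack(native_utterances)            # (total_frames, dim)
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def acoustic_match_score(utterance, mean, var):
    """Mean per-frame diagonal-Gaussian log-likelihood of the learner's
    frames under the native template: a direct acoustic comparison."""
    diff = utterance - mean
    ll = -0.5 * (np.log(2 * np.pi * var) + diff ** 2 / var).sum(axis=1)
    return ll.mean()

# Synthetic 12-dimensional "feature" frames standing in for real speech features.
rng = np.random.default_rng(0)
native = [rng.normal(0.0, 1.0, size=(50, 12)) for _ in range(5)]
mean, var = train_template(native)

matched = rng.normal(0.0, 1.0, size=(40, 12))   # voice acoustically close to the template
shifted = rng.normal(2.0, 1.0, size=(40, 12))   # same "pronunciation", shifted voice quality

print(acoustic_match_score(matched, mean, var))  # high score
print(acoustic_match_score(shifted, mean, var))  # penalized by the acoustic mismatch
```

The shifted speaker is penalized purely for where their features sit in acoustic space, which is exactly the pedagogical objection raised above: the comparison rewards imitating the template's voice, not the linguistic content.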
Bibliographic reference. Minematsu, Nobuaki (2007): "Are learners myna birds to the averaged distributions of native speakers? - a note of warning from a serious speech engineer -", In SLaTE-2007, 100-103.