Speech and Language Technology in Education (SLaTE2007)

The Summit Inn, Farmington, PA, USA
October 1-3, 2007

Are Learners Myna Birds to the Averaged Distributions of Native Speakers? - A Note of Warning from a Serious Speech Engineer -

Nobuaki Minematsu

Graduate School of Frontier Sciences, The University of Tokyo, Japan

Current speech recognition technology consists of clearly separate modules: acoustic models, language models, a pronunciation dictionary, and a decoder. CALL systems often use the acoustic matching module to compare a learner’s utterance to templates stored in the system. The acoustic template of a phrase is usually calculated by collecting utterances of that phrase spoken by native speakers and estimating their averaged distribution. If phoneme-based comparison is required, phoneme-based templates must be prepared, and Hidden Markov Models are often adopted for training them. In this framework, a learner’s utterance is compared acoustically and directly to the averaged distributions, and the notorious mismatch problem more or less inevitably arises. I wonder whether this framework is pedagogically sound. No child acquires language by acoustically imitating their parents’ voices. Male learners do not have to produce female voices even when a female teacher asks them to repeat after her. What in a learner’s utterance should be acoustically matched with what in a teacher’s utterance? I believe that current speech technology has no good answer, and this paper proposes a candidate answer by regarding speech as music.

Bibliographic reference. Minematsu, Nobuaki (2007): "Are learners myna birds to the averaged distributions of native speakers? - a note of warning from a serious speech engineer -", In SLaTE-2007, 100-103.