Modeling Pronunciation Variation for Automatic Speech Recognition

Rolduc, The Netherlands
Current-generation automatic speech recognition (ASR) systems model spoken discourse as a linear sequence of words and phones. Because it is unusual for every phone within a word to be pronounced in a standard ("canonical") way, ASR systems often depend on a multi-pronunciation lexicon to match an acoustic sequence with a lexical unit. Since a word can, in practice, be pronounced in many different ways, this standard approach adds a layer of complexity and ambiguity to the decoding process; modifying the approach could potentially improve recognition performance. Analysis of pronunciation variation in a corpus of spontaneous English discourse (Switchboard) demonstrates that the observed variation is systematic at the level of the syllable. Syllabic onsets are realized in canonical form far more frequently than either coda or nuclear constituents. Prosodic stress also plays an important role in pronunciation. The governing mechanism is likely to involve the informational valence associated with syllable elements, and for this reason pronunciation variation offers a potential window onto the mechanisms responsible for the production and understanding of speech.

"The little things are infinitely the most important" - Arthur Conan Doyle
Bibliographic reference. Greenberg, Steven (1998): "Speaking in shorthand - a syllable-centric perspective for understanding pronunciation variation", In MPV-1998, 47-56.