September 22-25, 1997
One limitation of many speaker-independent recognition systems is their dependence on a single baseform dictionary to model word pronunciations. These dictionaries typically contain only a single (or 'ideal') pronunciation for each word. Previous work on improving dictionary models to include multiple pronunciations has met with mixed success; the alternatives may increase ambiguity in some cases. This paper investigates two approaches to improving lexical baseforms. The first is a 'bottom-up' approach in which 'ideal' transcriptions of utterances, looked up in a pronunciation dictionary, are compared to phonemic-level hand-annotated transcriptions. Analysing the differences between the two transcriptions reveals many common mispronunciations, accent-based alternatives, false starts and incorrect word substitutions. Each of these problems is illustrated in the paper, where it is also shown that unfamiliar words are prone to large numbers of alternative pronunciations. The second approach is more 'top-down'. Phonologically motivated rules and transforms are described which modify the lexical representation of the utterance, from which a pronunciation network is derived. This approach has the advantage of being able to explicitly model cross-word coarticulation effects, whereas the former approach models them only implicitly, and to a limited extent. The relative merits of each technique are investigated using a set of experiments performed on a phonetically rich database.
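The 'bottom-up' comparison described above amounts to aligning a dictionary baseform against a hand-annotated phonemic transcription and collecting the mismatches. As a minimal sketch (not the authors' implementation), the alignment can be done with a standard sequence matcher; the phone symbols and the example word are illustrative assumptions only:

```python
from difflib import SequenceMatcher

def compare_transcriptions(baseform, observed):
    """Align a dictionary ('ideal') baseform against an observed
    hand-annotated transcription and report where they differ.
    Each difference is (operation, ideal phones, observed phones)."""
    diffs = []
    sm = SequenceMatcher(a=baseform, b=observed, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            diffs.append((op, baseform[i1:i2], observed[j1:j2]))
    return diffs

# Hypothetical example: a casually reduced pronunciation of 'probably',
# using ARPAbet-like phone symbols for illustration.
ideal = ["p", "r", "aa", "b", "ax", "b", "l", "iy"]
heard = ["p", "r", "aa", "b", "l", "iy"]
print(compare_transcriptions(ideal, heard))
```

Tallying such differences across a corpus would surface the recurring mispronunciations and accent-based alternatives the paper discusses; note that a simple edit-distance alignment captures cross-word coarticulation only implicitly, as the abstract points out.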
Bibliographic reference. Downey, Simon / Wiseman, Richard (1997): "Dynamic and static improvements to lexical baseforms", In EUROSPEECH-1997, 1027-1030.