Modeling Pronunciation Variation for Automatic Speech Recognition

Rolduc, The Netherlands
May 4-6, 1998

Statistical Modelling of Pronunciation: It's Not The Model, It's The Data

Florian Schiel (1), Andreas Kipp (2), Hans-Günther Tillmann (2)

(1) Bavarian Archive for Speech Signals (BAS), Munich, Germany
(2) Department of Phonetics and Speech Communication, University of Munich, Germany

In this paper we describe a method to model pronunciation for ASR in the German VERBMOBIL task. Our findings suggest that a simple model, i.e. pronunciation variants modelled by SAM-PA units and weighted with a-posteriori probabilities, can be used successfully for ASR, provided that a sufficient amount of reliably transcribed speech data is available. Manual segmentation and labelling of speech (especially spontaneous speech, as in the scheduling task of VERBMOBIL) is very expensive and time-consuming and requires carefully trained experts and supervisors. Even with considerable effort it is not possible to produce broad phonetic transcripts for more than a small part of today's customary speech databases. Therefore, as a first step in our approach, we developed the fully automatic segmentation and labelling tool MAUS ('Munich Automatic Segmentation') for spontaneous German speech. The first part of our presentation will give a concise description of the MAUS method as well as an evaluation that compares the results of MAUS with the inter-labeller agreement of three expert phoneticians on the same data. The results show that MAUS operates within the range of human experts in terms of transcription, while its timing information still falls short of the quality achieved by human segmenters. In a second step we used the MAUS system to segment and label 32 h of speech in the 1996 VERBMOBIL acoustic evaluation, obtaining more than 320,000 transcribed words from the scheduling task. A simple counting, pruning and discounting technique (similar to that used for language modelling) is used to derive a probabilistic model of pronunciation. It provides a varying number of pronunciation variants per lexical entity together with the a-posteriori probability P(V|W) that a variant V is uttered given the lexical entity W.
A baseline system using HTK was set up for the 1996 VERBMOBIL evaluation task using monophones and a 'most likely' pronunciation dictionary (the 'most likely' pronunciation was chosen by a human expert, NOT derived from empirical data). A second system with statistical modelling of pronunciation together with a proper re-training of the acoustic models showed significantly better results on the same task in terms of word accuracy. From these findings we conclude that more is to be gained from obtaining reliably and precisely labelled and segmented speech data than from investigating very complex models, which are usually prone to over-generalisation and lexical ambiguity.
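
The counting, pruning and discounting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scheme, so everything specific here is an assumption made for illustration: the function name `pronunciation_model`, the count threshold `min_count`, the use of absolute discounting with a constant `discount`, the renormalisation over the surviving variants, and the example SAM-PA strings.

```python
from collections import Counter, defaultdict


def pronunciation_model(observations, min_count=2, discount=0.5):
    """Estimate P(V|W) from (word, variant) pairs.

    Hypothetical sketch: the threshold pruning and absolute discounting
    used here are assumptions, not the scheme of the original paper.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")

    # Counting: how often was each variant observed for each word?
    counts = defaultdict(Counter)
    for word, variant in observations:
        counts[word][variant] += 1

    model = {}
    for word, variant_counts in counts.items():
        # Pruning: drop variants seen fewer than min_count times.
        kept = {v: c for v, c in variant_counts.items() if c >= min_count}
        if not kept:
            # Keep at least the most frequent variant for every word.
            best, best_count = variant_counts.most_common(1)[0]
            kept = {best: best_count}

        # Discounting: subtract a constant from every surviving count,
        # then renormalise so the probabilities for the word sum to 1.
        discounted = {v: c - discount for v, c in kept.items()}
        total = sum(discounted.values())
        model[word] = {v: d / total for v, d in discounted.items()}
    return model


# Made-up observations for one word, as an automatic labeller might emit.
observations = (
    [("haben", "ha:b@n")] * 5
    + [("haben", "ha:bm")] * 3
    + [("haben", "ha:m")] * 1
)
model = pronunciation_model(observations)
for variant, prob in sorted(model["haben"].items(), key=lambda kv: -kv[1]):
    print(f"haben  {variant:8s} P(V|W) = {prob:.3f}")
```

In this sketch the single observation of the rare variant is pruned, and the remaining probability mass is shared between the two surviving variants. The resulting weighted variant list is the kind of entry that could populate a probabilistic pronunciation dictionary.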


Bibliographic reference.  Schiel, Florian / Kipp, Andreas / Tillmann, Hans-Günther (1998): "Statistical modelling of pronunciation: it's not the model, it's the data", In MPV-1998, 131-136.