Detecting mispronounced words/phonemes and providing appropriate, to-the-point diagnostic feedback is a challenging task in Computer-Aided Pronunciation Training (CAPT). In this study, we propose a discriminative training algorithm that jointly optimizes error detection performance (i.e., false rejection and false acceptance) and diagnosis feedback accuracy (i.e., accurately pinpointing the mispronounced words/phones and providing proper feedback). An optimization procedure, similar to Minimum Word Error (MWE) discriminative training, is developed to refine the ML-trained HMMs. The errors to be minimized are obtained by comparing training utterances hand-transcribed by phoneticians with the canonical pronunciations of the words and with common mispronunciations, which are embedded in a “confusion network” (compiled from hand-crafted rules or from data-driven rules derived from labeled training data). A database of 8,575 English utterances (split into 5,988 for training and 2,587 for testing) spoken by 100 Cantonese learners of English is used to measure the performance of the new algorithm. Several conclusions can be drawn from the experiments: (1) data-driven rules are more effective than hand-crafted ones in capturing (modeling) mispronunciations; (2) compared with the ML-training baseline, discriminative training reduces false rejections and diagnostic errors while degrading false acceptance performance slightly, owing to the smaller number of false-acceptance samples in the training set.
Bibliographic reference. Qian, Xiaojun / Soong, Frank K. / Meng, Helen (2010): "Discriminative acoustic model for improving mispronunciation detection and diagnosis in computer-aided pronunciation training (CAPT)", In INTERSPEECH-2010, 757-760.
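The abstract scores a system along three axes: false rejections (correct speech flagged as an error), false acceptances (a mispronunciation passed as correct), and diagnostic errors (an error caught but the wrong phone named). A minimal sketch of this error taxonomy, assuming pre-aligned phone sequences and illustrative phone labels not taken from the paper:

```python
# Hypothetical scoring sketch for mispronunciation detection and diagnosis.
# All function and variable names are illustrative, not from the paper.

def score(canonical, annotated, recognized):
    """Compare three pre-aligned phone sequences.

    canonical  : phones the word should contain
    annotated  : phones the learner actually produced (hand-labelled)
    recognized : phones the system decoded
    """
    counts = {"correct_accept": 0, "correct_diagnosis": 0,
              "false_reject": 0, "false_accept": 0, "diagnostic_error": 0}
    for can, ann, rec in zip(canonical, annotated, recognized):
        spoken_ok = (ann == can)   # learner pronounced the phone correctly
        system_ok = (rec == can)   # system decoded the canonical phone
        if spoken_ok and system_ok:
            counts["correct_accept"] += 1
        elif spoken_ok:
            counts["false_reject"] += 1       # correct speech flagged as error
        elif system_ok:
            counts["false_accept"] += 1       # error passed as correct
        elif rec == ann:
            counts["correct_diagnosis"] += 1  # error caught, phone identified
        else:
            counts["diagnostic_error"] += 1   # error caught, wrong phone named
    return counts

counts = score(canonical=["ae", "t", "r"],
               annotated=["ae", "d", "r"],
               recognized=["ae", "d", "l"])
# first phone: correct acceptance; second: correct diagnosis;
# third: false rejection (learner was right, system flagged it)
```

The paper's discriminative objective pushes the false-reject, false-accept, and diagnostic-error counts down jointly, rather than optimizing detection alone.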