10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

High Performance Automatic Mispronunciation Detection Method Based on Neural Network and TRAP Features

Hongyan Li, Shijin Wang, Jiaen Liang, Shen Huang, Bo Xu

Chinese Academy of Sciences, China

In this paper, we propose a new approach that uses temporal information and a neural network (NN) to improve the performance of automatic mispronunciation detection (AMD). Firstly, alignments between the speech signals and their corresponding phoneme sequences are obtained within the classic GMM-HMM framework. Then, long-time TempoRAl Patterns (TRAPs) [5] features are introduced to describe pronunciation quality in place of conventional spectral features (e.g. MFCC). Based on the phoneme boundaries and TRAPs features, a Multi-Layer Perceptron (MLP) computes the final posterior probability of each test phoneme, which is judged mispronounced or not by comparison with a phone-dependent threshold. Moreover, we combine the TRAPs-MLP method with our existing methods to further improve performance. Experiments show that the TRAPs-MLP method yields a significant relative EER (Equal Error Rate) reduction of 39.04%, and the fusion of the TRAPs-MLP, GMM-UBM and GLDS-SVM [4] methods yields a relative EER reduction of 48.32%, both compared with the baseline GMM-UBM method.
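The decision step described above can be sketched as follows: each test phoneme receives an MLP posterior probability and is flagged as mispronounced when that posterior falls below a phone-dependent threshold. This is a minimal illustrative sketch; the function name, example phones, and threshold values are assumptions, not taken from the paper.

```python
# Hypothetical sketch of the threshold-based decision step: a phone is
# flagged as mispronounced when its MLP posterior falls below that
# phone's threshold. All names and values here are illustrative.

def detect_mispronunciations(posteriors, thresholds):
    """posteriors: {phone: [posterior per occurrence]},
    thresholds: {phone: phone-dependent threshold}.
    Returns {phone: [True if flagged as mispronounced]}."""
    flagged = {}
    for phone, scores in posteriors.items():
        thr = thresholds[phone]
        flagged[phone] = [score < thr for score in scores]
    return flagged

# Example: two occurrences of "ae", one of "sh" (made-up scores).
posteriors = {"ae": [0.91, 0.42], "sh": [0.33]}
thresholds = {"ae": 0.55, "sh": 0.50}
print(detect_mispronunciations(posteriors, thresholds))
# → {'ae': [False, True], 'sh': [True]}
```

Tuning each phone's threshold independently (e.g. to equalize false-acceptance and false-rejection rates per phone) is what makes the EER a natural evaluation metric for this decision rule.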


Bibliographic reference.  Li, Hongyan / Wang, Shijin / Liang, Jiaen / Huang, Shen / Xu, Bo (2009): "High performance automatic mispronunciation detection method based on neural network and TRAP features", In INTERSPEECH-2009, 1911-1914.