Paired Phone-Posteriors Approach to ESL Pronunciation Quality Assessment

Yujia Xiao, Frank Soong, Wenping Hu


This work proposes incorporating paired phone posteriors as input features to a neural network (NN) model for assessing ESL learners' pronunciation quality. Posteriors of forty phones, rather than several thousand sub-phonemic senones, are used to circumvent sparsity issues in NN training. Phone posteriors are assembled from their corresponding senone posteriors, which are estimated by a speaker-independent, DNN-based acoustic model trained on standard American English speech data (the Wall Street Journal database). Phone posteriors of both the reference speaker (a standard American English speaker) and the test speaker are paired together as augmented input feature vectors to train an NN-based two-class classifier that separates native from non-native speakers. Goodness of Pronunciation (GOP), a proven and effective measure, is used as the baseline for comparison. The binary NN classifier trained with these features achieves a classification accuracy of 89.6% on native and non-native speakers' data. The classifier also achieves a lower equal error rate (EER) than the GOP-based baseline classifier at both the phone and word levels: the EER drops from 18.3% to 6.2% at the phone level and from 12.98% to 2.54% at the word level.
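The feature construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how senone posteriors are combined or how frames are pooled, so summing the senone posteriors that belong to each phone and mean-pooling over the frames of a segment are both assumptions. The function names and the `senone_to_phone` mapping are hypothetical.

```python
import numpy as np


def senone_to_phone_posteriors(senone_post, senone_to_phone, num_phones):
    """Collapse frame-level senone posteriors into phone posteriors.

    senone_post:     (num_frames, num_senones) array, each row sums to 1.
    senone_to_phone: (num_senones,) integer array giving the phone index of
                     each senone (hypothetical mapping, assumed to be known).
    Assumption: a phone posterior is the sum of its senones' posteriors.
    """
    num_senones = senone_post.shape[1]
    # One-hot membership matrix: entry (s, p) is 1 if senone s belongs to phone p.
    membership = np.zeros((num_senones, num_phones))
    membership[np.arange(num_senones), senone_to_phone] = 1.0
    return senone_post @ membership


def paired_feature(ref_phone_post, test_phone_post):
    """Build one paired input vector for a segment.

    Assumption: each speaker's frames are mean-pooled over the segment, then
    the reference and test vectors are concatenated (length 2 * num_phones).
    """
    ref_vec = ref_phone_post.mean(axis=0)
    test_vec = test_phone_post.mean(axis=0)
    return np.concatenate([ref_vec, test_vec])
```

With forty phones, each paired vector in this sketch has 80 dimensions, far fewer than a vector built from several thousand senone posteriors, which reflects the sparsity argument made in the abstract.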


DOI: 10.21437/Interspeech.2018-1270

Cite as: Xiao, Y., Soong, F., Hu, W. (2018) Paired Phone-Posteriors Approach to ESL Pronunciation Quality Assessment. Proc. Interspeech 2018, 1631-1635, DOI: 10.21437/Interspeech.2018-1270.


@inproceedings{Xiao2018,
  author={Yujia Xiao and Frank Soong and Wenping Hu},
  title={Paired Phone-Posteriors Approach to ESL Pronunciation Quality Assessment},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1631--1635},
  doi={10.21437/Interspeech.2018-1270},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1270}
}