A Deep Learning Approach to Assessing Non-native Pronunciation of English Using Phone Distances

Konstantinos Kyriakopoulos, Kate Knill, Mark Gales


The way a non-native speaker pronounces the phones of a language is an important predictor of their proficiency. In grading spontaneous speech, the pairwise distances between generative statistical models trained on each phone have been shown to be powerful features. This paper presents a deep learning alternative to model-based phone distances in the form of a tunable Siamese network feature extractor that learns distance metrics directly from the audio frame sequence. Features are extracted at the phone instance level and combined into phone-level representations using an attention mechanism. Pairwise distances between phone features are then projected through a feed-forward layer to predict the score. The extraction stage is initialised either on a binary phone instance-pair classification task or to mimic the model-based features; the whole system is then fine-tuned end-to-end, optimising the learning of the distance metric for the score prediction task. This method is therefore more adaptable and more sensitive to phone-instance-level phenomena. Its performance is compared against that of a DNN trained on Gaussian phone model distance features.
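The pipeline the abstract outlines — pool phone-instance embeddings into phone-level representations with attention, take pairwise distances between phones, and project the distance vector through a feed-forward layer to a score — can be sketched as below. This is an illustrative NumPy sketch only, not the paper's implementation: the embedding dimension, the dot-product form of the attention, the Euclidean distance, and all weights (`w_att`, `w_out`, `b_out`) are assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(instances, w):
    """Combine instance-level features (n_instances, d) into one
    phone-level representation (d,) via softmax attention weights.
    The dot-product attention form is an assumption for illustration."""
    scores = instances @ w
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ instances  # attention-weighted sum over instances

# Hypothetical setup: 3 phones, each observed as 5 instance embeddings of dim 4
d = 4
phone_instances = [rng.normal(size=(5, d)) for _ in range(3)]
w_att = rng.normal(size=d)  # shared (Siamese) attention parameters

# Phone-level representations via attention pooling
reps = [attention_pool(p, w_att) for p in phone_instances]

# Pairwise Euclidean distances between phone representations
pairs = [(i, j) for i in range(len(reps)) for j in range(i + 1, len(reps))]
dists = np.array([np.linalg.norm(reps[i] - reps[j]) for i, j in pairs])

# Feed-forward projection of the distance features to a proficiency score
w_out = rng.normal(size=len(dists))
b_out = 0.0
score = float(dists @ w_out + b_out)
```

In the paper this whole chain is differentiable, so after initialisation the attention and projection parameters can be fine-tuned end-to-end against the grading target; the sketch above shows only a forward pass.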


DOI: 10.21437/Interspeech.2018-1087

Cite as: Kyriakopoulos, K., Knill, K., Gales, M. (2018) A Deep Learning Approach to Assessing Non-native Pronunciation of English Using Phone Distances. Proc. Interspeech 2018, 1626-1630, DOI: 10.21437/Interspeech.2018-1087.


@inproceedings{Kyriakopoulos2018,
  author={Konstantinos Kyriakopoulos and Kate Knill and Mark Gales},
  title={A Deep Learning Approach to Assessing Non-native Pronunciation of English Using Phone Distances},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1626--1630},
  doi={10.21437/Interspeech.2018-1087},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1087}
}