In this paper, we propose to use a Deep Neural Network (DNN), which has recently been shown to reduce speech recognition errors significantly, in Computer-Aided Language Learning (CALL) to evaluate English learners' pronunciations. Multi-layer, stacked Restricted Boltzmann Machines (RBMs) are first trained as nonlinear basis functions to represent speech signals succinctly, and the output layer is discriminatively trained to optimize the posterior probabilities of the correct, sub-phonemic "senone" states. Three Goodness of Pronunciation (GOP) scores, namely the likelihood-based posterior probability, the averaged frame-level posteriors of the DNN output-layer "senone" nodes, and the log likelihood ratio of correct and competing models, are tested on recordings of both native and non-native speakers, along with manual grading of pronunciation quality. The experimental results show that the GOP estimated from the averaged frame-level "senone" posteriors correlates best with human scores. Compared with GOPs estimated with non-DNN (i.e., GMM-HMM) models, the new approach improves the correlations relatively by 22.0% at the word level and 15.6% at the sentence level. In addition, the frame-level posteriors, which need neither a decoding lattice nor the corresponding forward-backward computations, are suitable for supporting fast, on-line, multi-channel applications.
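The best-performing score above averages the DNN's frame-level posteriors of the aligned "senone" states over a segment. A minimal sketch of that averaging step is shown below; the data layout (per-frame posterior dictionaries and a forced-alignment senone sequence) and the use of log posteriors are assumptions for illustration, not details taken from the paper.

```python
import math

def gop_avg_posterior(posteriors, alignment):
    """Hypothetical GOP sketch: average log posterior of the aligned senones.

    posteriors: list of per-frame dicts mapping senone id -> P(senone | frame),
                as would come from a DNN's softmax output layer.
    alignment:  list of the correct senone id for each frame, as would come
                from a forced alignment of the reference transcription.
    """
    assert len(posteriors) == len(alignment)
    total = 0.0
    for frame_post, senone in zip(posteriors, alignment):
        # Accumulate the log posterior assigned to the aligned senone.
        total += math.log(frame_post[senone])
    # Normalize by segment length so scores are comparable across durations.
    return total / len(posteriors)
```

Because this score needs only the per-frame softmax outputs and an alignment, it avoids the decoding lattice and forward-backward passes that the likelihood-based scores require, which is what makes it attractive for on-line, multi-channel use.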
Bibliographic reference. Hu, Wenping / Qian, Yao / Soong, Frank K. (2013): "A new DNN-based high quality pronunciation evaluation for computer-aided language learning (CALL)", In INTERSPEECH-2013, 1886-1890.