This paper introduces two Transformer-based architectures for Mispronunciation Detection and Diagnosis (MDD). The first architecture (T-1) is a standard Transformer setup with an encoder, a decoder, a projection part and the Cross Entropy (CE) loss; it takes Mel-Frequency Cepstral Coefficients (MFCCs) as input. The second architecture (T-2) is based on wav2vec 2.0, a self-supervised pretraining framework. T-2 is composed of a CNN feature encoder, a stack of Transformer blocks that capture contextual speech representations, a projection part and the Connectionist Temporal Classification (CTC) loss; unlike T-1, it takes raw audio as input. Both models are trained end-to-end. Experiments are conducted on the CU-CHLOE corpus, where T-1 achieves a Phone Error Rate (PER) of 8.69% and an F-measure of 77.23%, and T-2 achieves a PER of 5.97% and an F-measure of 80.98%. Both models significantly outperform the previously proposed AGPM and CNN-RNN-CTC models, which obtain PERs of 11.1% and 12.1% and F-measures of 72.61% and 74.65%, respectively.
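To make the two architectures concrete, below is a minimal PyTorch sketch of T-1 and T-2 as described in the abstract. It is not the authors' implementation: the phone-set size (NUM_PHONES), model dimensions, layer counts, CNN configuration, and MFCC settings are illustrative assumptions, and the wav2vec 2.0 self-supervised pretraining objective is omitted (only the downstream CTC fine-tuning path is shown).

# Minimal sketch of the two MDD architectures, assuming PyTorch/torchaudio.
# NOT the authors' code: all hyperparameters below are illustrative.
import torch
import torch.nn as nn
import torchaudio

NUM_PHONES = 48  # assumed phone-set size (index 0 reserved as CTC blank for T-2)

# ---- T-1: encoder-decoder Transformer on MFCC features, trained with CE ----
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)

class T1(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.in_proj = nn.Linear(40, d_model)           # MFCC frame -> model dim
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True)
        self.embed = nn.Embedding(NUM_PHONES, d_model)  # previous phone tokens
        self.out_proj = nn.Linear(d_model, NUM_PHONES)  # projection part

    def forward(self, feats, prev_phones):
        x = self.in_proj(feats)                         # (B, T, d_model)
        y = self.embed(prev_phones)                     # (B, U, d_model)
        h = self.transformer(x, y)                      # (B, U, d_model)
        return self.out_proj(h)                         # logits for CE loss

# ---- T-2: wav2vec 2.0-style CNN encoder + Transformer blocks, CTC loss ----
class T2(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        # CNN feature encoder operating directly on the raw waveform
        # (wav2vec 2.0 uses a deeper conv stack; two layers shown for brevity)
        self.cnn = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2), nn.GELU())
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        self.context = nn.TransformerEncoder(enc_layer, num_layers=12)
        self.out_proj = nn.Linear(d_model, NUM_PHONES)  # projection part

    def forward(self, wav):                             # wav: (B, samples)
        z = self.cnn(wav.unsqueeze(1)).transpose(1, 2)  # (B, T', d_model)
        c = self.context(z)                             # contextual representations
        return self.out_proj(c).log_softmax(-1)         # log-probs for CTC

ce_loss = nn.CrossEntropyLoss()   # T-1 training objective
ctc_loss = nn.CTCLoss(blank=0)    # T-2 training objective

# Example usage (shapes only):
wav = torch.randn(2, 16000)                        # two 1-second utterances
feats = mfcc(wav).transpose(1, 2)                  # (B, frames, 40) for T-1
logits_t1 = T1()(feats, torch.zeros(2, 5, dtype=torch.long))
log_probs_t2 = T2()(wav)                           # (B, T', NUM_PHONES) for CTC

The key contrast the sketch illustrates is that T-1 consumes hand-crafted MFCC features and predicts phones autoregressively under a CE loss, while T-2 learns its features from raw audio and emits a frame-level phone distribution aligned by CTC.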
Cite as: Wu, M., Li, K., Leung, W.-K., Meng, H. (2021) Transformer Based End-to-End Mispronunciation Detection and Diagnosis. Proc. Interspeech 2021, 3954-3958, doi: 10.21437/Interspeech.2021-1467
@inproceedings{wu21h_interspeech,
  author={Minglin Wu and Kun Li and Wai-Kim Leung and Helen Meng},
  title={{Transformer Based End-to-End Mispronunciation Detection and Diagnosis}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3954--3958},
  doi={10.21437/Interspeech.2021-1467}
}