Multimodal Articulation-Based Pronunciation Error Detection with Spectrogram and Acoustic Features

Sabrina Jenne, Ngoc Thang Vu


Articulation-based pronunciation error detection is the task of diagnosing mispronounced segments in non-native speech at the level of broad phonological properties, such as place of articulation or voicing. Using acoustic features and visual spectrograms extracted from native English utterances, we train several neural classifiers that deduce articulatory properties from segments extracted from non-native English utterances. Visual cues are processed by convolutional neural networks, while acoustic cues are processed by recurrent neural networks.

We show that combining both modalities increases performance over using either model in isolation, with important implications for user satisfaction. Furthermore, we test the impact of alignment quality on model performance by comparing results on manually corrected segments and force-aligned segments, showing that the proposed pipeline can dispense with manual correction.
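The combination of modalities described above can be sketched as a late-fusion classifier: a CNN branch encodes the spectrogram image, a recurrent branch encodes the frame-level acoustic features, and their embeddings are concatenated before a final layer over articulatory classes. This is an illustrative PyTorch sketch, not the paper's architecture; all layer sizes, the GRU cell choice, the 13-dimensional acoustic features, and the class names are assumptions.

```python
import torch
import torch.nn as nn

class MultimodalArticulationClassifier(nn.Module):
    """Hypothetical late-fusion sketch: CNN over spectrograms +
    GRU over acoustic feature sequences (sizes are illustrative)."""

    def __init__(self, n_acoustic_feats=13, n_classes=3):
        super().__init__()
        # CNN branch for the spectrogram image, shape (B, 1, H, W)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # -> (B, 32, 1, 1)
        )
        # RNN branch for the acoustic feature sequence, shape (B, T, F)
        self.rnn = nn.GRU(n_acoustic_feats, 32, batch_first=True)
        # Fusion: concatenate both 32-dim embeddings, then classify
        self.classifier = nn.Linear(32 + 32, n_classes)

    def forward(self, spectrogram, acoustic_seq):
        v = self.cnn(spectrogram).flatten(1)  # visual embedding (B, 32)
        _, h = self.rnn(acoustic_seq)         # final hidden state (1, B, 32)
        a = h.squeeze(0)                      # acoustic embedding (B, 32)
        return self.classifier(torch.cat([v, a], dim=1))

# Forward pass on dummy segment data (shapes are assumptions)
model = MultimodalArticulationClassifier()
spec = torch.randn(4, 1, 64, 80)   # batch of 4 spectrogram crops
feats = torch.randn(4, 50, 13)     # 50 frames of 13 acoustic features each
logits = model(spec, feats)
print(logits.shape)  # torch.Size([4, 3])
```

Unimodal baselines correspond to using only one branch's embedding; the fusion here simply concatenates the two before the final linear layer.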


DOI: 10.21437/Interspeech.2019-1677

Cite as: Jenne, S., Vu, N.T. (2019) Multimodal Articulation-Based Pronunciation Error Detection with Spectrogram and Acoustic Features. Proc. Interspeech 2019, 3549-3553, DOI: 10.21437/Interspeech.2019-1677.


@inproceedings{Jenne2019,
  author={Sabrina Jenne and Ngoc Thang Vu},
  title={{Multimodal Articulation-Based Pronunciation Error Detection with Spectrogram and Acoustic Features}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3549--3553},
  doi={10.21437/Interspeech.2019-1677},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1677}
}