We investigate how to assess the prosody quality of an ESL learner’s spoken sentence against a native speaker’s natural recording or a TTS-synthesized voice. An English utterance read by an ESL learner is compared with the reference at the mapped syllable level: the corresponding F0 contours (with voicing decisions) and breaks are aligned via dynamic time warping (DTW). After the speech rates and F0 distributions of the two speakers are equalized, the correlations between the learner’s and the reference’s prosody patterns of the same sentence are computed. Using collected databases of native and non-native speakers, we model the resulting correlation coefficients as continuous distributions with Gaussian mixtures and train a two-class (native vs. non-native) neural network classifier. Classification accuracy with a native speaker’s reference is close to that with a TTS reference: 91.2% vs. 88.1%. For assessing the prosody proficiency of an ESL learner from a single sentence input, the prosody patterns of our high-quality TTS are thus almost as effective as native speakers’ recordings, which are more expensive and inconvenient to collect.
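The core comparison in the abstract (equalize F0 distributions, align the two contours with DTW, then correlate the warped sequences) can be sketched as follows. This is a minimal illustration, not the paper’s implementation: the function names (`prosody_correlation`, `dtw_path`) are hypothetical, the inputs are assumed to be per-syllable F0 values already extracted from each utterance, and speech-rate equalization is left to the DTW warp itself.

```python
import numpy as np

def zscore(x):
    # Equalize F0 distributions across speakers via mean/variance normalization.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def dtw_path(a, b):
    # Plain DTW over two 1-D sequences; returns the list of aligned index pairs.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # Backtrack from (n, m) to (0, 0) along the minimal-cost predecessors.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        moves = []
        if i > 0 and j > 0:
            moves.append((cost[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            moves.append((cost[i - 1, j], i - 1, j))
        if j > 0:
            moves.append((cost[i, j - 1], i, j - 1))
        _, i, j = min(moves)
    path.reverse()
    return path

def prosody_correlation(f0_learner, f0_reference):
    # Normalize both contours, align them with DTW, and correlate the
    # warped sequences -- one scalar similarity per sentence.
    a, b = zscore(f0_learner), zscore(f0_reference)
    path = dtw_path(a, b)
    wa = np.array([a[i] for i, _ in path])
    wb = np.array([b[j] for _, j in path])
    return float(np.corrcoef(wa, wb)[0, 1])
```

A contour that rises like the reference but sits in a different F0 register (e.g. a lower-pitched voice) still scores near 1.0 after normalization, which is the point of equalizing the distributions before correlating.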
Cite as: Xiao, Y., Soong, F.K. (2017) Proficiency Assessment of ESL Learner’s Sentence Prosody with TTS Synthesized Voice as Reference. Proc. Interspeech 2017, 1755-1759, doi: 10.21437/Interspeech.2017-64