Multi-Task Learning for Mispronunciation Detection on Singapore Children’s Mandarin Speech

Rong Tong, Nancy F. Chen, Bin Ma


Speech technology for children is more challenging than for adults because children’s speech corpora are scarce. Moreover, children’s speech is more heterogeneous due to variability in anatomy across age and gender, larger variance in speaking rate and vocal effort, and immature command of word usage, grammar, and linguistic structure. Speech produced by Singapore children is even more variable because of the city-state’s multilingual environment, in which Chinese languages (e.g., Hokkien and Mandarin), English dialects (e.g., American and British), and Indian languages (e.g., Hindi and Tamil) influence one another. In this paper, we show that acoustic modeling of children’s speech can leverage a larger set of adult data. We compare two data augmentation approaches for children’s acoustic modeling. The first approach disregards the child and adult categories and pools the two datasets into a single set. The second approach is multi-task learning: during training, the acoustic characteristics of adults and children are jointly learned through the shared hidden layers of a deep neural network, while each group retains its own targets through a distinct softmax layer. We empirically show that the multi-task learning approach outperforms the baseline in both speech recognition and computer-assisted pronunciation training.
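The multi-task arrangement described in the abstract (shared hidden layers feeding two separate softmax layers) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the class name `MultiTaskAcousticModel`, the layer sizes, the target counts, the ReLU activation, and the random initialization are all assumptions chosen for illustration, since the abstract does not specify them. Only the standard library is used.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiTaskAcousticModel:
    """Shared hidden layers with one softmax output layer per speaker group."""

    def __init__(self, n_features, n_hidden, n_shared_layers, n_targets, seed=0):
        # n_shared_layers is assumed to be at least 1.
        rng = random.Random(seed)
        sizes = [n_features] + [n_hidden] * n_shared_layers
        # Hidden layers are shared by the adult and child tasks.
        self.shared = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
        # Each task keeps its own output layer and target inventory.
        self.heads = {task: init_layer(n_hidden, n, rng) for task, n in n_targets.items()}

    def posteriors(self, frame, task):
        hidden = frame
        for weights, bias in self.shared:
            # ReLU is an assumption; the abstract does not name the activation.
            hidden = [max(0.0, z) for z in linear(hidden, weights, bias)]
        weights, bias = self.heads[task]
        return softmax(linear(hidden, weights, bias))


def cross_entropy(posteriors, target):
    """Loss for one frame; only the head matching the frame's group is used."""
    return -math.log(posteriors[target])


# Toy sizes, chosen only to keep the example small.
model = MultiTaskAcousticModel(
    n_features=40, n_hidden=16, n_shared_layers=2,
    n_targets={"adult": 6, "child": 5},
)
rng = random.Random(1)
frame = [rng.gauss(0.0, 1.0) for _ in range(40)]

adult_post = model.posteriors(frame, "adult")
child_post = model.posteriors(frame, "child")
child_loss = cross_entropy(child_post, target=3)
```

In training, a frame from a child utterance would contribute a loss only through the `"child"` head and an adult frame only through the `"adult"` head, while gradients from both update the shared layers. The pooled alternative in the abstract corresponds to a single head trained on the merged data.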


DOI: 10.21437/Interspeech.2017-520

Cite as: Tong, R., Chen, N.F., Ma, B. (2017) Multi-Task Learning for Mispronunciation Detection on Singapore Children’s Mandarin Speech. Proc. Interspeech 2017, 2193-2197, DOI: 10.21437/Interspeech.2017-520.


@inproceedings{Tong2017,
  author={Rong Tong and Nancy F. Chen and Bin Ma},
  title={Multi-Task Learning for Mispronunciation Detection on Singapore Children’s Mandarin Speech},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2193--2197},
  doi={10.21437/Interspeech.2017-520},
  url={http://dx.doi.org/10.21437/Interspeech.2017-520}
}