Advances in Automatic Speech Recognition for Child Speech Using Factored Time Delay Neural Network

Fei Wu, Leibny Paola García-Perera, Daniel Povey, Sanjeev Khudanpur


Automatic speech recognition (ASR) has advanced considerably for adult speech; however, when the same models are tested on child speech, they do not reach satisfactory word error rates (WER). This is mainly due to the high acoustic variability of child speech and the scarcity of clean, labeled corpora. We apply the factored time delay neural network (TDNN-F) to the child speech domain and find that it yields better performance. To enable our models to handle varied noise conditions and extremely small corpora, we augment the original training data with added noise and reverberation. Compared with conventional GMM-HMM and TDNN systems, TDNN-F performs better on two widely accessible corpora, CMU Kids and CSLU Kids, as well as on their combination. Our system achieves a 26% relative improvement in WER.
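The core idea behind TDNN-F is to factor each layer's weight matrix M into a product A·B through a low-dimensional bottleneck, with B constrained to be semi-orthogonal (B·Bᵀ = I). The sketch below illustrates that factorization in NumPy via a truncated SVD; the layer sizes are illustrative placeholders, not the paper's configuration, and in actual TDNN-F training the semi-orthogonal constraint is maintained by periodic corrective updates during SGD rather than by an explicit SVD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (real TDNN-F layers are much larger,
# e.g. ~1536-dim hidden layers with ~160-dim bottlenecks).
d_out, d_in, bottleneck = 64, 48, 16

# A full-rank weight matrix, standing in for one TDNN layer.
M = rng.standard_normal((d_out, d_in))

# Factor M ≈ A @ B with a truncated SVD, absorbing the singular
# values into A so that B comes out exactly semi-orthogonal.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
A = U[:, :bottleneck] * s[:bottleneck]   # d_out x bottleneck
B = Vt[:bottleneck, :]                   # bottleneck x d_in, orthonormal rows

# The semi-orthogonal constraint: B @ B.T should equal the identity.
err = np.linalg.norm(B @ B.T - np.eye(bottleneck))
print(f"semi-orthogonality error: {err:.2e}")
```

The factorization cuts the parameter count from d_out·d_in to (d_out + d_in)·bottleneck, which is one reason TDNN-F copes better with the very small child-speech corpora discussed above.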


DOI: 10.21437/Interspeech.2019-2980

Cite as: Wu, F., García-Perera, L.P., Povey, D., Khudanpur, S. (2019) Advances in Automatic Speech Recognition for Child Speech Using Factored Time Delay Neural Network. Proc. Interspeech 2019, 1-5, DOI: 10.21437/Interspeech.2019-2980.


@inproceedings{Wu2019,
  author={Fei Wu and Leibny Paola García-Perera and Daniel Povey and Sanjeev Khudanpur},
  title={{Advances in Automatic Speech Recognition for Child Speech Using Factored Time Delay Neural Network}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1--5},
  doi={10.21437/Interspeech.2019-2980},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2980}
}