ISCA Archive Interspeech 2015

Deep neural network training emphasizing central frames

Gakuto Kurata, Daniel Willett

It is common practice to concatenate several consecutive frames of acoustic features as the input to a Deep Neural Network (DNN) for speech recognition. The DNN is trained to map the concatenated frames as a whole to the HMM state corresponding to the center frame, and the side frames near both ends of the concatenated window are treated as equally important as the remaining central frames. Although the side frames are relevant to the HMM state of the center frame, this relationship may not fully generalize to unseen data. Thus, putting more emphasis on the central frames than on the side frames avoids over-fitting to the DNN training data. We propose a new DNN training method that emphasizes the central frames. We first conduct pre-training and fine-tuning with only the central frames, and then conduct fine-tuning with all of the concatenated frames. In large vocabulary continuous speech recognition experiments with more than 1,000 hours of DNN training data, we obtained a statistically significant relative error rate reduction of 1.68%.
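
To make the two input variants in the abstract concrete, here is a minimal Python sketch of frame splicing and of one possible way to restrict a window to its central frames. The abstract does not say how the side frames are excluded in the first stage, so zeroing them out is an assumption made only for illustration. The names `splice`, `mask_side_frames`, `context`, and `central`, and the repetition of edge frames at utterance boundaries, are likewise hypothetical.

```python
def splice(frames, t, context):
    """Concatenate frames t-context .. t+context into one flat input vector.

    Edge frames are repeated at utterance boundaries (an assumption).
    """
    last = len(frames) - 1
    window = []
    for offset in range(-context, context + 1):
        idx = min(max(t + offset, 0), last)  # clamp to a valid frame index
        window.extend(frames[idx])
    return window


def mask_side_frames(window, dim, context, central):
    """Zero every frame farther than `central` positions from the center.

    Zeroing is only one hypothetical way to present "central frames only"
    while keeping the input dimensionality fixed.
    """
    masked = list(window)
    for pos in range(2 * context + 1):
        if abs(pos - context) > central:
            masked[pos * dim:(pos + 1) * dim] = [0.0] * dim
    return masked


# Toy utterance: 7 frames of 2-dimensional features.
frames = [[float(t), float(t) + 0.5] for t in range(7)]

# Input for the final fine-tuning stage: all concatenated frames.
full = splice(frames, t=3, context=2)

# Input for the first stage under the zeroing assumption: central frames only.
stage1 = mask_side_frames(full, dim=2, context=2, central=1)
```

In this sketch both stages share the same input dimensionality, so a network trained on `stage1`-style inputs could continue training on `full`-style inputs without changing its first layer. Whether the paper does it this way is not stated in the abstract.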

doi: 10.21437/Interspeech.2015-713

Cite as: Kurata, G., Willett, D. (2015) Deep neural network training emphasizing central frames. Proc. Interspeech 2015, 3595-3599, doi: 10.21437/Interspeech.2015-713

  author={Gakuto Kurata and Daniel Willett},
  title={{Deep neural network training emphasizing central frames}},
  booktitle={Proc. Interspeech 2015},