Hierarchical Recurrent Neural Networks for Acoustic Modeling

Jinhwan Park, Iksoo Choi, Yoonho Boo, Wonyong Sung


Recurrent neural network (RNN)-based acoustic models are widely used in speech recognition, and end-to-end training with connectionist temporal classification (CTC) shows good performance. To improve the ability to retain temporally distant information, we apply hierarchical recurrent neural networks (HRNNs) to acoustic modeling for speech recognition. An HRNN consists of multiple RNN layers that operate on different time scales, and the frequency of operation at each layer is controlled by gates learned from training data. We employ gate activation regularization techniques to control the operation of the hierarchical layers. When tested on WSJ eval92, our best model obtained a word error rate of 5.19% with beam search decoding using RNN-based character-level language models. Compared to an LSTM-based acoustic model with a similar parameter size, we achieved a relative word error rate improvement of 10.5%. Even though this model employs uni-directional RNNs, it showed performance improvements over previous bi-directional RNN-based acoustic models.
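
The gating idea described in the abstract can be illustrated with a small sketch. This is not the authors' model: the abstract does not give the equations, so the two-layer layout, the plain tanh cells, the scalar sigmoid gate, and the mean-activation penalty below are all assumptions chosen for brevity, and every name (`hrnn_forward`, `gate_activation_penalty`, `init_params`) is hypothetical. The lower layer updates on every frame, while a learned gate decides how strongly the upper layer updates, so the upper layer can effectively run on a slower time scale.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights for the illustrative two-layer model (assumed layout)."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    W1 = rng.normal(0.0, scale, (hidden_dim, input_dim))   # input -> lower layer
    U1 = rng.normal(0.0, scale, (hidden_dim, hidden_dim))  # lower recurrence
    W2 = rng.normal(0.0, scale, (hidden_dim, hidden_dim))  # lower -> upper layer
    U2 = rng.normal(0.0, scale, (hidden_dim, hidden_dim))  # upper recurrence
    w_g = rng.normal(0.0, scale, hidden_dim)               # gate weights
    return W1, U1, W2, U2, w_g


def hrnn_forward(x, params):
    """Run the sketch over x of shape (T, input_dim).

    Returns the upper-layer states, shape (T, hidden_dim), and the gate
    activations, shape (T,).
    """
    W1, U1, W2, U2, w_g = params
    h1 = np.zeros(U1.shape[0])
    h2 = np.zeros(U2.shape[0])
    upper_states, gates = [], []
    for frame in x:
        # Lower layer: updated at every frame.
        h1 = np.tanh(W1 @ frame + U1 @ h1)
        # Gate in (0, 1), computed from the lower layer (assumed form).
        z = sigmoid(w_g @ h1)
        # Upper layer: a gate near 0 keeps the old state, a gate near 1 updates it.
        candidate = np.tanh(W2 @ h1 + U2 @ h2)
        h2 = z * candidate + (1.0 - z) * h2
        upper_states.append(h2)
        gates.append(z)
    return np.stack(upper_states), np.array(gates)


def gate_activation_penalty(gates, weight=0.01):
    """Assumed regularizer: penalize the mean gate activation so the
    upper layer is pushed toward updating less often."""
    return weight * float(np.mean(gates))


params = init_params(input_dim=40, hidden_dim=8)
features = np.random.default_rng(1).normal(size=(20, 40))  # 20 synthetic frames
upper, gates = hrnn_forward(features, params)
penalty = gate_activation_penalty(gates)
```

In training, a term like `penalty` would be added to the CTC loss; the weight of 0.01 here is an arbitrary placeholder, not a value from the paper.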


DOI: 10.21437/Interspeech.2018-1797

Cite as: Park, J., Choi, I., Boo, Y., Sung, W. (2018) Hierarchical Recurrent Neural Networks for Acoustic Modeling. Proc. Interspeech 2018, 3728-3732, DOI: 10.21437/Interspeech.2018-1797.


@inproceedings{Park2018,
  author={Jinhwan Park and Iksoo Choi and Yoonho Boo and Wonyong Sung},
  title={Hierarchical Recurrent Neural Networks for Acoustic Modeling},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3728--3732},
  doi={10.21437/Interspeech.2018-1797},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1797}
}