ISCA Archive Interspeech 2015

On speaker adaptation of long short-term memory recurrent neural networks

Yajie Miao, Florian Metze

Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture specialized for modeling long-range temporal dynamics. On acoustic modeling tasks, LSTM-RNNs have shown better performance than DNNs and conventional RNNs. In this paper, we conduct an extensive study on speaker adaptation of LSTM-RNNs. Speaker adaptation helps to reduce the mismatch between acoustic models and test speakers. We have two main goals for this study. First, on a benchmark dataset, we evaluate existing DNN adaptation techniques on the adaptation of LSTM-RNNs. We observe that LSTM-RNNs can be adapted effectively by using a speaker-adaptive (SA) front-end, or by inserting speaker-dependent (SD) layers. Second, we propose two adaptation approaches that implement the SD-layer-insertion idea specifically for LSTM-RNNs. Using these approaches, speaker adaptation improves word error rates by 3-4% relative over a strong LSTM-RNN baseline. This improvement grows to 6-7% relative when we exploit SA features for further adaptation.
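The SD-layer-insertion idea can be sketched in a few lines: a per-speaker affine layer is placed in front of a shared (speaker-independent) network, and only the SD layer is updated during adaptation. The sketch below is a minimal NumPy illustration under assumed names (`make_sd_layer`, `forward`, a random matrix standing in for the shared LSTM stack), not the paper's actual architecture or training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim = 40

# Stand-in for the shared, speaker-independent model (the LSTM stack in the paper).
W_shared = rng.standard_normal((feat_dim, feat_dim)) * 0.1

def make_sd_layer(dim):
    # One SD layer per speaker, initialized to identity so that an
    # unadapted speaker passes features through unchanged.
    return {"W": np.eye(dim), "b": np.zeros(dim)}

sd_layers = {"spk1": make_sd_layer(feat_dim)}

def forward(x, speaker):
    sd = sd_layers[speaker]
    h = x @ sd["W"] + sd["b"]   # speaker-dependent transform (adapted per speaker)
    return h @ W_shared          # shared layers, frozen during adaptation

x = rng.standard_normal(feat_dim)
# Before adaptation, the identity-initialized SD layer is a no-op.
print(np.allclose(forward(x, "spk1"), x @ W_shared))
```

During adaptation, only `sd_layers[speaker]` would be trained on that speaker's data, keeping the shared weights fixed; the identity initialization ensures adaptation starts from the speaker-independent baseline.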

doi: 10.21437/Interspeech.2015-290

Cite as: Miao, Y., Metze, F. (2015) On speaker adaptation of long short-term memory recurrent neural networks. Proc. Interspeech 2015, 1101-1105, doi: 10.21437/Interspeech.2015-290

@inproceedings{miao15_interspeech,
  author={Yajie Miao and Florian Metze},
  title={{On speaker adaptation of long short-term memory recurrent neural networks}},
  year=2015,
  booktitle={Proc. Interspeech 2015},
  pages={1101--1105},
  doi={10.21437/Interspeech.2015-290}
}