16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

On Speaker Adaptation of Long Short-Term Memory Recurrent Neural Networks

Yajie Miao, Florian Metze

Carnegie Mellon University, USA

Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that specializes in modeling long-range temporal dynamics. On acoustic modeling tasks, LSTM-RNNs have shown better performance than DNNs and conventional RNNs. In this paper, we conduct an extensive study on speaker adaptation of LSTM-RNNs. Speaker adaptation helps to reduce the mismatch between acoustic models and testing speakers. We have two main goals for this study. First, on a benchmark dataset, the existing DNN adaptation techniques are evaluated on the adaptation of LSTM-RNNs. We observe that LSTM-RNNs can be effectively adapted by using a speaker-adaptive (SA) front-end, or by inserting speaker-dependent (SD) layers. Second, we propose two adaptation approaches that implement the SD-layer-insertion idea specifically for LSTM-RNNs. Using these approaches, speaker adaptation improves word error rates by 3-4% relative over a strong LSTM-RNN baseline. This improvement increases to 6-7% relative when we additionally exploit SA features for adaptation.
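As an illustration of the SD-layer-insertion idea the abstract mentions, the sketch below shows a speaker-dependent linear layer placed in front of a fixed speaker-independent acoustic model. The class name, identity initialization, and per-speaker dictionary are assumptions for illustration, not the paper's exact method; the key property is that the SD layer starts as the identity, so the unadapted model is recovered before any speaker-specific training.

```python
import numpy as np

class SpeakerDependentLayer:
    """Hypothetical per-speaker linear layer inserted before a fixed
    speaker-independent LSTM-RNN acoustic model (model not shown).

    W and b are initialized to the identity transform, so the layer is
    a no-op until adaptation data updates it for a given speaker.
    """
    def __init__(self, dim):
        self.W = np.eye(dim)      # per-speaker weight matrix
        self.b = np.zeros(dim)    # per-speaker bias

    def forward(self, x):
        # x: (frames, dim) matrix of acoustic features
        return x @ self.W + self.b

# One SD layer per test speaker (hypothetical bookkeeping).
dim = 4
sd_layers = {"spk1": SpeakerDependentLayer(dim)}

feats = np.random.randn(10, dim)            # 10 frames of features
adapted = sd_layers["spk1"].forward(feats)  # pass through SD layer

# Identity initialization: output equals input before adaptation.
assert np.allclose(adapted, feats)
```

In an actual adaptation setup, only the SD layer's parameters would be re-estimated on a speaker's adaptation data while the shared LSTM-RNN stays frozen, keeping the per-speaker footprint small.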


Bibliographic reference. Miao, Yajie / Metze, Florian (2015): "On speaker adaptation of long short-term memory recurrent neural networks", In INTERSPEECH-2015, 1101-1105.