Speaker adaptation of deep neural network (DNN) based acoustic models remains a challenging area of research. Since long short-term memory (LSTM) recurrent neural networks (RNNs) have been successfully applied to many sequence prediction and sequence labeling tasks, we propose to use LSTM RNNs for modeling speaker variability in automatic speech recognition (ASR). First, LSTM RNNs are used to extract d-vectors (deep vectors), which are then concatenated with the raw features as input to the acoustic models. The speaker information provided by the d-vectors helps DNN based acoustic models learn speaker normalization during training. Furthermore, motivated by the idea that the speech content can also be useful for speaker recognition, we propose a new network called cross-LSTM, which consists of two LSTMs: one for classifying speakers and the other for classifying senones. As a result, speaker recognition and speech recognition are performed simultaneously. Experiments are conducted on a conversational telephone speech corpus. The results show that the proposed models are effective in alleviating speaker variability in ASR, and yield a 6% relative improvement for the LSTMP RNN based systems.
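The concatenation step described above can be illustrated with a minimal sketch. It assumes one fixed-length d-vector per utterance that is repeated on every frame; the function name `append_dvector`, the feature dimension of 40, and the d-vector dimension of 8 are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def append_dvector(frames: np.ndarray, dvector: np.ndarray) -> np.ndarray:
    """Concatenate one d-vector onto every frame of an utterance.

    frames:  (num_frames, feat_dim) raw acoustic features
    dvector: (dvec_dim,) speaker representation (assumed fixed per utterance)
    returns: (num_frames, feat_dim + dvec_dim) augmented features
    """
    if frames.ndim != 2 or dvector.ndim != 1:
        raise ValueError("expected 2-D frames and a 1-D d-vector")
    # Repeat the same speaker vector for each frame without copying data.
    tiled = np.broadcast_to(dvector, (frames.shape[0], dvector.shape[0]))
    return np.concatenate([frames, tiled], axis=1)


# Hypothetical sizes: 5 frames of 40-dim features, an 8-dim d-vector.
frames = np.zeros((5, 40), dtype=np.float32)
dvector = np.ones(8, dtype=np.float32)
augmented = append_dvector(frames, dvector)
```

Each row of `augmented` holds the original frame followed by the same speaker vector, so the acoustic model sees the speaker information at every time step.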
Bibliographic reference. Li, Xiangang / Wu, Xihong (2015): "Modeling speaker variability using long short-term memory networks for speech recognition", In INTERSPEECH-2015, 1086-1090.