Speaker Adaptation for Attention-Based End-to-End Speech Recognition

Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong

We propose three regularization-based speaker adaptation approaches that adapt the attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers for end-to-end automatic speech recognition. The first method is Kullback-Leibler divergence (KLD) regularization, in which the output distribution of a speaker-dependent (SD) AED is forced to stay close to that of the speaker-independent (SI) model by adding a KLD regularization term to the adaptation criterion. To compensate for the asymmetric deficiency of KLD regularization, the second method, adversarial speaker adaptation (ASA), regularizes the deep-feature distribution of the SD AED through adversarial learning between an auxiliary discriminator and the SD AED. The third approach is multi-task learning, in which an SD AED is trained to jointly perform the primary task of predicting a large number of output units and an auxiliary task of predicting a small number of output units, which alleviates the target sparsity issue. Evaluated on a Microsoft short message dictation task, all three methods are highly effective in adapting the AED model, achieving up to 12.2% and 3.0% word error rate improvement over an SI AED trained on 3400 hours of data for supervised and unsupervised adaptation, respectively.
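
To make the KLD regularization idea concrete, here is a minimal sketch of one common way such a criterion can be written: the cross-entropy of the SD model against the adaptation labels is interpolated with the KL divergence from the SI output distribution to the SD output distribution. The interpolation weight `rho`, the function names, and the per-token treatment of the logits are assumptions made for illustration; the abstract does not specify the exact form used by the authors.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kld_regularized_loss(sd_logits, si_logits, targets, rho=0.5):
    """Hypothetical KLD-regularized adaptation loss (illustrative only).

    sd_logits: (N, V) logits from the speaker-dependent model being adapted.
    si_logits: (N, V) logits from the frozen speaker-independent model.
    targets:   (N,) integer labels from the adaptation data.
    rho:       assumed interpolation weight in [0, 1]; rho=0 is plain
               cross-entropy adaptation, rho=1 only matches the SI model.
    """
    eps = 1e-12  # guards against log(0)
    p_sd = softmax(sd_logits)
    p_si = softmax(si_logits)
    n = len(targets)
    # Cross-entropy of the SD model against the adaptation labels.
    ce = -np.log(p_sd[np.arange(n), targets] + eps).mean()
    # KL(p_si || p_sd), averaged over tokens.
    kld = (p_si * (np.log(p_si + eps) - np.log(p_sd + eps))).sum(axis=-1).mean()
    return (1.0 - rho) * ce + rho * kld
```

With this sketch, a larger `rho` keeps the adapted model closer to the SI model, which is the intended guard against overfitting when the adaptation data is very limited.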

DOI: 10.21437/Interspeech.2019-3135

Cite as: Meng, Z., Gaur, Y., Li, J., Gong, Y. (2019) Speaker Adaptation for Attention-Based End-to-End Speech Recognition. Proc. Interspeech 2019, 241-245, DOI: 10.21437/Interspeech.2019-3135.

  author={Zhong Meng and Yashesh Gaur and Jinyu Li and Yifan Gong},
  title={{Speaker Adaptation for Attention-Based End-to-End Speech Recognition}},
  booktitle={Proc. Interspeech 2019},