Context-dependent deep neural networks (DNNs) have obtained consistent and significant improvements over Gaussian Mixture Model (GMM) based systems on various speech recognition tasks. However, because a DNN is discriminatively trained, it is more sensitive to label errors and is not reliable for unsupervised adaptation. Moreover, DNN parameters do not have a clear and meaningful interpretation, so it has been difficult to develop effective adaptation techniques for DNNs. Nevertheless, unadapted multi-style trained DNNs have already shown superior performance to the GMM system with joint noise/speaker adaptation and adaptive training. Recently, Temporally Varying Weight Regression (TVWR) has been successfully applied to combine DNN and GMM for robust unsupervised speaker adaptation. In this paper, joint speaker/noise adaptation and adaptive training of TVWR using DNN posteriors are investigated for robust speech recognition. Experimental results on the Aurora 4 corpus show that, after joint adaptation and adaptive training, TVWR achieves 21.3% and 11.6% relative improvements over the DNN baseline system and the best system currently reported in the literature, respectively.
Bibliographic reference. Liu, Shilin / Sim, Khe Chai (2014): "Joint adaptation and adaptive training of TVWR for robust automatic speech recognition", In INTERSPEECH-2014, 636-640.