We investigate speaker adaptation of DNN acoustic models in two settings: traditional unsupervised adaptation, and supervised adaptation (SuA), in which a few minutes of transcribed speech are available. SuA presents additional difficulties when a test speaker's adaptation information does not match the registered speaker's information. Employing feature-space maximum likelihood linear regression (fMLLR) transformed features as side-information to the DNN, we reintroduce some classical ideas for combining adapted and unadapted features: early and late fusion methods, as well as the estimation of the fMLLR transforms using simple target models (STM). Results show that early fusion helps DNNs generalize better when features are combined after a non-linear bottleneck layer, while late fusion improves robustness, specifically in mismatched cases. STMs give consistent improvements in both settings.
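For readers unfamiliar with the terms, the following is a minimal generic sketch of the three ingredients named above, not the paper's implementation. It assumes that fMLLR acts as a per-speaker affine transform on each feature frame, that early fusion means combining the adapted and unadapted streams before classification (shown here as plain concatenation, without the bottleneck layer the abstract mentions), and that late fusion means combining the outputs of two models (shown here as log-linear interpolation of posteriors). All function names, the `weight` parameter, and the choice of interpolation are hypothetical and chosen for illustration only.

```python
import math


def apply_fmllr(frame, A, b):
    """Apply an affine transform x' = A x + b to one feature frame.

    A (list of rows) and b are assumed to be already estimated per speaker;
    estimation itself is not shown here.
    """
    return [sum(a * x for a, x in zip(row, frame)) + b_i
            for row, b_i in zip(A, b)]


def early_fusion(unadapted, adapted):
    """Illustrative early fusion: concatenate both feature streams."""
    return list(unadapted) + list(adapted)


def late_fusion(post_a, post_b, weight=0.5):
    """Illustrative late fusion: log-linear interpolation of two posterior
    vectors, renormalized to sum to one. Assumes strictly positive inputs."""
    logs = [weight * math.log(p) + (1.0 - weight) * math.log(q)
            for p, q in zip(post_a, post_b)]
    norm = sum(math.exp(v) for v in logs)
    return [math.exp(v) / norm for v in logs]


# Toy usage with made-up numbers (not data from the paper).
frame = [1.0, 2.0]
adapted = apply_fmllr(frame, A=[[2.0, 0.0], [0.0, 0.5]], b=[0.5, -1.0])
fused_input = early_fusion(frame, adapted)
fused_posterior = late_fusion([0.7, 0.3], [0.4, 0.6])
```

In this sketch, early fusion changes what a single network sees as input, whereas late fusion leaves two models untouched and merges only their outputs.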
Bibliographic reference. Parthasarathi, Sree Hari Krishnan / Hoffmeister, Bjorn / Matsoukas, Spyros / Mandal, Arindam / Strom, Nikko / Garimella, Sri (2015): "fMLLR based feature-space speaker adaptation of DNN acoustic models", In INTERSPEECH-2015, 3630-3634.