Speaker adaptation of deep neural networks (DNN) is difficult, and most commonly performed by changes to the input of the DNNs. Here we propose to learn discriminative feature transformations to obtain speaker normalised bottleneck (BN) features. This is achieved by interpreting the final two hidden layers as speaker specific matrix transformations. The hidden layer weights are updated with data from a specific speaker to learn speaker-dependent discriminative feature transformations. Such simple implementation lends itself to rapid adaptation and flexibility to be used in Speaker Adaptive Training (SAT) frameworks. The performance of this approach is evaluated on a meeting recognition task, using the official NIST RT'07 and RT'09 evaluation test sets. Supervised adaptation of the BN layer shows similar performance to the application of supervised CMLLR as a global transformation, and the combination of these appears to be additive. In unsupervised mode, CMLLR adaptation only yields 3.4% and 2.5% relative word error rate (WER) improvement, on the RT'07 and RT'09 respectively, where the baselines include speaker based cepstral mean and variance normalisation. The combined CMLLR and BN layer speaker adaptation yields a relative WER gain of 4.5% and 4.2% respectively. SAT style BN layer adaptation is attempted and combined with conventional CMLLR SAT, to show that it provides a relative gain of 1.43% and 2.02% on the RT'07 and RT'09 data sets respectively when compared with CMLLR SAT. While the overall gain from BN layer adaptation is small, the results are found to be statistically significant on both the test sets.
Bibliographic reference. Doddipatla, Rama / Hasan, Madina / Hain, Thomas (2014): "Speaker dependent bottleneck layer training for speaker adaptation in automatic speech recognition", In INTERSPEECH-2014, 2199-2203.