The use of deep neural networks (DNNs) in a hybrid configuration is becoming increasingly popular and successful for speech recognition. One issue with these systems is how to efficiently adapt them to reflect an individual speaker or noise condition. Recently, speaker i-vectors have been used successfully as an additional input feature for unsupervised speaker adaptation. In this work, the use of i-vectors for adaptation is extended to incorporate acoustic factorisation. In particular, separate i-vectors are computed to represent the speaker and the acoustic environment. By ensuring orthogonality between the individual factor representations, it is possible to represent a wide range of speaker and environment pairs by simply combining the i-vector of a particular speaker with that of a particular environment. In this paper, the i-vectors are viewed as the weights of a cluster adaptive training (CAT) system in which the underlying models are GMMs rather than HMMs. This allows the factorisation approaches developed for CAT to be applied directly. Initial experiments were conducted on a noise-distorted version of the WSJ corpus. Compared to standard speaker-based i-vector adaptation, the factorised i-vectors yielded performance gains.
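As a minimal illustrative sketch (not the paper's implementation), the way separate speaker and environment i-vectors can be combined and appended to frame-level acoustic features as an auxiliary DNN input might look as follows; all function names and dimensionalities here are assumptions for illustration only:

```python
import numpy as np

def augment_features(acoustic_feats, speaker_ivec, env_ivec):
    """Append utterance-level speaker and environment i-vectors to every
    acoustic frame, forming an auxiliary DNN input (hypothetical sketch)."""
    n_frames = acoustic_feats.shape[0]
    # Concatenate the two factor representations into one auxiliary vector
    aux = np.concatenate([speaker_ivec, env_ivec])
    # Tile the utterance-level auxiliary vector across all frames
    aux_tiled = np.tile(aux, (n_frames, 1))
    # Stack frame features with the tiled i-vectors column-wise
    return np.hstack([acoustic_feats, aux_tiled])

# Illustrative dimensions: 100 frames of 40-dim features, two 32-dim i-vectors
feats = np.random.randn(100, 40)
spk = np.random.randn(32)   # speaker i-vector
env = np.random.randn(32)   # environment i-vector
augmented = augment_features(feats, spk, env)
print(augmented.shape)  # (100, 104)
```

Because the two factors are represented separately, an unseen speaker/environment pairing can be handled by swapping in the appropriate i-vector for each factor without re-estimating the other.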
Bibliographic reference. Karanasou, Penny / Wang, Yongqiang / Gales, Mark J. F. / Woodland, Philip C. (2014): "Adaptation of deep neural network acoustic models using factorised i-vectors", in Proc. INTERSPEECH 2014, pp. 2180-2184.