15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Adaptation of Deep Neural Network Acoustic Models Using Factorised I-Vectors

Penny Karanasou, Yongqiang Wang, Mark J. F. Gales, Philip C. Woodland

University of Cambridge, UK

The use of deep neural networks (DNNs) in a hybrid configuration is becoming increasingly popular and successful for speech recognition. One issue with these systems is how to efficiently adapt them to reflect an individual speaker or noise condition. Recently, speaker i-vectors have been successfully used as an additional input feature for unsupervised speaker adaptation. In this work, the use of i-vectors for adaptation is extended to incorporate acoustic factorisation. In particular, separate i-vectors are computed to represent the speaker and the acoustic environment. By ensuring “orthogonality” between the individual factor representations, it is possible to represent a wide range of speaker and environment pairs by simply combining the i-vectors of a particular speaker and a particular environment. In this paper the i-vectors are viewed as the weights of a cluster adaptive training (CAT) system, where the underlying models are GMMs rather than HMMs. This allows the factorisation approaches developed for CAT to be applied directly. Initial experiments were conducted on a noise-distorted version of the WSJ corpus. Compared to standard speaker-based i-vector adaptation, factorised i-vectors showed performance gains.
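The adaptation scheme described above treats the i-vectors as auxiliary input features: each acoustic frame is augmented with an utterance-level speaker i-vector and an environment i-vector before being passed to the DNN. The following is a minimal sketch of that input augmentation; the dimensions and function names are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

FEAT_DIM = 40   # assumed acoustic feature dimension (e.g. filterbank)
IVEC_DIM = 32   # assumed per-factor i-vector dimension

def augment_frames(frames, spk_ivec, env_ivec):
    """Append speaker and environment i-vectors to every frame.

    frames:   (T, FEAT_DIM) acoustic features for one utterance
    spk_ivec: (IVEC_DIM,) speaker i-vector
    env_ivec: (IVEC_DIM,) environment i-vector
    returns:  (T, FEAT_DIM + 2 * IVEC_DIM) DNN input
    """
    T = frames.shape[0]
    # The i-vectors are fixed per utterance, so tile them across all T frames.
    spk = np.tile(spk_ivec, (T, 1))
    env = np.tile(env_ivec, (T, 1))
    return np.concatenate([frames, spk, env], axis=1)

frames = np.random.randn(100, FEAT_DIM)
x = augment_frames(frames, np.zeros(IVEC_DIM), np.ones(IVEC_DIM))
print(x.shape)  # (100, 104)
```

Because the two factor representations are kept "orthogonal", a speaker i-vector estimated in one environment can be paired with a different environment i-vector at test time, covering speaker/environment combinations never seen jointly in training.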


Bibliographic reference.  Karanasou, Penny / Wang, Yongqiang / Gales, Mark J. F. / Woodland, Philip C. (2014): "Adaptation of deep neural network acoustic models using factorised i-vectors", In INTERSPEECH-2014, 2180-2184.