Deep neural networks (DNNs) are currently very successful for acoustic modeling in ASR systems. One of the main challenges with DNNs is unsupervised speaker adaptation from an initial speaker clustering, because DNNs have a very large number of parameters. Recently, a method has been proposed to adapt DNNs to speakers by combining speaker-specific information (in the form of i-vectors computed at the speaker-cluster level) with fMLLR-transformed acoustic features. In this paper we try to gain insight into what kind of adaptation is performed on DNNs when stacking i-vectors with acoustic features, and what information exactly is carried by i-vectors. We observe on the REPERE corpus that DNNs trained on i-vector features concatenated with fMLLR-transformed acoustic features lead to a gain of 0.7 points. The experiments show that i-vector stacking in DNN acoustic models performs not only speaker adaptation, but also adaptation to acoustic conditions.
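As an illustration of the stacking described in the abstract, the following minimal sketch appends one cluster-level i-vector to every acoustic frame of an utterance before the frames are passed to a DNN. It is not the authors' implementation: the function names (`stack_ivector`, `build_dnn_inputs`), the toy dimensions (3-dimensional frames, 2-dimensional i-vectors), and the dictionary layout are assumptions chosen for readability, and the fMLLR transform and i-vector extraction are assumed to have been computed beforehand.

```python
from typing import Dict, List, Sequence


def stack_ivector(frames: Sequence[Sequence[float]],
                  ivector: Sequence[float]) -> List[List[float]]:
    """Append the same i-vector to every frame of one utterance."""
    ivec = list(ivector)
    return [list(frame) + ivec for frame in frames]


def build_dnn_inputs(utterances: Dict[str, List[List[float]]],
                     utt_to_cluster: Dict[str, str],
                     cluster_ivectors: Dict[str, List[float]]
                     ) -> Dict[str, List[List[float]]]:
    """Stack each utterance's frames with the i-vector of its speaker cluster."""
    stacked = {}
    for utt_id, frames in utterances.items():
        cluster = utt_to_cluster[utt_id]  # hypothetical speaker-cluster label
        stacked[utt_id] = stack_ivector(frames, cluster_ivectors[cluster])
    return stacked


# Toy data (assumed): frames stand in for fMLLR-transformed features.
utterances = {
    "utt1": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "utt2": [[0.7, 0.8, 0.9]],
}
utt_to_cluster = {"utt1": "spkA", "utt2": "spkB"}
cluster_ivectors = {"spkA": [1.0, -1.0], "spkB": [0.5, 0.5]}

inputs = build_dnn_inputs(utterances, utt_to_cluster, cluster_ivectors)
```

Because the i-vector is constant across all frames of a cluster, it cannot carry frame-level phonetic detail; it can only describe properties shared by the whole cluster, which may include both speaker identity and recording conditions.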
Bibliographic reference. Rouvier, Mickael / Favre, Benoit (2014): "Speaker adaptation of DNN-based ASR with i-vectors: does it actually adapt models to speakers?", In INTERSPEECH-2014, 3007-3011.