We propose an ensemble modeling framework to jointly characterize speaker and speaking environments for robust speech recognition. We represent a particular environment by a super-vector formed by concatenating the entire set of mean vectors of the Gaussian mixture components in its corresponding hidden Markov model set. In the training phase we generate an ensemble speaker and speaking environment super-vector by concatenating all the super-vectors trained on data from many real or simulated environments. In the recognition phase the ensemble speaker and speaking environment super-vector is converted to the super-vector for the testing environment with an affine transformation that is estimated online with a maximum likelihood (ML) algorithm. We used a simplified formulation for the proposed approach and evaluated its performance on the Aurora 2 database. In an unsupervised adaptation mode, the proposed approach achieves 7.27% and 13.68% WER reductions, respectively, when tested in clean and averaged noisy conditions (from 0dB to 20dB) over the baseline performance on a gender dependent system. The results suggest that the proposed approach can well characterize environments under the presence of either single or multiple distortion sources.
Bibliographic reference. Tsao, Yu / Lee, Chin-Hui (2007): "An ensemble modeling approach to joint characterization of speaker and speaking environments", In INTERSPEECH-2007, 1050-1053.