In this paper, we study methods to enhance the precision of the online estimation process of a recently proposed approach, ensemble speaker and speaking environment modeling (ESSEM), and thereby improve its overall performance. The ESSEM approach consists of two phases, offline and online. In the offline phase, an ensemble environment configuration is prepared from a large collection of acoustic models, where each set of acoustic models represents a particular environment. In the online phase, we use speech data from the testing condition to estimate a mapping function, which then generates a new set of acoustic models for that particular testing condition. In our previous study, we discussed the issues of the offline process and proposed algorithms to refine the environment configuration. In this paper, we first study different online mapping structures and compare their performance on the same environment configuration. Next, we propose a multiple clustering matching algorithm to further improve the overall performance of ESSEM. We tested ESSEM and its extensions on the full evaluation set of the Aurora2 connected digit recognition task. With our best offline environment configuration and a properly specified online estimation method, the ESSEM approach achieves an average word error rate (WER) of 4.77%, a relative WER reduction of 13.43% (from 5.51% to 4.77%) over the baseline result.
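The 13.43% figure is a relative reduction, not an absolute difference in WER points. A minimal sketch of that arithmetic follows; the function name `relative_wer_reduction` is a hypothetical helper introduced here for illustration and is not part of the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates given in percent.
    (Hypothetical helper for illustration; not from the paper.)
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Values reported in the abstract: baseline 5.51% WER, ESSEM 4.77% WER.
reduction = relative_wer_reduction(5.51, 4.77)
print(f"{reduction:.2f}%")  # prints "13.43%"
```

The absolute improvement is 0.74 WER points; dividing by the 5.51% baseline gives the relative figure quoted above.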
Bibliographic reference. Tsao, Yu / Lee, Chin-Hui (2008): "Improving the ensemble speaker and speaking environment modeling approach by enhancing the precision of the online estimation process", In INTERSPEECH-2008, 1265-1268.