EUROSPEECH '97

We refer to an environment e as some combination of speaker, handset, transmission channel and background noise condition, and regard any practical operating situation of a speech recognizer as a mixture of environments. A speech recognizer may be trained on multi-environment data. It may also need to adapt the trained acoustic models to new conditions. How to train an HMM with multi-environment data, and from what seed model to start an adaptation, are two questions of great importance. We propose a new approach to speech recognition which, for both training and adaptation, models phonetic variation and environment variation separately. The problem is formulated as a hidden Markov process, in which we assume: (1) speech x is generated by some canonical distributions (independent of environmental factors); (2) an unknown linear transformation W_e and a bias b_e, specific to environment e, are applied to x with probability P(e); (3) x cannot be observed; what we observe is the outcome of the transformation: o = W_e x + b_e. Under the maximum-likelihood (ML) criterion, by applying the EM algorithm and extending Baum's forward and backward variables and algorithm, we obtain a novel joint solution for the parameters of the canonical distributions, the transformations and the biases. For special cases, on a noisy telephone speech database, the new formulation is compared to the per-utterance cepstral mean normalization (CMN) technique and shows more than 20% improvement in word error rate.
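
The observation model o = W_e x + b_e and the per-utterance CMN baseline can be illustrated with a minimal sketch. This is not the paper's EM-based joint estimation: it assumes W_e and b_e are already known, uses synthetic Gaussian frames in place of real cepstral features, and all function names and dimensions are hypothetical choices made for illustration.

```python
import numpy as np

def apply_environment(x, W_e, b_e):
    """Map canonical frames x of shape (T, D) to observations o = W_e x + b_e."""
    return x @ W_e.T + b_e

def invert_environment(o, W_e, b_e):
    """Recover x from o when W_e and b_e are known: x = W_e^{-1} (o - b_e)."""
    return np.linalg.solve(W_e, (o - b_e).T).T

def cepstral_mean_normalization(o):
    """Per-utterance CMN: subtract the utterance mean from every frame."""
    return o - o.mean(axis=0, keepdims=True)

# Synthetic stand-in data (assumption: 200 frames of 3-dimensional features).
rng = np.random.default_rng(0)
T, D = 200, 3
x = rng.normal(size=(T, D))                      # canonical, unobserved speech
W_e = np.eye(D) + 0.1 * rng.normal(size=(D, D))  # hypothetical environment transform
b_e = rng.normal(size=D)                         # hypothetical environment bias

o = apply_environment(x, W_e, b_e)               # what the recognizer observes
x_hat = invert_environment(o, W_e, b_e)          # exact only with known W_e, b_e

# With W_e = I the environment reduces to a pure bias, which CMN removes.
bias_only = apply_environment(x, np.eye(D), b_e)
cmn_matches = np.allclose(cepstral_mean_normalization(bias_only),
                          cepstral_mean_normalization(x))
```

In this sketch, CMN undoes the environment only when W_e is the identity, which is one way to see why a model that also accounts for a linear transformation has more room to compensate for environment effects.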
Bibliographic reference. Gong, Yifan (1997): "Source normalization training for HMM applied to noisy telephone speech recognition", In EUROSPEECH-1997, 1555-1558.