EUROSPEECH 2003 / INTERSPEECH 2003

Most automatic speech recognition (ASR) systems use hidden Markov models (HMMs) with a diagonal-covariance Gaussian mixture model for the state-conditional probability density function. The diagonal-covariance Gaussian mixture can model discrete sources of variability like speaker variations, gender variations, or local dialect, but cannot model continuous types of variability that account for correlation between the elements of the feature vector. In this paper, we present a transformation of the acoustic feature vector that minimizes an empirical estimate of the relative entropy between the likelihood based on the diagonal-covariance Gaussian mixture HMM and the true likelihood. Based on this formulation, we provide a solution to the problem using volume-preserving maps; existing linear feature transform designs are shown to be special cases of the proposed solution. Since most of the acoustic features used in ASR are not linear functions of the sources of correlation in the speech signal, we use a nonlinear transformation of the features to minimize this objective function. We describe an iterative algorithm that jointly estimates the parameters of the volume-preserving feature transformation and the HMM to optimize the objective function for an HMM-based speech recognizer. Using this algorithm, we achieve a 2% improvement in phoneme recognition accuracy over the baseline system. Our approach also improves recognition accuracy compared to previous linear approaches such as linear discriminant analysis (LDA), the maximum likelihood linear transform (MLLT), and independent component analysis (ICA).
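The key property exploited by volume-preserving maps is that their Jacobian determinant is identically 1, so the log-likelihood of the transformed features under a diagonal-covariance Gaussian mixture needs no Jacobian correction term. The sketch below illustrates this with an additive coupling update (a generic volume-preserving nonlinear map, not the paper's exact parameterization; the function names and the tanh nonlinearity are assumptions for illustration) and verifies the unit-determinant property numerically.

```python
import numpy as np

def volume_preserving_map(x, W, b):
    """Additive coupling: y1 = x1, y2 = x2 + tanh(x1 @ W + b).

    The Jacobian is unit triangular, so det(J) = 1 exactly and the
    map is volume-preserving.  (Illustrative choice, not the paper's
    exact transform.)
    """
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    y2 = x2 + np.tanh(x1 @ W + b)
    return np.concatenate([x1, y2], axis=-1)

def diag_gmm_loglik(y, weights, means, variances):
    """Log-likelihood of y under a diagonal-covariance Gaussian mixture.

    y: (N, D); weights: (K,); means, variances: (K, D).
    Because the feature map above has unit Jacobian determinant,
    this can be evaluated directly on the transformed features.
    """
    diff = y[:, None, :] - means[None, :, :]                      # (N, K, D)
    log_comp = -0.5 * np.sum(diff**2 / variances
                             + np.log(2 * np.pi * variances), axis=-1)
    log_comp += np.log(weights)                                   # (N, K)
    m = log_comp.max(axis=-1, keepdims=True)                      # log-sum-exp
    return m.squeeze(-1) + np.log(np.exp(log_comp - m).sum(axis=-1))

rng = np.random.default_rng(0)
D, K, N = 4, 2, 5
x = rng.normal(size=(N, D))
W = rng.normal(scale=0.1, size=(D // 2, D // 2))
b = np.zeros(D // 2)
y = volume_preserving_map(x, W, b)

# Numerical check of the unit Jacobian determinant via central differences.
eps = 1e-6
J = np.zeros((D, D))
for j in range(D):
    e = np.zeros(D)
    e[j] = eps
    J[:, j] = (volume_preserving_map(x[0] + e, W, b)
               - volume_preserving_map(x[0] - e, W, b)) / (2 * eps)
print(round(abs(np.linalg.det(J)), 6))  # -> 1.0

# Likelihood of the transformed features under an (arbitrary) diagonal GMM.
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
variances = np.ones((K, D))
ll = diag_gmm_loglik(y, weights, means, variances)
```

Any invertible transform with unit Jacobian determinant would serve the same role; the paper's contribution is choosing and training such a map (jointly with the HMM) to minimize the relative-entropy objective.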
Bibliographic reference. Omar, Mohamed Kamal / Hasegawa-Johnson, Mark (2003): "Non-linear maximum likelihood feature transformation for speech recognition", in EUROSPEECH-2003, 2497-2500.