8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003


Non-Linear Maximum Likelihood Feature Transformation for Speech Recognition

Mohamed Kamal Omar, Mark Hasegawa-Johnson

University of Illinois at Urbana-Champaign, USA

Most automatic speech recognition (ASR) systems use a hidden Markov model (HMM) with a diagonal-covariance Gaussian mixture model for the state-conditional probability density function. The diagonal-covariance Gaussian mixture can model discrete sources of variability such as speaker variations, gender variations, or local dialect, but cannot model continuous types of variability that account for correlation between the elements of the feature vector. In this paper, we present a transformation of the acoustic feature vector that minimizes an empirical estimate of the relative entropy between the likelihood based on the diagonal-covariance Gaussian mixture HMM and the true likelihood. Based on this formulation, we provide a solution to the problem using volume-preserving maps; existing linear feature transform designs are shown to be special cases of the proposed solution. Since most of the acoustic features used in ASR are not linear functions of the sources of correlation in the speech signal, we use a non-linear transformation of the features to minimize this objective function. We describe an iterative algorithm that estimates the parameters of both the volume-preserving feature transformation and the HMM so that they jointly optimize the objective function for an HMM-based speech recognizer. Using this algorithm, we achieved a 2% improvement in phoneme recognition accuracy compared to the baseline system. Our approach also improves recognition accuracy compared to previous linear approaches such as linear discriminant analysis (LDA), maximum likelihood linear transform (MLLT), and independent component analysis (ICA).
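To make the notion of a volume-preserving map concrete, the sketch below builds a simple non-linear map whose Jacobian determinant is 1 everywhere, so the change-of-variables formula adds no Jacobian term to the log-likelihood of the transformed features. It is only an illustration: the abstract does not specify the parameterization used in the paper, and the function names, the 2-D shear form, and the values of w and b are all assumptions made here.

```python
import math

def volume_preserving_map(x1, x2, w=0.8, b=-0.3):
    """Hypothetical 2-D non-linear shear: y1 = x1, y2 = x2 + tanh(w*x1 + b).

    The Jacobian is lower triangular with ones on the diagonal,
    so its determinant is 1 for every input (volume is preserved).
    The parameters w and b are illustrative, not taken from the paper.
    """
    return x1, x2 + math.tanh(w * x1 + b)

def jacobian_determinant(f, x1, x2, eps=1e-5):
    """Central-difference estimate of det(dy/dx) for a 2-D map f."""
    y_px1, y_mx1 = f(x1 + eps, x2), f(x1 - eps, x2)
    y_px2, y_mx2 = f(x1, x2 + eps), f(x1, x2 - eps)
    d11 = (y_px1[0] - y_mx1[0]) / (2 * eps)  # dy1/dx1
    d21 = (y_px1[1] - y_mx1[1]) / (2 * eps)  # dy2/dx1
    d12 = (y_px2[0] - y_mx2[0]) / (2 * eps)  # dy1/dx2
    d22 = (y_px2[1] - y_mx2[1]) / (2 * eps)  # dy2/dx2
    return d11 * d22 - d12 * d21

# Check numerically that the determinant stays at 1 at an arbitrary point.
det_J = jacobian_determinant(volume_preserving_map, 0.5, -1.2)
print(abs(det_J - 1.0) < 1e-6)
```

A linear transform with unit determinant is volume-preserving in the same sense, which is consistent with the abstract's remark that linear designs arise as special cases.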


Bibliographic reference.  Omar, Mohamed Kamal / Hasegawa-Johnson, Mark (2003): "Non-linear maximum likelihood feature transformation for speech recognition", In EUROSPEECH-2003, 2497-2500.