INTERSPEECH 2006 - ICSLP
This paper presents a novel acoustic modeling framework that naturally extends the Hidden Markov Model (HMM) approach. The proposed models reduce errors caused by speaker variability by means of local spectral mismatch reduction. A more complex and flexible speech production scheme can be assumed, in which local temporal and frequency elastic deformations of the speech are captured by the model. In the new framework, the states of a standard HMM, which are usually associated with temporal transitions, are expanded to provide the model with a new degree of freedom, so that an optimal frequency warping factor can be estimated at the same time as the decoder finds the best state sequence. In these local spectral warping models, the states become time-frequency states, and the number of parameters remains comparable to that of a standard HMM because, as will be shown, the states share a certain amount of parameters. The novel models are evaluated on the noise-free TIDIGITS corpus, which includes connected digits uttered by male, female, and child speakers. Under speaker group (age-gender) mismatch conditions, local frequency warping reduced the Word Error Rate (WER) by 70% on average using the initial models; under matched speaker group conditions, the error was reduced by 9.7% on average after reestimating the models.
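To make the decoding idea concrete, the following is a minimal sketch of a Viterbi search over states augmented with a discrete frequency warping index, so that the best warping factor is chosen jointly with the best state sequence. All names are illustrative assumptions, and the smoothness penalty on warping changes between frames is a simplification, not the paper's actual formulation.

```python
import numpy as np

def viterbi_warped(log_emit, log_trans, warp_penalty=1.0):
    """Joint Viterbi over (state, warp-factor) pairs.

    log_emit : (T, S, W) log-likelihood of frame t under state s
               after applying discrete warping factor w (assumed given).
    log_trans: (S, S) log transition matrix of the underlying HMM.
    The warp index may change by at most one step between consecutive
    frames, with an illustrative penalty per step of change.
    Returns the best path as a list of (state, warp) pairs.
    """
    T, S, W = log_emit.shape
    delta = np.full((T, S, W), -np.inf)
    back = np.zeros((T, S, W, 2), dtype=int)
    delta[0] = log_emit[0]  # uniform initial distribution, folded out
    for t in range(1, T):
        for s in range(S):
            for w in range(W):
                best, arg = -np.inf, (0, 0)
                for sp in range(S):
                    # local warping constraint: |w - wp| <= 1
                    for wp in range(max(0, w - 1), min(W, w + 2)):
                        score = (delta[t - 1, sp, wp] + log_trans[sp, s]
                                 - warp_penalty * abs(w - wp))
                        if score > best:
                            best, arg = score, (sp, wp)
                delta[t, s, w] = best + log_emit[t, s, w]
                back[t, s, w] = arg
    # backtrack from the best final (state, warp) pair
    s, w = np.unravel_index(np.argmax(delta[-1]), (S, W))
    path = [(int(s), int(w))]
    for t in range(T - 1, 0, -1):
        s, w = back[t, s, w]
        path.append((int(s), int(w)))
    return path[::-1]
```

In this toy form the search space grows by a factor of W, but, as the abstract notes, parameter sharing across warping factors keeps the model size comparable to a standard HMM; only the decoding lattice is enlarged.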
Bibliographic reference. Miguel, Antonio / Lleida, Eduardo / Juan, Alfons / Buera, Luis / Ortega, Alfonso / Saz, Oscar (2006): "Local transformation models for speech recognition", In INTERSPEECH-2006, paper 1275-Wed1BuP.13.