5th International Conference on Spoken Language Processing
Most current state-of-the-art large-vocabulary continuous speech recognition (LVCSR) systems are based on state-clustered hidden Markov models (HMMs). Typical systems use thousands of state clusters, each represented by a Gaussian mixture model with a few tens of Gaussians. In this paper, we show that models with far more parameter tying, such as phonetically tied mixture (PTM) models, give better performance in both recognition accuracy and speed. In particular, using a PTM system with 38 phone-class state clusters, we achieved a 5% to 10% reduction in word error rate over a state-clustered system with 937 state clusters on three different Wall Street Journal (WSJ) test sets, while halving the number of Gaussian distance computations. For both systems, the total number of Gaussians was fixed at about 30,000. This result is of real practical significance: a conceptually simpler PTM system can be both faster and more accurate than current state-of-the-art state-clustered HMM systems.
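The tying idea can be illustrated with a small sketch (not the paper's code; all numbers and names here are toy values). In a PTM system, every HMM state of a phone class scores its observations against one shared pool of Gaussians and keeps only its own mixture weights, so the Gaussian densities can be computed once and reused across all states of the class; in a state-clustered system, each state cluster owns a private set of Gaussians, so no such reuse is possible:

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of x under a Gaussian mixture (via log-sum-exp)."""
    logs = [math.log(w) + log_gaussian(x, m, v)
            for w, m, v in zip(weights, means, variances) if w > 0]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

# PTM-style tying: one Gaussian pool shared by all states of a phone class...
shared_means = [0.0, 1.0, 2.0, 3.0]
shared_vars = [1.0, 1.0, 1.0, 1.0]
# ...and only the mixture weights differ from state to state.
state_a_weights = [0.7, 0.2, 0.1, 0.0]
state_b_weights = [0.0, 0.1, 0.2, 0.7]

x = 0.5  # a single 1-D observation

# Because the Gaussians are shared, their densities are evaluated once and
# cached; every state of the class then needs only a cheap weighted sum.
# This reuse is the kind of saving behind the abstract's halved count of
# Gaussian distance computations.
cached = [log_gaussian(x, m, v) for m, v in zip(shared_means, shared_vars)]

def gmm_loglik_cached(weights, cached_logs):
    """Same mixture log-likelihood, reusing precomputed Gaussian densities."""
    logs = [math.log(w) + g for w, g in zip(weights, cached_logs) if w > 0]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

ll_a = gmm_loglik_cached(state_a_weights, cached)
ll_b = gmm_loglik_cached(state_b_weights, cached)
```

The cached evaluation is numerically identical to scoring each state's mixture from scratch; the difference is purely in how much Gaussian computation is shared, which is where a PTM system with few large shared pools gains speed over many small private ones.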
Bibliographic reference. Sankar, Ananth (1998): "A new look at HMM parameter tying for large vocabulary speech recognition", In ICSLP-1998, paper 0193.