In this paper, we propose word-level hidden Markov models (HMMs) to supplement state-of-the-art phone-based acoustic modeling and thereby enhance the performance of automatic speech recognition (ASR) systems. Each word in the vocabulary is initially modeled by well-trained triphone models. Maximum a posteriori (MAP) adaptation is then applied to generate models for words with a large number of occurrences in the training set, so that the acoustic distributions of these words can be modeled more precisely. Experimental results show that the proposed word-based systems outperform phone-based systems on the TIMIT task, which has a small training corpus. On tasks with plenty of training data, such as the WSJ task, word-based systems still show improvements over phone-based systems. Furthermore, the word-based systems have better discriminating ability on short words and homophones. They are also more robust to language model weight variation than conventional phone-based systems.
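The adaptation step summarized above can be illustrated with a minimal sketch. It assumes the standard maximum a posteriori update for a single Gaussian mean, in which the seed mean (here, the one inherited from the triphone-initialized word model) is interpolated with the occupancy-weighted data average. The function names, the prior weight `tau`, and the occurrence threshold `min_count` are hypothetical choices for illustration; the abstract does not state the values or the exact update used by the authors.

```python
def select_frequent_words(word_counts, min_count=100):
    """Return the words that occur often enough to get a dedicated word model.

    min_count is a hypothetical threshold; the abstract only says
    "a large number of occurrences".
    """
    return {word for word, count in word_counts.items() if count >= min_count}


def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP update of one Gaussian mean (standard textbook form, assumed here).

    prior_mean: mean vector of the seed model (e.g. copied from a triphone state)
    frames:     feature vectors aligned to this Gaussian
    posteriors: occupation probability of each frame for this Gaussian
    tau:        prior weight (hypothetical default); larger values trust the seed more
    """
    dim = len(prior_mean)
    occupancy = sum(posteriors)
    weighted_sum = [0.0] * dim
    for frame, gamma in zip(frames, posteriors):
        for d in range(dim):
            weighted_sum[d] += gamma * frame[d]
    # Interpolate between the seed mean and the data, weighted by tau and occupancy.
    return [
        (tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
        for d in range(dim)
    ]
```

With few aligned frames the adapted mean stays close to the triphone-derived seed, and with many frames it moves toward the word-specific data average, which is consistent with restricting adaptation to frequently occurring words.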
Index Terms: word-level HMM, automatic speech recognition, detection-based ASR, language model weight, homophone
Bibliographic reference. Chen, I-Fan / Lee, Chin-Hui (2012): "A study on using word-level HMMs to improve ASR performance over state-of-the-art phone-level acoustic modeling for LVCSR", In INTERSPEECH-2012, 1788-1791.