5th International Conference on Spoken Language Processing
Combining knowledge derived from both syllable- (100-250 ms) and phone-length (40-100 ms) intervals in the automatic speech recognition process can yield performance superior to that obtained using information derived from a single time scale alone. The results are particularly pronounced for reverberant test conditions that have not been incorporated into the training set. In the present study, phone- and syllable-based systems are combined at three distinct levels of the recognition process: the frame, the syllable, and the entire utterance. Each strategy successfully integrates the complementary strengths of the individual systems, yielding a significant improvement in accuracy on a small-vocabulary, naturally spoken, telephone speech corpus. The syllable-level combination outperformed the other two methods under both relatively pristine and moderately reverberant acoustic conditions, yielding a 20-40% relative improvement over the baseline.
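For readers unfamiliar with the term, a relative improvement is conventionally computed as the fraction of the baseline error that the new system removes. The sketch below illustrates that arithmetic only. The function name and the error rates are hypothetical and are not taken from the paper, and it assumes the abstract's figure refers to a reduction in an error rate such as word error rate.

```python
def relative_improvement(baseline_error: float, combined_error: float) -> float:
    """Return the fraction of the baseline error removed by the combined system."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - combined_error) / baseline_error


# Hypothetical error rates in percent, chosen only to illustrate the arithmetic;
# they are NOT results reported in the paper.
baseline_wer = 10.0
combined_wer = 7.0

print(f"{relative_improvement(baseline_wer, combined_wer):.0%}")  # prints "30%"
```

Under this definition, a 20-40% relative improvement means the combined system removes between one fifth and two fifths of the baseline system's errors, regardless of the absolute error level.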
Bibliographic reference. Wu, Su-Lin / Kingsbury, Brian E. D. / Morgan, Nelson / Greenberg, Steven (1998): "Performance improvements through combining phone- and syllable-scale information in automatic speech recognition", In ICSLP-1998, paper 0854.