9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Incorporating Acoustical Modelling of Phone Transitions in an Hybrid ANN/HMM Speech Recognizer

Alberto Abad, João Neto

INESC-ID/IST, Portugal

Speech recognition based on connectionist approaches is one of the most successful alternatives to the widespread Gaussian systems. One of the main criticisms of hybrid recognizers is the increased complexity of context-dependent phone modelling, which is a key aspect of medium- to large-vocabulary tasks. In this paper, a baseline hybrid system based on monophone recognition units is improved by incorporating acoustical modelling of phone transitions. First, the single-state monophone model is extended to multiple-state sub-phoneme modelling. Then, a reduced set of diphone recognition units is incorporated to model phone transitions. The proposed approach yields relative word error rate reductions of 26.8% and 23.8% compared to the baseline hybrid system on two selected WSJ evaluation test sets. Additionally, improved performance compared to a reference Gaussian system based on word-internal context-dependent triphones, and results comparable to a cross-word triphone system, are reported.

