Sixth International Conference on Spoken Language Processing
This paper proposes a syllable context dependent model for spontaneous speech recognition. Because spontaneous speech is strongly affected by coarticulation, it is generally assumed that an acoustic model covering a longer range of phonemic context is required to achieve high recognition accuracy. This motivated the authors to investigate a tri-syllable model, which distinguishes each syllable by its preceding and succeeding syllables. Since a Japanese syllable consists of either a single vowel or a consonant-vowel combination, a tri-syllable model always takes into account the preceding and succeeding vowels, which are the primary factors in coarticulation. A tri-syllable model can therefore represent coarticulation efficiently. The tri-syllable model was trained on spontaneous speech, and its effectiveness was then evaluated on continuous syllable recognition and on continuous word recognition based on a statistical language model. Compared with a regular triphone model without state sharing, syllable accuracy in continuous syllable recognition improved from 64.9% to 66.3%, and word accuracy in statistical language model-based continuous word recognition improved from 88.4% to 89.2%.
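The difference between triphone and tri-syllable context can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `to_context_units`, the `left-unit+right` label notation (borrowed from common triphone labelling conventions), the boundary symbol `sil`, and the example word are all assumptions made for illustration.

```python
from typing import List

BOUNDARY = "sil"  # assumed symbol for the utterance boundary


def to_context_units(units: List[str], boundary: str = BOUNDARY) -> List[str]:
    """Label each unit with its left and right neighbours as 'left-unit+right'."""
    padded = [boundary] + list(units) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Hypothetical example word "sakura", segmented two ways.
phonemes = ["s", "a", "k", "u", "r", "a"]
syllables = ["sa", "ku", "ra"]

triphones = to_context_units(phonemes)
tri_syllables = to_context_units(syllables)

print(triphones)      # one context-dependent label per phoneme
print(tri_syllables)  # one context-dependent label per syllable
```

In this sketch the triphone label for the first vowel is `s-a+k`, whose context contains only consonants, whereas the tri-syllable label `sa-ku+ra` carries the vowels of both neighbouring syllables, which is the property the abstract relies on.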
Bibliographic reference. Hanazawa, Toshiyuki / Ishii, Jun / Okato, Yohei / Nakajima, Kunio (2000): "Acoustic modeling for spontaneous speech recognition using syllable dependent models", In ICSLP-2000, vol.4, 157-160.