5th International Conference on Spoken Language Processing
Prosodic cues (namely, fundamental frequency, energy and duration) provide important information for speech. For a tonal language such as Chinese, fundamental frequency (F0) plays a critical role in characterizing tone as well, which is an essential phonemic feature. In this paper, we describe our work on duration and tone modeling for telephone-quality continuous Mandarin digits, and the application of these models to improve recognition. The duration modeling includes a speaking-rate normalization scheme. A novel F0 extraction algorithm is developed, and parameters based on orthonormal decomposition of the F0 contour are extracted for tone recognition. Context dependency is expressed by ``tri-tone'' models clustered into broad classes. A 20.0% error rate is achieved for four-tone classification. Over a baseline recognition performance of 5.1% word error rate, we achieve 31.4% error reduction with duration models, 23.5% error reduction with tone models, and 39.2% error reduction with duration and tone models combined.
Bibliographic reference. Wang, Chao / Seneff, Stephanie (1998): "A study of tones and tempo in continuous Mandarin digit strings and their application in telephone quality speech recognition", In ICSLP-1998, paper 0535.