ISCA Archive ICSLP 1998

A study of tones and tempo in continuous Mandarin digit strings and their application in telephone quality speech recognition

Chao Wang, Stephanie Seneff

Prosodic cues (namely, fundamental frequency, energy, and duration) carry important information in speech. For a tonal language such as Chinese, fundamental frequency (F0) also plays a critical role in characterizing tone, which is an essential phonemic feature. In this paper, we describe our work on duration and tone modeling for telephone-quality continuous Mandarin digits, and the application of these models to improve recognition. The duration modeling includes a speaking-rate normalization scheme. A novel F0 extraction algorithm is developed, and parameters based on an orthonormal decomposition of the F0 contour are extracted for tone recognition. Context dependency is expressed by "tri-tone" models clustered into broad classes. A 20.0% error rate is achieved for four-tone classification. Relative to a baseline word error rate of 5.1%, we achieve a 31.4% error reduction with duration models, a 23.5% error reduction with tone models, and a 39.2% error reduction with duration and tone models combined.
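
The error reductions quoted in the abstract can be turned into approximate absolute word error rates with a little arithmetic. The sketch below assumes the reductions are relative to the 5.1% baseline, which the abstract implies but does not spell out. The function name `wer_after_reduction` and the resulting figures are illustrative derivations, not numbers reported by the authors.

```python
def wer_after_reduction(baseline_wer: float, relative_reduction: float) -> float:
    """Return the WER (in percent) left after a relative error reduction (in percent)."""
    return baseline_wer * (1.0 - relative_reduction / 100.0)


# Figures taken from the abstract; all values are percentages.
BASELINE_WER = 5.1
REDUCTIONS = {
    "duration": 31.4,
    "tone": 23.5,
    "duration + tone": 39.2,
}

# Assumption: each reduction is relative to the same 5.1% baseline.
implied = {
    name: round(wer_after_reduction(BASELINE_WER, reduction), 1)
    for name, reduction in REDUCTIONS.items()
}

for name, wer in implied.items():
    print(f"{name}: about {wer:.1f}% WER")
```

Under that assumption, the implied word error rates are roughly 3.5% with duration models, 3.9% with tone models, and 3.1% with both combined.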


doi: 10.21437/ICSLP.1998-140

Cite as: Wang, C., Seneff, S. (1998) A study of tones and tempo in continuous Mandarin digit strings and their application in telephone quality speech recognition. Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998), paper 0535, doi: 10.21437/ICSLP.1998-140

@inproceedings{wang98c_icslp,
  author={Chao Wang and Stephanie Seneff},
  title={{A study of tones and tempo in continuous Mandarin digit strings and their application in telephone quality speech recognition}},
  year=1998,
  booktitle={Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998)},
  pages={paper 0535},
  doi={10.21437/ICSLP.1998-140}
}