Automatic speech recognition of a tonal and syllabic language such as Mandarin Chinese poses new challenges but also offers new opportunities. We present approaches and experimental results concerning the choice of base units for acoustic modeling, pitch estimation, and the integration of pitch estimates into the modeling framework. The experimental evaluations are carried out both on relatively clean headset data and on noisy, reverberant distant-talking speech data. Results show that tonal base units offer a word error rate reduction of more than 30% compared to toneless base units. This holds for both phoneme models and initial-final models. Integrating pitch as an additional feature stream yields a further improvement of more than 20% over the best tonal baseline system. In a two-stream modeling approach, the pitch stream distributions can be strongly tied, so that the overall model size grows only moderately.
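The effect of tying on model size can be illustrated with a minimal sketch. It is not the authors' system: it assumes one diagonal Gaussian per state and stream (practical recognizers typically use mixtures), the common multi-stream convention of summing weighted per-stream log-likelihoods, and toy dimensions. All names (`two_stream_loglik`, `pitch_tying`, `pitch_weight`, the tone class labels) and all numbers are hypothetical.

```python
import math


def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def two_stream_loglik(state, spectral_obs, pitch_obs,
                      spectral_models, pitch_models, pitch_tying,
                      pitch_weight=1.0):
    """Emission log-likelihood as a weighted sum over the two streams.

    The pitch model is looked up through pitch_tying, so several
    states can share a single pitch distribution.
    """
    spec_mean, spec_var = spectral_models[state]
    pitch_mean, pitch_var = pitch_models[pitch_tying[state]]
    return (diag_gauss_logpdf(spectral_obs, spec_mean, spec_var)
            + pitch_weight * diag_gauss_logpdf(pitch_obs, pitch_mean, pitch_var))


def count_parameters(spectral_models, pitch_models):
    """Number of mean and variance parameters in each stream."""
    def count(models):
        return sum(len(mean) + len(var) for mean, var in models.values())
    return count(spectral_models), count(pitch_models)


# Toy setup (hypothetical): 4 states with 3-dimensional spectral features,
# and a 1-dimensional pitch feature tied down to 2 shared distributions.
spectral_models = {s: ([0.0] * 3, [1.0] * 3) for s in range(4)}
pitch_models = {
    "rising": ([1.0], [1.0]),
    "falling": ([-1.0], [1.0]),
}
pitch_tying = {0: "rising", 1: "rising", 2: "falling", 3: "falling"}

spectral_params, pitch_params = count_parameters(spectral_models, pitch_models)
print(spectral_params, pitch_params)
```

In this toy setup the pitch stream adds 4 parameters to the 24 spectral ones, whereas an untied pitch stream would add 8 (one mean and one variance per state). The stronger the tying, the smaller the added share.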
Cite as: Sun, Y., Willett, D., Brueckner, R., Gruhn, R., Bühler, D. (2006) Experiments on Chinese speech recognition with tonal models and pitch estimation using the Mandarin speecon data. Proc. Interspeech 2006, paper 1452-Tue3A2O.6, doi: 10.21437/Interspeech.2006-374
@inproceedings{sun06_interspeech,
  author={Ying Sun and Daniel Willett and Raymond Brueckner and Rainer Gruhn and Dirk Bühler},
  title={{Experiments on Chinese speech recognition with tonal models and pitch estimation using the Mandarin speecon data}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1452-Tue3A2O.6},
  doi={10.21437/Interspeech.2006-374}
}