This paper is concerned with language modeling (LM) for large vocabulary speech recognition in Mandarin Chinese. Because Chinese has distinctive language characteristics, we investigate some novel language modeling techniques. We also borrow some techniques that have been applied to other languages. Experiments were conducted on the Call Home Mandarin, HUB4, and HUB5 corpora obtained from the Linguistic Data Consortium (LDC). The training set consists of 9.8 hours of spontaneous speech and 100K words of text. The test set consists of 1.6 hours of spontaneous speech and 20K words of text. We have found that our results compare favorably to the results reported in the literature.
Cite as: Leung, R.H.Y., Choy, C.-Y., Leung, H.C. (1999) Characteristics of Chinese language models for large vocabulary telephone speech. Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999), 1775-1778, doi: 10.21437/Eurospeech.1999-355
@inproceedings{leung99_eurospeech,
  author={Roger H.Y. Leung and Chi-Yan Choy and Hong C. Leung},
  title={{Characteristics of Chinese language models for large vocabulary telephone speech}},
  year=1999,
  booktitle={Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999)},
  pages={1775--1778},
  doi={10.21437/Eurospeech.1999-355}
}