ISCA Archive Interspeech 2008
Language modeling for speech recognition of spoken Cantonese

Yu Ting Yeung, Houwei Cao, N. H. Zheng, Tan Lee, P. C. Ching

This paper addresses the problem of language modeling for large-vocabulary continuous speech recognition (LVCSR) of Cantonese as spoken in daily communication. As a spoken dialect, Cantonese is not used in written documents and published materials, so it is difficult to collect a sufficient amount of written Cantonese text for training statistical language models. We propose to solve this problem by translating standard Chinese text, which is much easier to find, into written Cantonese. A rule-based translation method is devised and implemented. Three language models are trained on different types of text and evaluated on an LVCSR task. Experimental results confirm that the translated text represents well the Cantonese spoken on formal occasions such as broadcast news. For colloquial Cantonese, adapting the language model with a limited amount of colloquial Cantonese text is a practically feasible solution that leads to reasonable speech recognition performance.
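The idea of rule-based translation from standard Chinese into written Cantonese can be illustrated with a minimal sketch. The abstract does not describe the actual rules, so everything here is an assumption: the `RULES` table is a hypothetical handful of well-known word correspondences, and `translate` applies a simple longest-match-first substitution, which may differ from the method used in the paper.

```python
# Hypothetical rule table (NOT from the paper): a few common
# standard Chinese -> written Cantonese word correspondences.
RULES = {
    "沒有": "冇",
    "他們": "佢哋",
    "的": "嘅",
    "是": "係",
    "不": "唔",
    "他": "佢",
}


def translate(text, rules=RULES):
    """Rewrite text left to right, preferring the longest matching rule."""
    # Try longer source strings first so "他們" wins over "他".
    keys = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule applies at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(translate("他是我的朋友"))
```

A context-free table like this over-applies some rules (for example, it would rewrite every "不" regardless of the word it belongs to), so a practical system would likely need word segmentation and context-sensitive rules.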

doi: 10.21437/Interspeech.2008-259

Cite as: Yeung, Y.T., Cao, H., Zheng, N.H., Lee, T., Ching, P.C. (2008) Language modeling for speech recognition of spoken Cantonese. Proc. Interspeech 2008, 1570-1573, doi: 10.21437/Interspeech.2008-259

