This paper describes the development of an annotated Chinese conversational text corpus for speech recognition in a travel-domain speech-to-speech translation system. A total of 515,000 manually checked utterances were constructed, yielding a 3.5-million-word Chinese corpus with word segmentation and part-of-speech (POS) tagging. The annotation was carefully checked by hand. The specifications for word segmentation and POS tagging were designed to follow the main existing Chinese corpora that are widely accepted by researchers in Chinese natural language processing, and they also take into account many features particular to conversational text. Together with the corresponding Japanese and English texts from which the Chinese was translated, the corpus forms a set of parallel corpora. To evaluate the corpus, language models built from it were assessed using perplexity and speech recognition accuracy as criteria. The perplexity of the Chinese language model was found to reach a reasonably low level. Recognition performance was also found to be comparable to that of the other two languages, even though the quantity of Chinese training data was only half that of the other two languages.
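As an illustration of the perplexity criterion mentioned in the abstract, the following is a minimal sketch of how perplexity can be computed over word-segmented text. It assumes a unigram model with add-one smoothing, chosen only for brevity; the abstract does not state the model order or smoothing used in the paper, and the function names `train_unigram` and `perplexity` and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count tokens in a list of word-segmented sentences (lists of tokens)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts, sum(counts.values())


def perplexity(sentences, counts, total, vocab_size):
    """Perplexity of an add-one-smoothed unigram model on segmented sentences.

    vocab_size should include one extra slot for unseen words.
    """
    log_prob = 0.0
    n_tokens = 0
    for sent in sentences:
        for tok in sent:
            # Add-one smoothing: unseen tokens get count 0 from the Counter.
            p = (counts[tok] + 1) / (total + vocab_size)
            log_prob += math.log(p)
            n_tokens += 1
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / n_tokens)


# Toy travel-domain sentences, already segmented (illustrative only).
train = [["我", "想", "订", "房间"], ["我", "想", "去", "机场"]]
test = [["我", "想", "订", "机场"]]

counts, total = train_unigram(train)
vocab_size = len(counts) + 1  # +1 reserves probability mass for unknown words
print(perplexity(test, counts, total, vocab_size))
```

Lower perplexity means the model assigns higher probability to the held-out text. A higher-order n-gram model with a stronger smoothing scheme would normally be used for speech recognition.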
Bibliographic reference. Hu, Xinhui / Isotani, Ryosuke / Kawai, Hisashi / Nakamura, Satoshi (2010): "Construction and evaluations of an annotated Chinese conversational corpus in travel domain for the language model of speech recognition", In INTERSPEECH-2010, 1910-1913.