International Symposium on Chinese Spoken Language Processing (ISCSLP 2002)

Taipei, Taiwan
August 23-24, 2002

Developing Chinese TAK for Computer Directly

Guo-Ping Hu, Ben-Feng Chen, Ren-Hua Wang

University of Science and Technology of China, Hefei, China

As text analysis has advanced, the quality of the knowledge available to the computer has become increasingly crucial to analysis accuracy, and many researchers have worked on developing text analysis knowledge (TAK). So far, however, apart from the lexicon, TAK for computers (such as phrase structure grammars and unregistered word recognition rules) has been built only on a small scale. Although large-scale corpora with word segmentation annotation, and even treebanks, have been developed, these projects contribute little to text parsers relative to the huge annotation workload, especially for Chinese. Considering the disadvantages of the data-mining and training techniques used in text analysis, and targeting one TTS system, this paper presents a complete set of solutions for developing Chinese TAK for computers: a lexicon tree, a nested phrase structure grammar, combination-bigram, a computer-aided development flow, and automatic checking and improvement of TAK quality using the treebank (the treebank is a by-product of this development). The paper also shows that a text analysis system built on the resulting TAK achieves an accuracy of 80% on a closed test set of 24,700 sentences and approximately 50% on an open corpus. From this we conclude that directly developing Chinese TAK for computers is more effective than other approaches under the same workload.

