International Symposium on Chinese Spoken Language Processing
August 23-24, 2002
Investigation and Analysis on Designing Chinese Balance Corpus
Rile Hu (1), Chengqing Zong (1), Juha Iso-Sipila (2), Bo Xu (1)
(1) Chinese Academy of Sciences, Beijing, China
(2) Nokia China R&D Center, Beijing, China
Statistical methods have recently become the dominant
approach in computational linguistics and natural language
processing. Corpora are the foundation of these methods, so
keeping a corpus balanced during collection is an important
issue. In this paper, we report the results of our
investigation and analysis of several real corpora and
propose a scheme for maintaining balance in corpus design.
We also present suggestions for corpus composition.
Hu, Rile / Zong, Chengqing / Iso-Sipila, Juha / Xu, Bo (2002):
"Investigation and analysis on designing Chinese balance corpus",
In ISCSLP 2002, paper 110.