ISCA Archive ISCSLP 2002

Chinese person name identification based on rules and statistics

Wenjie Cao, Chengqing Zong, Juha Iso-Sipila, Bo Xu

This paper describes our strategies for the automatic identification of Chinese person names in text. In our approach, bound words, bound rules and linguistic information, including parts of speech and dependencies between words, represent the external context features of names. The bound rules are trained on a real corpus. Using one million Chinese person names, we have developed a probability model that represents the internal features of Chinese names. Identification proceeds in two steps. First, potential Chinese person names are extracted using the rules and the characters that can serve as surnames. Second, the weight of each potential name is computed with the probability model. Potential names whose weights are above the threshold are output as real Chinese person names. In an open test, the system achieves a precision of 83.66% and a recall of 93.50%.
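
The two-step identification process can be illustrated with a minimal Python sketch. It is not the authors' implementation. The surname set, the per-character probabilities, the floor value for unseen characters, and the threshold are all toy values invented for illustration, and the function names are hypothetical. The sketch also assumes that a higher weight means a more name-like string, so candidates scoring above the threshold are kept, and it omits the bound words, bound rules and part-of-speech context entirely.

```python
import math

# Hypothetical toy data; the paper's model is built from one million names.
SURNAMES = {"王", "李", "张"}
# Assumed probabilities of a character appearing in a given name.
GIVEN_CHAR_PROB = {"伟": 0.02, "芳": 0.015, "明": 0.018}
DEFAULT_PROB = 1e-6   # assumed floor for characters unseen in given names
THRESHOLD = -12.0     # assumed log-probability cutoff


def extract_candidates(text, max_given_len=2):
    """Step 1: propose strings that start with a known surname character."""
    candidates = []
    for i, ch in enumerate(text):
        if ch in SURNAMES:
            # Try given names of one and two characters after the surname.
            for n in range(1, max_given_len + 1):
                if i + 1 + n <= len(text):
                    candidates.append(text[i:i + 1 + n])
    return candidates


def name_weight(candidate):
    """Step 2: score the given-name part as a sum of log probabilities."""
    given = candidate[1:]
    return sum(math.log(GIVEN_CHAR_PROB.get(c, DEFAULT_PROB)) for c in given)


def identify_names(text, threshold=THRESHOLD):
    """Keep the candidates whose weight is above the threshold."""
    return [c for c in extract_candidates(text) if name_weight(c) > threshold]
```

Because overlapping candidates such as a one-character and a two-character given name can both pass, a fuller system would need a further step to choose between them; the context features described in the abstract are one plausible source of that evidence.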


Cite as: Cao, W., Zong, C., Iso-Sipila, J., Xu, B. (2002) Chinese person name identification based on rules and statistics. Proc. International Symposium on Chinese Spoken Language Processing, paper 101

@inproceedings{cao02b_iscslp,
  author={Wenjie Cao and Chengqing Zong and Juha Iso-Sipila and Bo Xu},
  title={{Chinese person name identification based on rules and statistics}},
  year=2002,
  booktitle={Proc. International Symposium on Chinese Spoken Language Processing},
  pages={paper 101}
}