International Symposium on Chinese Spoken Language Processing (ISCSLP 2002)

Taipei, Taiwan
August 23-24, 2002

Chinese Person Name Identification Based on Rules and Statistics

Wenjie Cao (1), Chengqing Zong (1), Juha Iso-Sipila (2), Bo Xu (1)

(1) Chinese Academy of Sciences, Beijing, China
(2) Nokia China R&D Center, Beijing, China

This paper describes our strategies for automatic identification of Chinese person names in text. Our approach uses bound words, bound rules, and linguistic information, including parts of speech and dependencies between words, to represent the external context features of names. The bound rules are trained on a real corpus. Based on one million Chinese person names, we have developed a probability model to represent the internal features of Chinese names. Identification proceeds in two steps. First, potential Chinese person names are extracted using the rules and the characters that can serve as surnames. Second, the weight of each potential name is computed with the probability model. Potential names whose weights are below the threshold are output as real Chinese person names. In an open test, the system achieves a precision of 83.66% and a recall of 93.50%.
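
The two-step identification process can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the weight, so the sketch assumes it is the average negative log probability of the given-name characters (lower means more name-like, which fits accepting weights below the threshold). The surname set, the character probabilities, the `UNSEEN_PROB` floor, the threshold value, and all function names are hypothetical, and the bound words and bound rules of the first step are omitted.

```python
import math

# Hypothetical toy data; the paper's model is built from one million names.
SURNAMES = {"王", "李", "张"}
# Hypothetical probabilities of a character appearing in a given name.
GIVEN_CHAR_PROB = {"伟": 0.02, "芳": 0.015, "明": 0.018}
# Assumed floor probability for characters never seen in a given name.
UNSEEN_PROB = 1e-6


def extract_candidates(text, max_given_len=2):
    """Step 1 (simplified): propose strings that start with a surname character.

    The context rules described in the abstract are not modeled here.
    """
    candidates = []
    for i, ch in enumerate(text):
        if ch in SURNAMES:
            for n in range(1, max_given_len + 1):
                if i + 1 + n <= len(text):
                    candidates.append(text[i:i + 1 + n])
    return candidates


def name_weight(candidate):
    """Step 2 (assumed form): average negative log probability of the
    given-name characters. Lower values mean a more plausible name."""
    given = candidate[1:]
    total = sum(-math.log(GIVEN_CHAR_PROB.get(c, UNSEEN_PROB)) for c in given)
    return total / len(given)


def identify_names(text, threshold=6.0):
    """Keep candidates whose weight falls below the (hypothetical) threshold."""
    return [c for c in extract_candidates(text) if name_weight(c) < threshold]


if __name__ == "__main__":
    print(identify_names("王伟明去了北京"))
    print(identify_names("李去了北京"))
```

In this toy setup, overlapping candidates such as 王伟 and 王伟明 can both pass the threshold; choosing between them would need the external context features that the sketch leaves out.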

