ISCA Archive Interspeech 2006

Identify language origin of personal names with normalized appearance number of web pages

Jiali You, Yining Chen, Min Chu, Yong Zhao, Jinlin Wang

Identifying the language origin of a personal name without context is useful in many areas. Morphological structure, long considered the main source of language origin information, is usually modeled by N-grams of letters or letter chunks. In this paper, we introduce a new information source for identifying the language origin of a name: its appearance number, that is, the number of web pages in each language in which the name appears. Because web pages are not evenly distributed across languages, and state-of-the-art search engines report only the number of pages that contain the queried words, we propose a method to normalize the appearance number obtained from a search engine and use it as a new feature. When this feature is used on its own to identify the language origin of names among four closely related languages (English, German, French, and Portuguese), the error rate is 26.9%, which is comparable to that of letter 4-gram features. When it is combined with letter N-gram models, the error rate drops to 14.2%, a relative error reduction of about 43.2% compared with the letter 4-gram baseline model.
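The abstract does not spell out the normalization, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a raw page count for a name in each language is divided by a per-language reference count (a stand-in for how much web content exists in that language) and that the ratios are then rescaled to sum to 1. The function names, the reference counts, and the example numbers are all hypothetical. A small helper also works out the baseline error rate implied by the reported 14.2% error and 43.2% relative reduction.

```python
from typing import Dict


def normalize_counts(hits: Dict[str, float],
                     lang_totals: Dict[str, float]) -> Dict[str, float]:
    """Scale raw page counts by an assumed per-language size, then
    rescale the ratios so they sum to 1 (hypothetical normalization)."""
    ratios = {lang: hits[lang] / lang_totals[lang] for lang in hits}
    total = sum(ratios.values())
    if total == 0:
        # No pages found in any language: fall back to a uniform score.
        return {lang: 1.0 / len(ratios) for lang in ratios}
    return {lang: r / total for lang, r in ratios.items()}


def predict_origin(hits: Dict[str, float],
                   lang_totals: Dict[str, float]) -> str:
    """Pick the language with the highest normalized score."""
    scores = normalize_counts(hits, lang_totals)
    return max(scores, key=scores.get)


def implied_baseline(combined_error: float,
                     relative_reduction: float) -> float:
    """Baseline error implied by a final error and a relative reduction."""
    return combined_error / (1.0 - relative_reduction)


# Made-up counts: the name appears on more English pages in absolute
# terms, but English is assumed to have far more pages overall.
hits = {"English": 900, "German": 300, "French": 50, "Portuguese": 20}
lang_totals = {"English": 1000, "German": 100, "French": 80, "Portuguese": 40}

raw_guess = max(hits, key=hits.get)
normalized_guess = predict_origin(hits, lang_totals)
baseline = implied_baseline(0.142, 0.432)
```

With these made-up numbers the raw counts favor English while the normalized scores favor German, which illustrates why some correction for the uneven language distribution of web pages is needed. The implied baseline error of the letter 4-gram model works out to about 25.0%, which is consistent with the figures reported above.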


doi: 10.21437/Interspeech.2006-395

Cite as: You, J., Chen, Y., Chu, M., Zhao, Y., Wang, J. (2006) Identify language origin of personal names with normalized appearance number of web pages. Proc. Interspeech 2006, paper 1353-Tue3BuP.15, doi: 10.21437/Interspeech.2006-395

@inproceedings{you06_interspeech,
  author={Jiali You and Yining Chen and Min Chu and Yong Zhao and Jinlin Wang},
  title={{Identify language origin of personal names with normalized appearance number of web pages}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1353-Tue3BuP.15},
  doi={10.21437/Interspeech.2006-395}
}