Identifying the language origin of a personal name without context is useful in many areas. Morphological structure, long considered the main source of language-origin information, is typically modeled by N-grams of letters or letter chunks. In this paper, we introduce a new information source for identifying a name's language origin: the number of times the name appears in web pages of different languages. Because web pages are not evenly distributed across languages, and state-of-the-art search engines report only the number of pages that contain the queried words, we propose a method to normalize the appearance number obtained from a search engine and use it as a new feature. When this feature is used on its own to identify the language origin of names among four closely related languages (English, German, French, and Portuguese), the error rate is 26.9%, which is comparable to that of letter 4-gram features. When it is combined with letter N-gram models, the error rate falls to 14.2%, a relative error reduction of about 43.2% compared with the letter 4-gram baseline model.
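The abstract does not spell out the normalization, so the following is only a minimal sketch of one plausible reading: divide each raw page count by an estimate of how many pages the search engine indexes in that language, then rescale the results to sum to 1. The function names, the hit counts, and the per-language index sizes are all hypothetical and are not taken from the paper. The last helper simply works through the reported arithmetic: a 14.2% error rate after a 43.2% relative reduction implies a baseline error rate of roughly 25%.

```python
# Hypothetical sketch: every name and number below is an illustrative
# assumption, not the paper's actual normalization method or data.

def normalize_counts(hit_counts, language_totals):
    """Divide each raw hit count by the assumed number of indexed pages in
    that language, then rescale so the scores sum to 1."""
    rates = {lang: hit_counts[lang] / language_totals[lang] for lang in hit_counts}
    total = sum(rates.values())
    if total == 0:
        # No hits in any language: fall back to a uniform score.
        return {lang: 1.0 / len(rates) for lang in rates}
    return {lang: rate / total for lang, rate in rates.items()}


def guess_origin(hit_counts, language_totals):
    """Return the language with the highest normalized score."""
    scores = normalize_counts(hit_counts, language_totals)
    return max(scores, key=scores.get)


def implied_baseline(new_error, relative_reduction):
    """Baseline error rate implied by a new error rate and a relative reduction."""
    return new_error / (1.0 - relative_reduction)


# Made-up page counts for one name, and made-up index sizes per language.
hits = {"English": 9000, "German": 4000, "French": 500, "Portuguese": 300}
totals = {"English": 1_000_000, "German": 100_000, "French": 120_000, "Portuguese": 50_000}

print(max(hits, key=hits.get))        # language with the largest raw count
print(guess_origin(hits, totals))     # language with the largest normalized score
print(implied_baseline(14.2, 0.432))  # baseline error rate in percent
```

In this made-up example the raw counts favor English only because the assumed English index is much larger, while the normalized scores favor German. That is the kind of bias the normalization is meant to remove.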
Cite as: You, J., Chen, Y., Chu, M., Zhao, Y., Wang, J. (2006) Identify language origin of personal names with normalized appearance number of web pages. Proc. Interspeech 2006, paper 1353-Tue3BuP.15, doi: 10.21437/Interspeech.2006-395
@inproceedings{you06_interspeech,
  author={Jiali You and Yining Chen and Min Chu and Yong Zhao and Jinlin Wang},
  title={{Identify language origin of personal names with normalized appearance number of web pages}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1353-Tue3BuP.15},
  doi={10.21437/Interspeech.2006-395}
}