8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Language Identification of Person Names Using CF-IOF Based Weighing Function

Samuel Thomas, Ashish Verma

IBM India Research Lab, India

Information about the language of origin helps in generating pronunciations for foreign words, especially person names, in a text-to-speech synthesis system. It can be used to apply language-specific letter-to-sound (LTS) rules to these words during synthesis. In this paper, we propose a novel approach that uses substrings of a person name (called letter N-grams) to identify its language of origin. We weight the letter N-grams using a function motivated by techniques from text document classification, rather than the N-gram probabilities used in earlier approaches. We also propose a tree-based approach to select letter N-grams of different lengths for language identification. Several experiments have been conducted to evaluate the proposed approach and compare it with earlier approaches based on N-gram probabilities. We show an improvement in classification results over the earlier approaches without using any language-specific rules.
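
As a rough illustration of the idea of weighting letter N-grams in the style of text document classification, the following Python sketch scores a name against per-language N-gram weights. The abstract does not define the CF-IOF weight, so the formula here is an assumption: a TF-IDF-like product of an N-gram's relative frequency within a language and a log-scaled inverse of the number of languages in which it occurs. The function names, the `#` boundary marker, the fixed N-gram length, and the toy name lists are all hypothetical, and the tree-based selection of N-gram lengths is not attempted.

```python
import math
from collections import Counter


def letter_ngrams(name, n):
    """Return the letter n-grams of a name, padded with '#' boundary markers."""
    padded = "#" + name.lower() + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_weights(names_by_language, n=3):
    """Compute an assumed TF-IDF-style weight for every (language, n-gram) pair.

    This is NOT the paper's CF-IOF definition, only an analogous weighting.
    """
    # Count n-gram occurrences separately for each language.
    counts = {
        lang: Counter(g for name in names for g in letter_ngrams(name, n))
        for lang, names in names_by_language.items()
    }
    num_langs = len(counts)
    # Number of languages in which each n-gram appears at least once.
    langs_with = Counter(g for c in counts.values() for g in c)

    weights = {}
    for lang, c in counts.items():
        total = sum(c.values())
        weights[lang] = {
            # Relative frequency times a smoothed inverse language frequency.
            g: (k / total) * math.log(1 + num_langs / langs_with[g])
            for g, k in c.items()
        }
    return weights


def identify(name, weights, n=3):
    """Return the language whose summed n-gram weights are highest for the name."""
    scores = {
        lang: sum(w.get(g, 0.0) for g in letter_ngrams(name, n))
        for lang, w in weights.items()
    }
    return max(scores, key=scores.get)


# Toy training data, made up for illustration only.
toy_names = {
    "german": ["schmidt", "schneider", "schulz"],
    "italian": ["rossi", "russo", "bianchi"],
}
toy_weights = train_weights(toy_names, n=3)
print(identify("schwarz", toy_weights))
print(identify("rosso", toy_weights))
```

N-grams that occur in only one language receive a larger weight than those shared across languages, which is the property borrowed from document classification; a real system would need far more training names per language than this toy example.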

Bibliographic reference.  Thomas, Samuel / Verma, Ashish (2007): "Language identification of person names using CF-IOF based weighing function", In INTERSPEECH-2007, 1769-1772.