8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003


Using Place Name Data to Train Language Identification Models

Stanley F. Chen, Benoit Maison

IBM T.J. Watson Research Center, USA

The language of origin of a name affects its pronunciation, so language identification is an important technology for speech synthesis and recognition. Previous work on this task has typically used training sets that are proprietary or limited in coverage. In this work, we investigate the use of a publicly available geographic database for training language ID models. We automatically cluster place names by language, and show that models trained on place name data are effective for language ID on person names. In addition, we compare several source-channel and direct models for language ID, and achieve a 24% reduction in error rate over a source-channel letter trigram model on a 26-way language ID task.
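As an illustration of the baseline named above, the following is a minimal sketch of a source-channel letter trigram classifier, which picks the language L maximizing P(L) * P(name | L). The abstract does not give the smoothing method, the priors, or the training data, so the add-alpha smoothing, the class name `LetterTrigramLanguageID`, and the toy place names below are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import defaultdict


class LetterTrigramLanguageID:
    """Source-channel language ID: argmax over L of P(L) * P(name | L).

    P(name | L) is a letter trigram model with add-alpha smoothing
    (an assumption; the abstract does not specify the smoothing used).
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.trigram_counts = {}  # lang -> {three-letter string: count}
        self.context_counts = {}  # lang -> {two-letter context: count}
        self.name_counts = {}     # lang -> number of training names
        self.alphabet = set()     # symbols seen as a predicted letter

    @staticmethod
    def _pad(name):
        # Two start markers and one end marker, so every letter has a
        # two-letter context and name endings are modeled too.
        return "^^" + name.lower() + "$"

    def train(self, labeled_names):
        for name, lang in labeled_names:
            tri = self.trigram_counts.setdefault(lang, defaultdict(int))
            ctx = self.context_counts.setdefault(lang, defaultdict(int))
            self.name_counts[lang] = self.name_counts.get(lang, 0) + 1
            padded = self._pad(name)
            for i in range(2, len(padded)):
                tri[padded[i - 2:i + 1]] += 1
                ctx[padded[i - 2:i]] += 1
                self.alphabet.add(padded[i])

    def log_score(self, name, lang):
        # log P(L): relative frequency of the language in the training set.
        total = sum(self.name_counts.values())
        score = math.log(self.name_counts[lang] / total)
        vocab = len(self.alphabet) + 1  # one extra slot for unseen letters
        tri = self.trigram_counts[lang]
        ctx = self.context_counts[lang]
        padded = self._pad(name)
        # log P(name | L): sum of smoothed letter trigram log-probabilities.
        for i in range(2, len(padded)):
            num = tri.get(padded[i - 2:i + 1], 0) + self.alpha
            den = ctx.get(padded[i - 2:i], 0) + self.alpha * vocab
            score += math.log(num / den)
        return score

    def identify(self, name):
        return max(self.name_counts,
                   key=lambda lang: self.log_score(name, lang))


# Toy training set of place names (illustrative only, not the paper's data).
toy_place_names = [
    ("Milano", "it"), ("Torino", "it"), ("Bologna", "it"), ("Palermo", "it"),
    ("Hamburg", "de"), ("Augsburg", "de"), ("Salzburg", "de"),
    ("Heidelberg", "de"),
]

model = LetterTrigramLanguageID()
model.train(toy_place_names)

for query in ("Freiburg", "Livorno"):
    print(query, model.identify(query))
```

A direct model would instead estimate P(L | name) without going through P(name | L); this sketch covers only the source-channel side of the comparison.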


Bibliographic reference. Chen, Stanley F. / Maison, Benoit (2003): "Using place name data to train language identification models", in EUROSPEECH-2003, 1349-1352.