ISCA Archive Interspeech 2005

A text categorization approach to automatic language identification

Sheng Gao, Bin Ma, Haizhou Li, Chin-Hui Lee

We propose a novel approach to spoken language identification (LID). In this framework, a group of utterances from a particular language is treated as a "spoken document" characterized by a "document vector". The collection of spoken documents from the same language in the training set forms a specific "language identification category". An unknown test utterance can likewise be represented as a query vector, so that LID is accomplished in the same way as associating a text document with a topic, a process known as text categorization (TC). The key lies in tokenizing speech signals with a set of "key terms" so that their salient patterns and corresponding statistics can be used to discriminate between individual spoken languages. To perform LID, we can adopt any of the classifier learning and feature extraction techniques developed in the TC community. Compared with the prevailing parallel PRLM method, the proposed approach achieves a relative error reduction of about 87.5% and reaches error rates of 0.2% and 1.54% for 3 and 6 languages, respectively, with queries about 10 seconds long.
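The document-vector idea in the abstract can be illustrated with a minimal sketch. It assumes the speech has already been tokenized into symbol sequences (the toy tokens, the language labels, and the function names `ngram_counts`, `train`, and `identify` are hypothetical). It also swaps in raw n-gram counts and cosine similarity for whatever features and classifier the paper actually uses, so it should be read as an illustration of the text categorization framing rather than the authors' system.

```python
from collections import Counter
import math


def ngram_counts(tokens, n_max=2):
    """Count all n-grams of order 1..n_max in one token sequence."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(utterances_by_language):
    """Pool each language's utterances into one 'spoken document' vector."""
    vectors = {}
    for lang, utterances in utterances_by_language.items():
        pooled = Counter()
        for tokens in utterances:
            pooled.update(ngram_counts(tokens))
        vectors[lang] = pooled
    return vectors


def identify(vectors, query_tokens):
    """Return the language whose document vector is closest to the query."""
    query = ngram_counts(query_tokens)
    return max(vectors, key=lambda lang: cosine(query, vectors[lang]))


# Toy token sequences standing in for tokenizer output (hypothetical data).
training = {
    "lang_A": [["a", "b", "a", "b", "c"], ["b", "a", "b", "a"]],
    "lang_B": [["c", "d", "c", "d", "e"], ["d", "c", "d", "e"]],
}
vectors = train(training)
print(identify(vectors, ["a", "b", "a"]))
```

In this framing, any weighting scheme or classifier from text categorization could replace the count vectors and the cosine rule without changing the surrounding structure.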

doi: 10.21437/Interspeech.2005-718

Cite as: Gao, S., Ma, B., Li, H., Lee, C.-H. (2005) A text categorization approach to automatic language identification. Proc. Interspeech 2005, 2837-2840, doi: 10.21437/Interspeech.2005-718

  author={Sheng Gao and Bin Ma and Haizhou Li and Chin-Hui Lee},
  title={{A text categorization approach to automatic language identification}},
  booktitle={Proc. Interspeech 2005},