Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

A Text Categorization Approach to Automatic Language Identification

Sheng Gao (1), Bin Ma (1), Haizhou Li (1), Chin-Hui Lee (2)

(1) Institute for Infocomm Research, Singapore; (2) Georgia Institute of Technology, Atlanta, GA, USA

We propose a novel approach to spoken language identification (LID). In this framework, a group of utterances from a particular language is treated as a "spoken document" characterized by a "document vector". The collection of spoken documents in the training set from the same language forms a specific "language identification category". An unknown test utterance is likewise represented as a query vector, so that LID is accomplished in the same way as assigning a text document to a topic, a process known as text categorization (TC). The key lies in tokenizing speech signals with a set of "key terms" so that their salient patterns and corresponding statistics can be used to discriminate between individual spoken languages. To perform LID, we can adopt any of the classifier learning and feature extraction techniques developed in the TC community. Compared with the prevailing parallel PRLM method, the proposed approach achieves a relative error reduction of about 87.5% and reaches error rates of 0.2% and 1.54% for 3 and 6 languages, respectively, with queries about 10 seconds long.
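
The document-vector idea described above can be illustrated with a minimal sketch. It is not the authors' system: the abstract does not specify the tokenizer, the key terms, or the classifier, so everything here is an assumption made for illustration. Token labels such as "a1" stand in for the output of some speech tokenizer, the key terms are taken to be unigrams and bigrams of those tokens, and a plain cosine-similarity match against one count vector per language stands in for whatever TC classifier is actually used. The names `ngrams`, `document_vector`, `train`, `identify`, and `TOY_CORPUS` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all 1..n_max-grams of a token sequence as 'key terms' (assumed term set)."""
    terms = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms


def document_vector(utterances, n_max=2):
    """Pool a group of tokenized utterances into one sparse count vector."""
    counts = Counter()
    for tokens in utterances:
        counts.update(ngrams(tokens, n_max))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(corpus, n_max=2):
    """Build one document vector per language category."""
    return {lang: document_vector(utts, n_max) for lang, utts in corpus.items()}


def identify(model, tokens, n_max=2):
    """Represent a test utterance as a query vector and pick the closest category."""
    query = document_vector([tokens], n_max)
    return max(model, key=lambda lang: cosine(query, model[lang]))


# Toy, made-up token sequences; real input would come from a speech tokenizer.
TOY_CORPUS = {
    "lang_a": [["a1", "a2", "a3", "a1", "a2"], ["a2", "a3", "a1", "a2", "a3"]],
    "lang_b": [["a3", "a4", "a5", "a4", "a5"], ["a4", "a5", "a3", "a4", "a4"]],
}
model = train(TOY_CORPUS)
```

Calling `identify(model, ["a1", "a2", "a1", "a2"])` compares the query vector with each language's document vector and returns the best-matching label. In a fuller system, the raw counts would typically be replaced by weighted features and the cosine match by a trained classifier, which is where TC techniques would plug in.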

Bibliographic reference. Gao, Sheng / Ma, Bin / Li, Haizhou / Lee, Chin-Hui (2005): "A text categorization approach to automatic language identification", in INTERSPEECH-2005, 2837-2840.