12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Language Identification for Text Chats

Vesa Siivola, Bryan Pellom, Meagan Sills

Rosetta Stone Labs, USA

This work aims to classify the language of typed messages in a text chat system used by language learners. A method for training a language classifier from unlabeled data is presented. A dictionary-based method is used to produce an initial classification of the messages. Character-based n-gram models of order 3 and 5 are then built from the messages labeled in this way. A method for selectively choosing the n-grams to be modeled is used to train 15-gram models. This method produces the best-performing classifier. It has models for 57 languages and obtains over 95% accuracy on the classification of messages that are unambiguously in one language.
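
The two-stage idea in the abstract (dictionary-based initial labels, then character n-gram models trained on those labels) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the toy word lists, the tie-handling rule, and the scoring, which treats each message as a bag of character trigrams with add-one smoothing instead of a full conditional n-gram model. The selective choice of n-grams used for the 15-gram models is not covered.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return all character n-grams of a lowercased, space-padded message."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def dictionary_label(message, wordlists):
    """Initial label: the language whose word list matches the most tokens.

    Returns None when no word matches or when two languages tie
    (an assumed rule for skipping ambiguous messages).
    """
    tokens = message.lower().split()
    hits = {lang: sum(t in words for t in tokens)
            for lang, words in wordlists.items()}
    best = max(hits.values(), default=0)
    winners = [lang for lang, h in hits.items() if h == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


def train(messages, wordlists, n=3):
    """Count character n-grams per language from dictionary-labeled messages."""
    counts = defaultdict(Counter)
    for msg in messages:
        lang = dictionary_label(msg, wordlists)
        if lang is not None:
            counts[lang].update(char_ngrams(msg, n))
    return counts


def classify(message, counts, n=3):
    """Pick the language with the highest add-one-smoothed log-likelihood."""
    vocab = set().union(*counts.values())
    vocab_size = len(vocab) + 1  # one extra slot for unseen n-grams
    best_lang, best_score = None, -math.inf
    for lang, c in counts.items():
        total = sum(c.values())
        score = sum(math.log((c[g] + 1) / (total + vocab_size))
                    for g in char_ngrams(message, n))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Hypothetical toy data, for illustration only.
WORDLISTS = {
    "en": {"the", "hello", "is", "and"},
    "es": {"el", "hola", "es", "y", "la"},
}
MESSAGES = [
    "hello the cat is here",
    "the dog and the bird",
    "hola el gato es negro",
    "la casa y el perro",
]
MODEL = train(MESSAGES, WORDLISTS, n=3)
```

Calling `classify("the bird is here", MODEL)` then scores the message against each language's trigram counts. Because the character models are trained on whatever the dictionary stage labels, they can also classify messages that contain no dictionary words at all, which is the point of bootstrapping from unlabeled data.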


Bibliographic reference. Siivola, Vesa / Pellom, Bryan / Sills, Meagan (2011): "Language identification for text chats", in INTERSPEECH-2011, 2929-2932.