This work aims to classify the language of typed messages in a text chat system used by language learners. A method for training a language classifier from unlabeled data is presented. A dictionary-based method is first used to produce an initial classification of the messages. Character-based n-gram models of order 3 and 5 are then built on these initial labels. In addition, a method for selectively choosing the n-grams to be modeled is used to train 15-gram models. The selectively trained 15-gram models yield the best-performing classifier, which has models for 57 languages and obtains over 95% accuracy on messages that are unambiguously in one language.
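The two-stage idea described above (dictionary-based initial labels, then character n-gram models trained on those labels) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the paper: the toy word lists in `DICTIONARIES`, the function names, the fixed trigram order, and the add-one smoothed frequency scoring. It does not reproduce the selective n-gram choice used for the 15-gram models.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy word lists; a real system would use full dictionaries.
DICTIONARIES = {
    "en": {"the", "is", "and", "hello", "you", "how", "are"},
    "es": {"el", "es", "y", "hola", "como", "estas", "que"},
}


def dictionary_label(message):
    """Return the language whose word list matches the most tokens.

    Returns None when no word matches or when the top two languages tie.
    """
    tokens = message.lower().split()
    hits = {lang: sum(t in words for t in tokens)
            for lang, words in DICTIONARIES.items()}
    ranked = sorted(hits.values(), reverse=True)
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(hits, key=hits.get)


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(messages, n=3):
    """Count character n-grams per language, using dictionary labels only."""
    counts = defaultdict(Counter)
    for msg in messages:
        lang = dictionary_label(msg)
        if lang is not None:  # skip messages the dictionaries cannot decide
            counts[lang].update(char_ngrams(msg, n))
    return counts


def classify(message, counts, n=3):
    """Pick the language with the highest add-one smoothed log score."""
    vocab = len({g for c in counts.values() for g in c}) + 1
    best_lang, best_score = None, -math.inf
    for lang, c in counts.items():
        total = sum(c.values())
        score = sum(math.log((c[g] + 1) / (total + vocab))
                    for g in char_ngrams(message, n))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

Under these assumptions, `train` never sees a human-provided label: the dictionaries supply noisy labels, and the n-gram counts then generalize to messages containing words that are in no dictionary.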
Bibliographic reference. Siivola, Vesa / Pellom, Bryan / Sills, Meagan (2011): "Language identification for text chats", In INTERSPEECH-2011, 2929-2932.