15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Language Identification of Code Switching Sentences and Multilingual Sentences of Under-Resourced Languages by Using Multi Structural Word Information

Yin-Lai Yeong, Tien-Ping Tan

Universiti Sains Malaysia, Malaysia

Language identification (LID) is the process of identifying the languages used in a text or speech. Code switching is the switching between languages in a sentence or speech utterance. This paper focuses on LID of words in code switching sentences. Code switching can occur intersententially or intrasententially. A writer may switch from one language to another for various reasons, among them the inability to express an opinion in a particular target language, the wish to attract attention or to address a different audience, habitual expressions, and so on. Identifying the language of each word in a code switching sentence is difficult because the languages share the same character set. In addition, a code-switched segment can be as short as a word or as long as a sentence. In this paper, we propose an automatic LID method for words in code switching sentences that uses multi structural word information (MUSWI), such as grapheme, syllable and word structure, scored with an n-gram statistical model. The proposed MUSWI approach achieves an accuracy of 96.36% on the code switching sentences and 99.07% on the multilingual (non-code-switching) sentences, which are in under-resourced and closely related languages.
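As a rough illustration of word-level LID with an n-gram statistical model, the following Python sketch scores each word of a sentence under one character-bigram model per language and picks the higher-scoring language. It is a minimal sketch under stated assumptions, not the authors' MUSWI implementation: it uses only grapheme (character) bigrams with add-one smoothing and leaves out the syllable and word-structure information. The toy word lists, the language pair and its labels `ms` and `en`, and all function and class names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Add-one smoothed character n-gram model for a single language."""

    def __init__(self, words, n=2):
        self.n = n
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of the (n-1)-character history
        self.alphabet = set()            # symbols seen in the predicted position
        for word in words:
            for gram in char_ngrams(word, n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
                self.alphabet.add(gram[-1])

    def log_prob(self, word, vocab_size):
        """Sum of smoothed log P(char | history) over the word's n-grams."""
        total = 0.0
        for gram in char_ngrams(word, self.n):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total


def identify_words(sentence, models):
    """Label each whitespace-separated word with its most likely language."""
    # Shared symbol inventory so that smoothing is comparable across models.
    vocab_size = len(set().union(*(m.alphabet for m in models.values())))
    labels = []
    for word in sentence.split():
        best = max(models, key=lambda lang: models[lang].log_prob(word, vocab_size))
        labels.append((word, best))
    return labels


# Hypothetical toy training data; a real system would train on large word lists.
malay_words = ["saya", "makan", "nasi", "pergi", "sekolah",
               "kita", "dengan", "yang", "tidak", "boleh"]
english_words = ["the", "this", "that", "with", "school",
                 "going", "think", "right", "would", "should"]

models = {
    "ms": CharNgramModel(malay_words),
    "en": CharNgramModel(english_words),
}

if __name__ == "__main__":
    print(identify_words("saya think kita going", models))
```

Because both languages in this toy setup share one character set, the decision rests entirely on which character sequences are more typical of each language. This is the motivation for adding further structural cues such as syllables and word structure.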


Bibliographic reference. Yeong, Yin-Lai / Tan, Tien-Ping (2014): "Language identification of code switching sentences and multilingual sentences of under-resourced languages by using multi structural word information", in INTERSPEECH-2014, 3052-3055.