8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

A Tagging Algorithm for Mixed Language Identification in a Noisy Domain

Mike Rosner (1), Paulseph-John Farrugia (2)

(1) University of Malta, Malta
(2) MobIsle Communications Ltd., Malta

The bilingual nature of the Maltese Islands gives rise to frequent code switching, both in speech and in writing. In designing a polyglot TTS system capable of handling SMS messages within the local context, it was necessary to devise a pre-processing mechanism that identifies the language of origin of individual word tokens. Because certain common words are interlingually ambiguous, and because the domain is open and prone to word contractions and spelling mistakes, the task is less straightforward than it may first seem. In this paper we discuss a language-neutral language identification approach capable of handling these characteristics of the domain robustly.

Bibliographic reference. Rosner, Mike / Farrugia, Paulseph-John (2007): "A tagging algorithm for mixed language identification in a noisy domain", in INTERSPEECH-2007, 190-193.