A Novel Approach for Effective Recognition of the Code-Switched Data on Monolingual Language Model

Sreeram Ganji, Rohit Sinha


Code-switching refers to the phenomenon in which multilingual speakers mix words or phrases from foreign languages while communicating in a native language. Code-switching is a global phenomenon and is widely accepted in multilingual communities. However, very limited code-switched textual resources are available as yet for training a language model (LM) for such tasks. In this work, we present an approach to reduce the perplexity (PPL) of Hindi-English code-switched data when tested on an LM trained on purely native Hindi data. For this purpose, we propose a novel textual feature that allows the LM to predict code-switching instances. The proposed feature is referred to as the code-switching factor (CS-factor). We also developed a tagger that facilitates automatic tagging of code-switching instances. This tagger is trained on development data and assigns an equivalent class of foreign (English) words to each of the potential native (Hindi) words. For this study, the textual resource was created by crawling blogs from a couple of websites that educate readers about Internet usage. In the context of recognizing code-switched data, the proposed technique is found to yield a substantial improvement in terms of PPL.
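
To make the evaluation metric concrete, the following is a minimal sketch of how perplexity can be measured on a small n-gram LM, and why foreign words that the monolingual LM has never seen tend to raise it. It is not the authors' method: the bigram model, the add-one smoothing, the function names `train_bigram` and `perplexity`, and the placeholder tokens (`hi_*` for native words, `en_*` for foreign words) are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect history and bigram counts from tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        histories.update(tokens[:-1])                 # counts of each history word
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # counts of (history, word)
    # Predictable symbols: training words, end marker, and an unknown-word class.
    vocab = {w for sent in sentences for w in sent} | {"</s>", "<unk>"}
    return histories, bigrams, vocab


def perplexity(sentences, histories, bigrams, vocab):
    """PPL = exp(-(1/N) * sum(log P(w_i | w_{i-1}))) with add-one smoothing."""
    log_prob, n_tokens = 0.0, 0
    v = len(vocab)
    for sent in sentences:
        # Words outside the training vocabulary are mapped to <unk>.
        mapped = [w if w in vocab else "<unk>" for w in sent]
        tokens = ["<s>"] + mapped + ["</s>"]
        for prev, cur in zip(tokens[:-1], tokens[1:]):
            p = (bigrams[(prev, cur)] + 1) / (histories[prev] + v)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


# Hypothetical toy data: the LM sees only "native" tokens during training.
train = [["hi_1", "hi_2", "hi_3"], ["hi_1", "hi_2", "hi_4"]]
native_test = [["hi_1", "hi_2", "hi_3"]]
mixed_test = [["hi_1", "en_1", "hi_3"]]  # one foreign word switched in

model = train_bigram(train)
ppl_native = perplexity(native_test, *model)
ppl_mixed = perplexity(mixed_test, *model)
print(f"native PPL: {ppl_native:.2f}, code-switched PPL: {ppl_mixed:.2f}")
```

In this toy setting the switched-in token falls back to the unknown-word class, so both the switch point and the word following it receive low probability, and the code-switched sentence scores a higher perplexity than the purely native one. This is the gap that the abstract's approach aims to reduce.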


DOI: 10.21437/Interspeech.2018-1259

Cite as: Ganji, S., Sinha, R. (2018) A Novel Approach for Effective Recognition of the Code-Switched Data on Monolingual Language Model. Proc. Interspeech 2018, 1953-1957, DOI: 10.21437/Interspeech.2018-1259.


@inproceedings{Ganji2018,
  author={Sreeram Ganji and Rohit Sinha},
  title={A Novel Approach for Effective Recognition of the Code-Switched Data on Monolingual Language Model},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1953--1957},
  doi={10.21437/Interspeech.2018-1259},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1259}
}