Linguistically Motivated Parallel Data Augmentation for Code-Switch Language Modeling

Grandee Lee, Xianghu Yue, Haizhou Li


Code-switch language modeling is challenging due to data scarcity and an expanded vocabulary that spans two languages. To alleviate the data scarcity, we present a novel computational method that generates synthetic code-switch data using the Matrix Language Frame theory. The proposed method uses augmented parallel data to supplement the real code-switch data, and we use the resulting synthetic data to pre-train the language model. We show that the pre-trained language model can match the performance of vanilla models when it is fine-tuned with 2.5 times less real code-switch data. We also show that an RNN-based language model pre-trained on synthetic code-switch data and fine-tuned on real code-switch data achieves significantly lower perplexity than a model trained on real code-switch data alone. This reduction in perplexity translates into a 1.45% absolute reduction in word error rate (WER) in a speech recognition experiment.


DOI: 10.21437/Interspeech.2019-1382

Cite as: Lee, G., Yue, X., Li, H. (2019) Linguistically Motivated Parallel Data Augmentation for Code-Switch Language Modeling. Proc. Interspeech 2019, 3730-3734, DOI: 10.21437/Interspeech.2019-1382.


@inproceedings{Lee2019,
  author={Grandee Lee and Xianghu Yue and Haizhou Li},
  title={{Linguistically Motivated Parallel Data Augmentation for Code-Switch Language Modeling}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3730--3734},
  doi={10.21437/Interspeech.2019-1382},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1382}
}