Synthesising isiZulu-English Code-Switch Bigrams Using Word Embeddings

Ewald van der Westhuizen, Thomas Niesler


Code-switching is prevalent among South African speakers, and presents a challenge to automatic speech recognition systems. It is predominantly a spoken phenomenon, and generally does not occur in textual form. Therefore a particularly serious challenge is the extreme lack of training material for language modelling. We investigate the use of word embeddings to synthesise isiZulu-to-English code-switch bigrams with which to augment such sparse language model training data. A variety of word embeddings are trained on a monolingual English web text corpus, and subsequently queried to synthesise code-switch bigrams. Our evaluation is performed on language models trained on a new, although small, English-isiZulu code-switch corpus compiled from South African soap operas. This data is characterised by fast, spontaneously spoken speech containing frequent code-switching. We show that the augmentation of the training data with code-switched bigrams synthesised in this way leads to a reduction in perplexity.
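The abstract does not spell out how the embeddings are queried, so the following is only a minimal sketch of one plausible reading: for each isiZulu-to-English bigram seen in training, the English word is replaced by its nearest neighbours in the English embedding space, and any resulting bigram not already seen is kept as a synthetic one. The toy vectors, the placeholder token `ZUL_1`, and the names `nearest_neighbours` and `synthesise_bigrams` are all hypothetical and not taken from the paper.

```python
import math

# Toy English word embeddings. The values are invented for illustration;
# in the paper the embeddings are trained on an English web text corpus.
EMBEDDINGS = {
    "phone": [0.9, 0.1, 0.0],
    "cellphone": [0.85, 0.15, 0.05],
    "laptop": [0.7, 0.3, 0.1],
    "party": [0.0, 0.9, 0.2],
    "wedding": [0.1, 0.85, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, embeddings, k):
    """Return the k words most similar to `word`, excluding `word` itself."""
    if word not in embeddings:
        return []
    target = embeddings[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def synthesise_bigrams(seen_bigrams, embeddings, k=2):
    """Assumed procedure: swap the English word of each seen
    (isiZulu, English) bigram for its k nearest English neighbours,
    keeping only bigrams that were not already observed."""
    seen = set(seen_bigrams)
    synthetic = set()
    for zulu_word, english_word in seen_bigrams:
        for neighbour in nearest_neighbours(english_word, embeddings, k):
            candidate = (zulu_word, neighbour)
            if candidate not in seen:
                synthetic.add(candidate)
    return sorted(synthetic)


# "ZUL_1" is a placeholder standing in for an isiZulu word.
print(synthesise_bigrams([("ZUL_1", "phone")], EMBEDDINGS, k=2))
```

Under this assumed procedure, the synthetic bigrams would be added to the language model training data alongside the observed ones. How they are weighted or filtered is not stated in the abstract and is left out of the sketch.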


DOI: 10.21437/Interspeech.2017-1437

Cite as: van der Westhuizen, E., Niesler, T. (2017) Synthesising isiZulu-English Code-Switch Bigrams Using Word Embeddings. Proc. Interspeech 2017, 72-76, DOI: 10.21437/Interspeech.2017-1437.


@inproceedings{Westhuizen2017,
  author={Ewald van der Westhuizen and Thomas Niesler},
  title={Synthesising isiZulu-English Code-Switch Bigrams Using Word Embeddings},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={72--76},
  doi={10.21437/Interspeech.2017-1437},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1437}
}