ISCA Archive Interspeech 2017

Synthesising isiZulu-English Code-Switch Bigrams Using Word Embeddings

Ewald van der Westhuizen, Thomas Niesler

Code-switching is prevalent among South African speakers, and presents a challenge to automatic speech recognition systems. It is predominantly a spoken phenomenon, and generally does not occur in textual form. Therefore, a particularly serious challenge is the extreme lack of training material for language modelling. We investigate the use of word embeddings to synthesise isiZulu-to-English code-switch bigrams with which to augment such sparse language model training data. A variety of word embeddings are trained on a monolingual English web text corpus, and subsequently queried to synthesise code-switch bigrams. Our evaluation is performed on language models trained on a new, although small, English-isiZulu code-switch corpus compiled from South African soap operas. This data is characterised by fast, spontaneous speech containing frequent code-switching. We show that the augmentation of the training data with code-switch bigrams synthesised in this way leads to a reduction in perplexity.
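The abstract does not spell out how the embeddings are queried, so the following is only a minimal sketch of one plausible reading: for each observed code-switch bigram (an isiZulu word followed by an English word), the English word's nearest neighbours in an English embedding space are paired with the same isiZulu word to form new bigrams. The embedding values, the token `le`, and the function names are all illustrative assumptions, not the authors' data or procedure.

```python
import math

# Toy English embeddings. The vectors are made up for illustration;
# in practice they would be trained on a large monolingual English corpus.
EMBEDDINGS = {
    "phone": [0.9, 0.1, 0.0],
    "cellphone": [0.85, 0.15, 0.05],
    "car": [0.1, 0.9, 0.0],
    "taxi": [0.15, 0.85, 0.1],
    "happy": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, k=1):
    """Return the k words most similar to `word`, excluding the word itself."""
    query = EMBEDDINGS[word]
    scored = [
        (cosine(query, vector), other)
        for other, vector in EMBEDDINGS.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def synthesise_bigrams(seed_bigrams, k=1):
    """Pair each seed bigram's first word with neighbours of its English word.

    Assumption: a seed bigram is (isiZulu word, English word). Bigrams that
    already occur in the seed data are not emitted again.
    """
    synthetic = set()
    for zulu_word, english_word in seed_bigrams:
        if english_word not in EMBEDDINGS:
            continue  # no embedding available, so nothing to query
        for neighbour in nearest_neighbours(english_word, k):
            candidate = (zulu_word, neighbour)
            if candidate not in seed_bigrams:
                synthetic.add(candidate)
    return sorted(synthetic)


# "le" stands in for an isiZulu token; the seed bigrams are hypothetical.
seed = [("le", "phone"), ("le", "car")]
print(synthesise_bigrams(seed))
```

Under this reading, the synthesised bigrams would be added to the language model training text, and their effect judged by perplexity on held-out code-switched data.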

doi: 10.21437/Interspeech.2017-1437

Cite as: van der Westhuizen, E., Niesler, T. (2017) Synthesising isiZulu-English Code-Switch Bigrams Using Word Embeddings. Proc. Interspeech 2017, 72-76, doi: 10.21437/Interspeech.2017-1437

  author={Ewald van der Westhuizen and Thomas Niesler},
  title={{Synthesising isiZulu-English Code-Switch Bigrams Using Word Embeddings}},
  booktitle={Proc. Interspeech 2017},