Code-switching is prevalent among South African speakers and presents a challenge to automatic speech recognition systems. It is predominantly a spoken phenomenon and generally does not occur in textual form. Therefore, a particularly serious challenge is the extreme lack of training material for language modelling. We investigate the use of word embeddings to synthesise isiZulu-to-English code-switch bigrams with which to augment such sparse language model training data. A variety of word embeddings are trained on a monolingual English web text corpus, and subsequently queried to synthesise code-switch bigrams. Our evaluation is performed on language models trained on a new, although small, English-isiZulu code-switch corpus compiled from South African soap operas. This data is characterised by fast, spontaneous speech containing frequent code-switching. We show that augmenting the training data with code-switch bigrams synthesised in this way leads to a reduction in perplexity.
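The core idea of querying embeddings to synthesise new code-switch bigrams can be sketched as follows. This is a minimal illustration with toy two-dimensional vectors, not the paper's actual pipeline: the paper trains embeddings on a large monolingual English web corpus, and the example isiZulu word, the vectors, and the helper names here are all hypothetical. The sketch takes an observed code-switch bigram, finds English words whose embeddings are close to the observed English word, and emits new bigrams pairing the isiZulu word with those neighbours.

```python
import math

# Toy English word embeddings (hypothetical 2-d vectors for illustration only;
# the paper uses embeddings trained on a monolingual English web text corpus).
vectors = {
    "doctor":  [1.0, 0.1],
    "nurse":   [0.9, 0.2],
    "teacher": [0.2, 1.0],
    "school":  [0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(word, k=1):
    """Return the k English words closest to `word` in embedding space."""
    query = vectors[word]
    scored = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

def synthesise_bigrams(observed_bigrams, k=1):
    """For each observed (isiZulu, English) code-switch bigram, generate new
    bigrams by replacing the English word with its embedding neighbours."""
    synthesised = []
    for zulu_word, eng_word in observed_bigrams:
        for neighbour in nearest_neighbours(eng_word, k):
            synthesised.append((zulu_word, neighbour))
    return synthesised

# An observed isiZulu-to-English switch point (illustrative example).
observed = [("udokotela", "doctor")]
synth = synthesise_bigrams(observed, k=1)
print(synth)  # → [('udokotela', 'nurse')]
```

The synthesised bigrams would then be added to the sparse code-switch language model training data; in practice one would also need to assign them counts or probabilities consistent with the smoothing scheme of the n-gram model being augmented.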
Cite as: Westhuizen, E.v.d., Niesler, T. (2017) Synthesising isiZulu-English Code-Switch Bigrams Using Word Embeddings. Proc. Interspeech 2017, 72-76, doi: 10.21437/Interspeech.2017-1437
@inproceedings{westhuizen17_interspeech,
  author={Ewald van der Westhuizen and Thomas Niesler},
  title={{Synthesising isiZulu-English Code-Switch Bigrams Using Word Embeddings}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={72--76},
  doi={10.21437/Interspeech.2017-1437}
}