ISCA Archive SLTU 2012

Web-based corpus acquisition for Swahili language modelling

Alexander Kivaisi, Audrey Mbogho

Finding large amounts of text data for use in natural language technology is difficult for under-resourced languages such as Swahili. The corpora that are readily accessible for these languages are too small for language technologies, whose requirements can run into the hundreds of millions of words. This paper describes how search engines such as Google can be combined with crawling tools to collect Swahili text from the Web. We also share our experience of cleaning up and normalising the resulting text data. Finally, we present preliminary evaluation results for the language models built from our corpus and compare them with models built from the Helsinki Corpus.

Index Terms: Under-resourced languages, corpus acquisition, Swahili, language model


Cite as: Kivaisi, A., Mbogho, A. (2012) Web-based corpus acquisition for Swahili language modelling. Proc. 3rd Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2012), 42-47

@inproceedings{kivaisi12_sltu,
  author={Alexander Kivaisi and Audrey Mbogho},
  title={{Web-based corpus acquisition for Swahili language modelling}},
  year=2012,
  booktitle={Proc. 3rd Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2012)},
  pages={42--47}
}