ISCA Archive SLTU 2012

Quantifying the effect of corpus size on the quality of automatic diacritization of Yorùbá texts

Tunde Adegbola, Lydia Uchechukwu Odilinye

Yorùbá, being a tone language, requires tone information for the correct pronunciation of words in Text-to-Speech synthesis. In standard Yorùbá orthography, such information is held in tone marks, which are applied to vowels and syllabic nasals as diacritical markings. However, the tone marks are not always correctly applied in many Yorùbá documents because appropriate input devices for the accurate application of the diacritic marks are not always available. Hence, the absence of tone marks in most written Yorùbá texts presents a major challenge in speech synthesis, as the information required for applying the right tone sequences to synthesized Yorùbá speech may not always be available. This study proposes the use of Machine Learning techniques as a basis for the automatic application of tone marks as part of the pre-processing in high-level synthesis. Because Yorùbá is a resource-scarce language, however, there is a lack of sufficiently large Yorùbá corpora for the training of an automatic diacritizer. The study therefore investigated the relationship between corpus size and the quality of automatic diacritization, towards estimating the size of corpus required for an ideal level of accuracy.
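
The kind of data such a study needs can be illustrated with a minimal sketch, which is not the authors' method. It assumes that tone is written with the combining grave, acute, and (occasionally) macron accents, and that a diacritizer is scored by the fraction of words restored exactly. The abstract does not state the metric used, so `word_accuracy` and both function names are hypothetical. Stripping tone marks from correctly marked text yields the unmarked input that a diacritizer would be trained to restore.

```python
import unicodedata

# Assumption: tone marks are the combining grave (U+0300), acute (U+0301),
# and macron (U+0304). The dot below (U+0323) is not a tone mark and is kept.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def strip_tone_marks(text: str) -> str:
    """Remove tone marks while preserving under-dotted letters."""
    # NFD splits each letter into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC recomposes what remains (for example o + dot below).
    return unicodedata.normalize("NFC", kept)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of whitespace-separated words that match exactly.

    Hypothetical metric: assumes both strings have the same number of words.
    """
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    if len(ref) != len(hyp):
        raise ValueError("reference and hypothesis differ in word count")
    if not ref:
        return 1.0
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)
```

Under these assumptions, a learning curve could be traced by training on progressively larger slices of a tone-marked corpus and computing `word_accuracy` on a fixed held-out set for each slice.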


Cite as: Adegbola, T., Odilinye, L.U. (2012) Quantifying the effect of corpus size on the quality of automatic diacritization of Yorùbá texts. Proc. 3rd Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2012), 48-53

@inproceedings{adegbola12_sltu,
  author={Tunde Adegbola and Lydia Uchechukwu Odilinye},
  title={{Quantifying the effect of corpus size on the quality of automatic diacritization of Yorùbá texts}},
  year=2012,
  booktitle={Proc. 3rd Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2012)},
  pages={48--53}
}