Third Workshop on Spoken Language Technologies for Under-resourced Languages

Cape Town, South Africa
May 7-9, 2012

Quantifying the Effect of Corpus Size on the Quality of Automatic Diacritization of Yorùbá Texts

Tunde Adegbola, Lydia Uchechukwu Odilinye

African Languages Technology Initiative, Ibadan, Nigeria

As a tone language, Yorùbá requires tone information for the correct pronunciation of words in Text-to-Speech synthesis. In standard Yorùbá orthography, this information is carried by tone marks, which are applied to vowels and syllabic nasals as diacritical markings. However, the tone marks are not always correctly applied in many Yorùbá documents because appropriate input devices for applying them accurately are not always available. The absence of tone marks in most written Yorùbá texts therefore presents a major challenge for speech synthesis, since the information required for assigning the right tone sequences to synthesized Yorùbá speech may not always be available. This study proposes the use of Machine Learning techniques as a basis for the automatic application of tone marks as part of the pre-processing in high-level synthesis. Because Yorùbá is a resource-scarce language, however, sufficiently large Yorùbá corpora for training an automatic diacritizer are lacking. The study therefore investigated the relationship between corpus size and the quality of automatic diacritization, with a view to estimating the corpus size required for an ideal level of accuracy.
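
To make the question of corpus size versus diacritization quality concrete, the following is a minimal sketch of how such a relationship could be measured. It is not the method used in this study: it swaps in a simple most-frequent-form lookup baseline in place of the Machine Learning techniques mentioned above. The function names (strip_tones, train, diacritize, word_accuracy, learning_curve) and the assumption that tone is encoded by the combining grave, acute, and macron marks are illustrative choices, not taken from the paper.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumption: tone is carried by combining grave, acute, and macron marks.
# Under-dots (as in ẹ, ọ, ṣ) are not tone marks and are left in place.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def strip_tones(text):
    """Remove tone marks from text, keeping all other diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def train(tokens):
    """Count, for each tone-stripped word, the tone-marked forms seen."""
    table = defaultdict(Counter)
    for tok in tokens:
        table[strip_tones(tok)][tok] += 1
    return table


def diacritize(table, bare_token):
    """Return the most frequent marked form, or the input if unseen."""
    forms = table.get(bare_token)
    return forms.most_common(1)[0][0] if forms else bare_token


def word_accuracy(table, marked_tokens):
    """Fraction of test tokens whose tone marks are restored exactly."""
    if not marked_tokens:
        return 0.0
    hits = sum(diacritize(table, strip_tones(t)) == t for t in marked_tokens)
    return hits / len(marked_tokens)


def learning_curve(train_tokens, test_tokens, sizes):
    """Accuracy on test_tokens after training on the first n tokens."""
    return [
        (n, word_accuracy(train(train_tokens[:n]), test_tokens))
        for n in sizes
    ]
```

Calling learning_curve with a correctly tone-marked training corpus, a held-out test set, and a list of increasing sizes gives (size, accuracy) pairs that can be plotted to see how accuracy grows with corpus size. Both the training and test tokens are assumed to be in the same Unicode normalization form, since predictions are compared to the test tokens by exact string equality.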

