Some languages have very consistent mappings between graphemes and phonemes, while in other languages, this mapping is more ambiguous. Consonantal writing systems prove to be a challenge for Text to Speech Systems (TTS) because they do not indicate short vowels, which creates an ambiguity in pronunciation. Special letter-to-sound rules may be needed for some cases in languages that otherwise have a good correspondence between graphemes and phonemes. In the low-resource scenario, we may not have linguistic resources such as diacritizers or hand-written rules for the language. We propose a technique to automatically learn pronunciations iteratively from acoustics during TTS training and predict pronunciations from text during synthesis time. We conduct experiments on dialects of Arabic for disambiguating homographs and Hindi for discovering the schwa-deletion rules. We evaluate our systems using objective and subjective metrics of TTS and show significant improvements for dialects of Arabic. Our methods can be generalized to other languages that exhibit similar phenomena.
Bibliographic reference. Sitaram, Sunayana / Jeblee, Serena / Black, Alan W. (2015): "Using acoustics to improve pronunciation for synthesis of low resource languages", In INTERSPEECH-2015, 259-263.