Attentive Sequence-to-Sequence Learning for Diacritic Restoration of Yorùbá Language Text

Iroro Orife


Yorùbá is a widely spoken West African language with a writing system rich in tonal and orthographic diacritics. With very few exceptions, diacritics are omitted from electronic texts because of limited device and application support. Diacritics provide morphological information, are crucial for lexical disambiguation and pronunciation, and are vital for any Yorùbá text-to-speech (TTS), automatic speech recognition (ASR) and natural language processing (NLP) task. Reframing Automatic Diacritic Restoration (ADR) as a machine translation task, we experiment with two different attentive Sequence-to-Sequence neural models to process undiacritized text. On our evaluation dataset, this approach produces diacritization error rates of less than 5%. We have released pre-trained models, datasets and source code as an open-source project to advance efforts on Yorùbá language technology.
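
To illustrate what framing ADR as a translation task can look like in practice, the sketch below builds a (source, target) pair by stripping diacritics from correctly diacritized text, so that the undiacritized form plays the role of the source language. This is only an assumption about how such parallel data might be prepared, not a description of the released code; the function names `strip_diacritics` and `make_pair` and the whitespace tokenization are hypothetical.

```python
import unicodedata
from typing import List, Tuple


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks and under-dots) from text."""
    # NFD splits each accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def make_pair(diacritized: str) -> Tuple[List[str], List[str]]:
    """Return (source_tokens, target_tokens) for one sentence.

    Assumption: simple whitespace tokenization, with one source token
    per target token.
    """
    target = unicodedata.normalize("NFC", diacritized).split()
    source = [strip_diacritics(tok) for tok in target]
    return source, target


if __name__ == "__main__":
    src, tgt = make_pair("Yorùbá")
    print(src, tgt)
```

Under these assumptions, a sequence-to-sequence model would be trained to map each source sequence back to its target, which is the restoration direction.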


DOI: 10.21437/Interspeech.2018-42

Cite as: Orife, I. (2018) Attentive Sequence-to-Sequence Learning for Diacritic Restoration of Yorùbá Language Text. Proc. Interspeech 2018, 2848-2852, DOI: 10.21437/Interspeech.2018-42.


@inproceedings{Orife2018,
  author={Iroro Orife},
  title={Attentive Sequence-to-Sequence Learning for Diacritic Restoration of Yorùbá Language Text},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2848--2852},
  doi={10.21437/Interspeech.2018-42},
  url={http://dx.doi.org/10.21437/Interspeech.2018-42}
}