ISCA Archive SLTU 2014

A robust diacritics restoration system using unreliable raw text data

Lucian Petrică, Horia Cucu, Andi Buzo, Corneliu Burileanu

Statistical language models are utilized in many speech processing algorithms, e.g., automatic speech recognition (ASR). Such a model is created from a text corpus, but many of the text corpora for Romanian are unreliable with respect to the use of diacritic marks, i.e., diacritics are either partially or completely missing, resulting in low quality language models. We present a methodology for restoring diacritic marks to an unreliable text corpus, which requires no text resources apart from the corpus itself. The proposed methodology (i) identifies sections of the input corpus which are correct with respect to the use of diacritics, (ii) utilizes these sections to train a diacritics restoration system (DRS), and (iii) utilizes the DRS to correct the remaining sections of the corpus. We compare the DRS trained at (ii) with state-of-the-art systems, and observe up to 12% improvement with regard to the correctness of diacritic restoration. Furthermore, we utilize our methodology to create improved language models for the ASR system developed by the SpeeD laboratory, and demonstrate a decrease of 14% in perplexity and a 20% reduction of the out-of-vocabulary rate as a result.
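The three steps (i) to (iii) can be illustrated with a minimal Python sketch. This is not the authors' system: the abstract does not say how correct sections are identified or what model the DRS uses. The sketch assumes, purely for illustration, that a sentence is reliable when its share of diacritic-bearing letters reaches a hypothetical `threshold`, and it stands in a most-frequent-form word lookup for the DRS. All function names are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    # Decompose characters and drop combining marks (e.g. "ș" -> "s").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritic_ratio(sentence):
    # Share of letters that carry a diacritic mark.
    letters = [ch for ch in unicodedata.normalize("NFC", sentence) if ch.isalpha()]
    if not letters:
        return 0.0
    marked = sum(1 for ch in letters if strip_diacritics(ch) != ch)
    return marked / len(letters)


def split_corpus(sentences, threshold=0.03):
    # Step (i): ASSUMED heuristic, not taken from the paper. A sentence counts
    # as reliable when its diacritic ratio reaches the threshold.
    reliable, unreliable = [], []
    for sentence in sentences:
        if diacritic_ratio(sentence) >= threshold:
            reliable.append(sentence)
        else:
            unreliable.append(sentence)
    return reliable, unreliable


def train_drs(reliable):
    # Step (ii): map each stripped word to its most frequent diacritized form.
    # A simple lookup table stands in for the DRS here.
    counts = defaultdict(Counter)
    for sentence in reliable:
        for word in sentence.lower().split():
            counts[strip_diacritics(word)][word] += 1
    return {plain: forms.most_common(1)[0][0] for plain, forms in counts.items()}


def restore(sentence, model):
    # Step (iii): replace each word with its learned form; unknown words are kept.
    words = sentence.lower().split()
    return " ".join(model.get(strip_diacritics(w), w) for w in words)


corpus = ["țara și pădurea sunt în față", "tara si padurea"]
reliable, unreliable = split_corpus(corpus)
model = train_drs(reliable)
restored = [restore(sentence, model) for sentence in unreliable]
```

Lowercasing and whitespace tokenization keep the sketch short. A word-level lookup cannot disambiguate forms that depend on context, which is one reason a practical DRS would need a richer model.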

Index Terms: Diacritics, speech recognition


Cite as: Petrică, L., Cucu, H., Buzo, A., Burileanu, C. (2014) A robust diacritics restoration system using unreliable raw text data. Proc. 4th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2014), 215-220

@inproceedings{petrica14_sltu,
  author={Lucian Petrică and Horia Cucu and Andi Buzo and Corneliu Burileanu},
  title={{A robust diacritics restoration system using unreliable raw text data}},
  year=2014,
  booktitle={Proc. 4th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2014)},
  pages={215--220}
}