Leveraging Text Data for Word Segmentation for Underresourced Languages

Thomas Glarner, Benedikt Boenninghoff, Oliver Walter, Reinhold Haeb-Umbach


In this contribution we show how to exploit text data to support word discovery from audio input in an underresourced target language. Given audio, of which a certain amount is transcribed at the word level, and additional unrelated text data, the approach learns a probabilistic mapping from acoustic units to characters and uses it to segment the audio data into words without the need for a pronunciation dictionary. This is achieved by three components: an unsupervised acoustic unit discovery system, an acoustic unit-to-grapheme converter trained in a supervised manner, and a word discovery system that is initialized with a language model trained on the text data. Experiments for multiple setups show that initializing the language model with text data improves the word segmentation performance by a large margin.
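To illustrate the general idea of letting a language model trained on text guide word segmentation, the following is a minimal sketch and not the system described above. It assumes the input has already been converted to a character string, and it substitutes a toy unigram word model with Viterbi search for the word discovery component. The function names, the `max_len` limit, and the per-character penalty for unknown words are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def train_unigram(text_lines):
    """Count words in whitespace-tokenized text and return log-probabilities."""
    counts = Counter(w for line in text_lines for w in line.split())
    total = sum(counts.values())
    return {w: math.log(c / total) for w, c in counts.items()}


def segment(chars, logp, max_len=10, unk_logp_per_char=-10.0):
    """Viterbi search for the most probable split of `chars` into words."""
    n = len(chars)
    best = [0.0] + [-math.inf] * n  # best[i]: score of the best split of chars[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in that split
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = chars[start:end]
            # Assumption: words missing from the text data pay a fixed
            # per-character penalty instead of a proper smoothed probability.
            score = best[start] + logp.get(word, unk_logp_per_char * len(word))
            if score > best[end]:
                best[end], back[end] = score, start
    # Follow the back-pointers from the end of the string to recover the words.
    words, i = [], n
    while i > 0:
        words.append(chars[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy "text data" standing in for the unrelated text corpus.
text = ["the cat sat", "the dog sat", "a cat ran"]
lm = train_unigram(text)
print(segment("thecatsat", lm))
```

In this toy setting the language model is fixed after training on the text; a word discovery system initialized with such a model would continue to adapt it to the audio-derived input.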


DOI: 10.21437/Interspeech.2017-1262

Cite as: Glarner, T., Boenninghoff, B., Walter, O., Haeb-Umbach, R. (2017) Leveraging Text Data for Word Segmentation for Underresourced Languages. Proc. Interspeech 2017, 2143-2147, DOI: 10.21437/Interspeech.2017-1262.


@inproceedings{Glarner2017,
  author={Thomas Glarner and Benedikt Boenninghoff and Oliver Walter and Reinhold Haeb-Umbach},
  title={Leveraging Text Data for Word Segmentation for Underresourced Languages},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2143--2147},
  doi={10.21437/Interspeech.2017-1262},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1262}
}