ISCA Archive Interspeech 2017

Leveraging Text Data for Word Segmentation for Underresourced Languages

Thomas Glarner, Benedikt Boenninghoff, Oliver Walter, Reinhold Haeb-Umbach

In this contribution we show how to exploit text data to support word discovery from audio input in an underresourced target language. Given audio, part of which is transcribed at the word level, and additional unrelated text data, the approach learns a probabilistic mapping from acoustic units to characters and uses it to segment the audio data into words without the need for a pronunciation dictionary. This is achieved with three components: an unsupervised acoustic unit discovery system, an acoustic unit-to-grapheme converter trained in a supervised manner, and a word discovery system that is initialized with a language model trained on the text data. Experiments for multiple setups show that the initialization of the language model with text data improves the word segmentation performance by a large margin.

doi: 10.21437/Interspeech.2017-1262

Cite as: Glarner, T., Boenninghoff, B., Walter, O., Haeb-Umbach, R. (2017) Leveraging Text Data for Word Segmentation for Underresourced Languages. Proc. Interspeech 2017, 2143-2147, doi: 10.21437/Interspeech.2017-1262

  author={Thomas Glarner and Benedikt Boenninghoff and Oliver Walter and Reinhold Haeb-Umbach},
  title={{Leveraging Text Data for Word Segmentation for Underresourced Languages}},
  booktitle={Proc. Interspeech 2017},