ISCA Archive Interspeech 2009

Multiple text segmentation for statistical language modeling

Sopheap Seng, Laurent Besacier, Brigitte Bigi, Eric Castelli

In this article we address the text segmentation problem in statistical language modeling for under-resourced languages whose writing systems have no word boundary delimiters. While the lack of text resources has a negative impact on the performance of language models, the errors introduced by automatic word segmentation make those data even less usable. To better exploit the text resources, we propose a method based on weighted finite state transducers to estimate the N-gram language model from a training corpus in which each sentence is segmented in multiple ways instead of a single, unique way. Multiple segmentation generates more N-grams from the training corpus and yields N-grams that are not found with unique segmentation. We use this approach to train the language models for automatic speech recognition systems for Khmer and Vietnamese, and multiple segmentation leads to better performance than the unique segmentation approach.
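
The idea of counting N-grams over several segmentations of the same sentence can be illustrated with a minimal Python sketch. It is not the weighted finite state transducer method of the paper: it uses plain recursive enumeration over a toy vocabulary, and it adds raw, unweighted counts, which is an assumption made only for illustration. The names `segmentations`, `ngram_counts`, the vocabulary, and the input string are all hypothetical.

```python
from collections import Counter


def segmentations(text, vocab):
    """Return every split of `text` into words that all belong to `vocab`."""
    max_len = max(len(word) for word in vocab)
    results = []

    def extend(start, words):
        if start == len(text):
            results.append(list(words))
            return
        # Try every vocabulary word that starts at position `start`.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            piece = text[start:end]
            if piece in vocab:
                words.append(piece)
                extend(end, words)
                words.pop()

    extend(0, [])
    return results


def ngram_counts(sentences, n):
    """Count n-grams over segmented sentences, padded with <s> and </s>."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + words + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


# Toy data (hypothetical): an unsegmented string and a small vocabulary.
vocab = {"a", "b", "ab", "c"}
all_segs = segmentations("abc", vocab)

# Unique segmentation baseline: keep only the split with the fewest words.
unique_seg = [min(all_segs, key=len)]

multi_bigrams = ngram_counts(all_segs, 2)
unique_bigrams = ngram_counts(unique_seg, 2)

# Bigrams that only appear when several segmentations are kept.
extra_bigrams = set(multi_bigrams) - set(unique_bigrams)
```

On this toy input the multiple segmentation contributes bigrams such as `("a", "b")` that the single shortest split never produces, which is the effect the abstract describes. A real system would need a way to keep the number of segmentations tractable for long sentences, since exhaustive enumeration grows quickly.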

doi: 10.21437/Interspeech.2009-119

Cite as: Seng, S., Besacier, L., Bigi, B., Castelli, E. (2009) Multiple text segmentation for statistical language modeling. Proc. Interspeech 2009, 2663-2666, doi: 10.21437/Interspeech.2009-119

  author={Sopheap Seng and Laurent Besacier and Brigitte Bigi and Eric Castelli},
  title={{Multiple text segmentation for statistical language modeling}},
  booktitle={Proc. Interspeech 2009},