In this article we address the text segmentation problem in statistical language modeling for under-resourced languages whose writing systems have no word boundary delimiters. While the lack of text resources has a negative impact on the performance of language models, the errors introduced by automatic word segmentation make those data even less usable. To better exploit the text resources, we propose a method based on weighted finite state transducers to estimate the N-gram language model from a training corpus in which each sentence is segmented in multiple ways instead of receiving a unique segmentation. Multiple segmentation generates more N-grams from the training corpus and makes it possible to obtain N-grams not found with unique segmentation. We use this approach to train the language models for automatic speech recognition systems for Khmer and Vietnamese; multiple segmentation leads to better performance than the unique segmentation approach.
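The claim that multiple segmentation yields N-grams that a unique segmentation misses can be illustrated with a small sketch. It is not the weighted finite state transducer method of the paper: it enumerates segmentations by plain recursion over a toy vocabulary, uses greedy longest matching as a stand-in for the unique segmentation, and collects raw, unweighted bigram counts. The function names, the toy vocabulary, and the choice of longest matching as the baseline are all assumptions made for illustration.

```python
from collections import Counter


def segmentations(text, vocab):
    """Yield every split of `text` into words that all belong to `vocab`."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in vocab:
            for rest in segmentations(text[i:], vocab):
                yield [head] + rest


def longest_match(text, vocab):
    """Greedy left-to-right longest matching (assumed unique-segmentation baseline)."""
    words, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in vocab:
                words.append(text[pos:end])
                pos = end
                break
        else:
            # No vocabulary word starts here: emit the single character.
            words.append(text[pos])
            pos += 1
    return words


def bigram_counts(sentences):
    """Count bigrams over segmented sentences, with sentence boundary markers."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


# Toy data (hypothetical): Latin letters stand in for an unsegmented script.
vocab = {"a", "b", "ab", "c", "bc"}
text = "abc"

unique = bigram_counts([longest_match(text, vocab)])
multiple = bigram_counts(list(segmentations(text, vocab)))

# Bigrams reachable only through the alternative segmentations.
extra = set(multiple) - set(unique)
```

On this toy input the unique segmentation is `ab c`, while the enumeration also produces `a b c` and `a bc`, so `extra` holds bigrams such as `("a", "bc")` that the unique segmentation never generates. Enumeration grows exponentially with sentence length, which is one reason a compact lattice representation such as a transducer is attractive in practice.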
Bibliographic reference. Seng, Sopheap / Besacier, Laurent / Bigi, Brigitte / Castelli, Eric (2009): "Multiple text segmentation for statistical language modeling", In INTERSPEECH-2009, 2663-2666.