ISCA Archive Interspeech 2007

Towards better language modeling for Thai LVCSR

Markpong Jongtaveesataporn, Issara Thienlikit, Chai Wutiwiwatchai, Sadaoki Furui

One of the difficulties of Thai language modeling is text corpus preparation. Because written Thai text has no explicit word boundary marker, word segmentation must be performed before training a language model. This paper presents two approaches to language model construction for Thai LVCSR based on pseudo-morpheme merging. The first approach merges pseudo-morphemes using forward and reverse bi-grams. The second approach uses the C4.5 decision tree to merge pseudo-morphemes based on multiple features. ASR systems with language models built using these methods perform better than systems that use only pseudo-morpheme or lexicon-based word segmentation. These approaches produce results comparable to those obtained by a system that uses manual segmentation.
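
As a rough illustration of the first approach, the sketch below merges adjacent units when both a forward and a reverse bi-gram estimate are high. It is not the authors' implementation: the abstract does not give the merging criterion, so the relative-frequency estimates, the greedy left-to-right pass, the `threshold` parameter, and the function names `train_bigram_counts` and `merge_units` are all assumptions made for this example. Latin letters stand in for pseudo-morphemes.

```python
from collections import Counter


def train_bigram_counts(corpus):
    """Count single units and adjacent pairs over pre-segmented sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for units in corpus:
        unigrams.update(units)
        bigrams.update(zip(units, units[1:]))
    return unigrams, bigrams


def merge_units(units, unigrams, bigrams, threshold=0.5):
    """Greedy left-to-right merge (an assumed criterion, not the paper's).

    Two adjacent units are joined when both the forward estimate
    P(cur | prev) and the reverse estimate P(prev | cur) reach the
    hypothetical threshold.
    """
    if not units:
        return []
    merged = [units[0]]
    prev = units[0]
    for cur in units[1:]:
        pair = bigrams[(prev, cur)]
        forward = pair / unigrams[prev] if unigrams[prev] else 0.0
        reverse = pair / unigrams[cur] if unigrams[cur] else 0.0
        if forward >= threshold and reverse >= threshold:
            merged[-1] += cur  # extend the current merged unit
        else:
            merged.append(cur)  # start a new unit
        prev = cur
    return merged


# Toy corpus of pre-segmented "sentences" (placeholder units, not real data).
corpus = [["a", "b", "c"], ["a", "b", "d"], ["e", "c"]]
unigrams, bigrams = train_bigram_counts(corpus)
result = merge_units(["a", "b", "c"], unigrams, bigrams, threshold=0.6)
```

In this toy corpus "a" is always followed by "b" and "b" is always preceded by "a", so the pair is merged, while "b" and "c" co-occur only half the time in each direction and stay separate. The merged units would then serve as the vocabulary for language model training.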

doi: 10.21437/Interspeech.2007-447

Cite as: Jongtaveesataporn, M., Thienlikit, I., Wutiwiwatchai, C., Furui, S. (2007) Towards better language modeling for Thai LVCSR. Proc. Interspeech 2007, 1553-1556, doi: 10.21437/Interspeech.2007-447

  author={Markpong Jongtaveesataporn and Issara Thienlikit and Chai Wutiwiwatchai and Sadaoki Furui},
  title={{Towards better language modeling for Thai LVCSR}},
  booktitle={Proc. Interspeech 2007},