8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Towards Better Language Modeling for Thai LVCSR

Markpong Jongtaveesataporn (1), Issara Thienlikit (1), Chai Wutiwiwatchai (2), Sadaoki Furui (1)

(1) Tokyo Institute of Technology, Japan
(2) NECTEC, Thailand

One of the difficulties of Thai language modeling is text corpus preparation. Because written Thai has no explicit word boundary marker, word segmentation must be performed before a language model can be trained. This paper presents two approaches to language model construction for Thai LVCSR based on pseudo-morpheme merging. The first approach merges pseudo-morphemes using forward and reverse bi-grams. The second approach uses the C4.5 decision tree to merge pseudo-morphemes based on multiple features. ASR systems with language models built using these methods perform better than systems that use only pseudo-morpheme or lexicon-based word segmentation. These approaches produce results comparable to those obtained by a system using manual segmentation.
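As an illustration of the first approach, the following is a minimal sketch of bi-gram-based merging. The abstract does not state the merging criterion, so the rule used here is an assumption: two adjacent pseudo-morphemes are joined when both the forward estimate P(next | previous) and the reverse estimate P(previous | next) reach a threshold. The function names `train_bigram_stats` and `merge_units`, the `threshold` parameter, and the placeholder tokens are hypothetical and are not taken from the paper.

```python
from collections import Counter


def train_bigram_stats(sentences):
    """Count unigrams and adjacent pairs over pre-segmented sentences.

    Each sentence is a list of pseudo-morpheme strings.
    """
    unigrams = Counter()
    bigrams = Counter()
    for units in sentences:
        unigrams.update(units)
        bigrams.update(zip(units, units[1:]))
    return unigrams, bigrams


def merge_units(units, unigrams, bigrams, threshold=0.8):
    """Greedily merge adjacent units, scanning left to right.

    Assumed rule (not from the paper): merge when both relative-frequency
    estimates count(prev, cur) / count(prev) and count(prev, cur) / count(cur)
    are at least `threshold`.
    """
    if not units:
        return []
    merged = [units[0]]
    prev = units[0]
    for cur in units[1:]:
        pair = bigrams[(prev, cur)]
        forward = pair / unigrams[prev] if unigrams[prev] else 0.0
        reverse = pair / unigrams[cur] if unigrams[cur] else 0.0
        if forward >= threshold and reverse >= threshold:
            merged[-1] += cur  # extend the current merged unit
        else:
            merged.append(cur)  # start a new unit
        prev = cur
    return merged


# Toy corpus with placeholder tokens standing in for pseudo-morphemes.
corpus = [["a", "b", "c"], ["a", "b", "d"], ["e", "c"]]
unigrams, bigrams = train_bigram_stats(corpus)
print(merge_units(["a", "b", "c"], unigrams, bigrams))
```

In this toy corpus, "a" and "b" always occur together, so they pass the threshold in both directions, while "b" is followed by "c" only half of the time and stays separate. Requiring agreement in both directions is one way to avoid merging a frequent unit with every rare neighbor; the merged text would then be used to train the language model.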

Bibliographic reference. Jongtaveesataporn, Markpong / Thienlikit, Issara / Wutiwiwatchai, Chai / Furui, Sadaoki (2007): "Towards better language modeling for Thai LVCSR", In INTERSPEECH-2007, 1553-1556.