14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

A Hybrid Language Model for Open-Vocabulary Thai LVCSR

Kwanchiva Thangthai, Ananlada Chotimongkol, Chai Wutiwiwatchai

NECTEC, Thailand

This paper investigates the use of a hybrid language model for open-vocabulary Thai LVCSR. Thai text is written without word boundary markers, and the definition of a word unit is often ambiguous because of compound words. Hence, an open-vocabulary LVCSR system built on words alone requires a very large lexicon that also covers this word-unit ambiguity. The pseudo-morpheme (PM), a syllable-like sub-word unit designed specifically for Thai, is considered a better-defined unit. To overcome the out-of-vocabulary problem while also reducing the size of the lexicon, a hybrid language model that combines word and sub-word units is proposed. Words and sub-words that occur frequently in several domains constitute the open vocabulary for general-domain Thai LVCSR. To verify our scheme, we run recognition experiments on data from various tasks, including broadcast news transcription, dictation, and mobile speech-to-speech translation. Open-vocabulary Thai LVCSR using the hybrid language model clearly reduces the out-of-vocabulary problem. The proposed model, with a much smaller lexicon, achieves a recognition error rate comparable to that of a baseline system using a full-word lexicon.
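The idea of mixing word and sub-word units can be illustrated with a minimal sketch. It is not the authors' implementation: the frequency threshold `min_count`, the function names, and the toy `segment` function (which splits on hyphens as a stand-in for a real Thai pseudo-morpheme segmenter) are all assumptions made for illustration, and the abstract does not state how units were actually selected.

```python
from collections import Counter

def segment(word):
    # Toy placeholder for a pseudo-morpheme segmenter (assumption):
    # hyphens mark the sub-word boundaries in this made-up data.
    return word.split("-")

def build_hybrid_vocab(corpus, min_count=2):
    """Keep frequent words whole; collect sub-word units of the rest."""
    word_counts = Counter(w for sentence in corpus for w in sentence)
    words = {w for w, c in word_counts.items() if c >= min_count}
    subwords = {u for w in word_counts if w not in words for u in segment(w)}
    return words, subwords, set(word_counts)

def to_hybrid(sentence, words):
    """Rewrite a sentence so that words outside the word list are split."""
    tokens = []
    for w in sentence:
        if w in words:
            tokens.append(w)
        else:
            tokens.extend(segment(w))
    return tokens

def oov_rate(tokens, vocab):
    """Fraction of tokens that are not covered by the vocabulary."""
    if not tokens:
        return 0.0
    return sum(t not in vocab for t in tokens) / len(tokens)

# Made-up training and test data (not from the paper).
train = [["aa", "bb-cc", "aa"], ["aa", "dd-cc"]]
test = ["aa", "bb-dd"]  # "bb-dd" is unseen, but its units were seen

words, subwords, full_word_vocab = build_hybrid_vocab(train)
hybrid_tokens = to_hybrid(test, words)

word_oov = oov_rate(test, full_word_vocab)
hybrid_oov = oov_rate(hybrid_tokens, words | subwords)
```

In this toy data the unseen word `bb-dd` is out of vocabulary for the word-only lexicon, but it is covered once it is rewritten as the sub-word units `bb` and `dd`. The example shows only the coverage mechanism and says nothing about lexicon size or recognition accuracy.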


Bibliographic reference. Thangthai, Kwanchiva / Chotimongkol, Ananlada / Wutiwiwatchai, Chai (2013): "A hybrid language model for open-vocabulary Thai LVCSR", in INTERSPEECH-2013, 2207-2211.