9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Thai Named-Entity Recognition Using Class-Based Language Modeling on Multiple-Sized Subword Units

Kwanchiva Saykhum (1), Vataya Boonpiam (1), Nattanun Thatphithakkul (1), Chai Wutiwiwatchai (1), Cholwich Natthee (2)

(1) NECTEC, Thailand; (2) Thammasat University, Thailand

This paper presents early work on speech recognition of Thai named entities, a major source of out-of-vocabulary words in broadcast news transcription. Motivated by an analysis of Thai name structure, a statistical class-based language model is applied to multiple-sized subword units, with a constraint on subword positions. The subwords can be defined automatically from their statistics. The proposed model is evaluated on Thai person name recognition in broadcast news data. With a subword inventory built from a very large training set of Thai names, only 0.7% of the subwords in the test set are out-of-vocabulary. The best-configured system, which incorporates both syllable merging and subword clustering algorithms, achieves approximately 40% syllable accuracy, with 25% of names fully discovered.
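
As a rough illustration of the class-based idea, the sketch below estimates a generic class-based bigram over position-tagged subword units, using the standard decomposition P(u_i | u_{i-1}) ≈ P(c_i | c_{i-1}) · P(u_i | c_i). The abstract does not give the model's details, so everything here is an assumption made for illustration: the romanized toy units, the `@B`/`@E` position tags, the class labels `B1`/`E1`, and the function name `train_class_bigram` are all hypothetical, and no smoothing is applied.

```python
from collections import Counter

def train_class_bigram(sequences, unit_to_class):
    """Return a function giving a maximum-likelihood class-based bigram
    probability P(class | prev class) * P(unit | class). Hypothetical sketch."""
    class_bigrams = Counter()   # counts of (previous class, current class)
    class_context = Counter()   # counts of a class appearing as the history
    unit_counts = Counter()     # counts of each subword unit
    class_counts = Counter()    # counts of each class over all units

    for seq in sequences:
        classes = ["<s>"] + [unit_to_class[u] for u in seq]
        for prev_c, cur_c in zip(classes, classes[1:]):
            class_bigrams[(prev_c, cur_c)] += 1
            class_context[prev_c] += 1
        for u in seq:
            unit_counts[u] += 1
            class_counts[unit_to_class[u]] += 1

    def prob(prev_unit, unit):
        # prev_unit=None denotes the start of a name.
        prev_c = "<s>" if prev_unit is None else unit_to_class[prev_unit]
        cur_c = unit_to_class[unit]
        if class_context[prev_c] == 0 or class_counts[cur_c] == 0:
            return 0.0  # unseen history or empty class; no smoothing here
        p_class = class_bigrams[(prev_c, cur_c)] / class_context[prev_c]
        p_unit = unit_counts[unit] / class_counts[cur_c]
        return p_class * p_unit

    return prob

# Toy training names, each split into position-tagged subwords (invented data).
names = [["som@B", "chai@E"], ["som@B", "sak@E"], ["wi@B", "chai@E"]]
unit_to_class = {"som@B": "B1", "wi@B": "B1", "chai@E": "E1", "sak@E": "E1"}

prob = train_class_bigram(names, unit_to_class)
p_start = prob(None, "som@B")       # name-initial unit
p_unseen = prob("wi@B", "sak@E")    # unit pair never seen together in training
```

In this toy setup the pair `wi@B` followed by `sak@E` never occurs in the training names, yet it still receives a non-zero probability because the two units share classes with observed pairs. That kind of generalization is the usual motivation for class-based models when the target words, such as person names, are sparse.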


Bibliographic reference.  Saykhum, Kwanchiva / Boonpiam, Vataya / Thatphithakkul, Nattanun / Wutiwiwatchai, Chai / Natthee, Cholwich (2008): "Thai named-entity recognition using class-based language modeling on multiple-sized subword units", In INTERSPEECH-2008, 1586-1589.