This article presents an early study of speech recognition of Thai named entities, which pose a crucial out-of-vocabulary word problem in broadcast news transcription. Motivated by an analysis of Thai name structure, a statistical class-based language model is applied to multiple-sized subword units with a constraint on subword positions. The subwords can be defined automatically from their statistics. The proposed model is evaluated on Thai person-name recognition in broadcast news data. With a subword inventory built from a very large training set of Thai names, only 0.7% of the subwords in the test set are out of vocabulary. The best-configured system, which incorporates both syllable merging and subword clustering algorithms, achieves approximately 40% syllable accuracy, with 25% of names fully discovered.
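As an illustration only, the ideas of statistics-driven syllable merging, position-constrained subwords, and out-of-vocabulary subword rate can be sketched as follows. The abstract does not give the merging criterion or the position labels, so the sketch assumes a simple "merge the most frequent adjacent pair" rule and a hypothetical B/M/E/S labeling scheme. The function names and the romanized toy syllables are likewise hypothetical and are not taken from the paper.

```python
from collections import Counter


def tag_positions(units):
    """Label each unit by its position in the name.

    Assumed scheme (not from the paper): B = first, M = middle,
    E = last, S = the name consists of a single unit.
    """
    n = len(units)
    if n == 1:
        return [(units[0], "S")]
    tags = ["B"] + ["M"] * (n - 2) + ["E"]
    return list(zip(units, tags))


def merge_most_frequent_pair(names):
    """Perform one merge step over a list of syllable sequences.

    Assumed criterion: the most frequent adjacent pair of units is
    joined into one larger unit wherever it occurs.
    Returns the rewritten sequences and the merged pair (or None).
    """
    pair_counts = Counter()
    for units in names:
        pair_counts.update(zip(units, units[1:]))
    if not pair_counts:
        return names, None
    (a, b), _ = pair_counts.most_common(1)[0]

    merged_names = []
    for units in names:
        out, i = [], 0
        while i < len(units):
            if i + 1 < len(units) and units[i] == a and units[i + 1] == b:
                out.append(a + b)  # join the pair into one subword
                i += 2
            else:
                out.append(units[i])
                i += 1
        merged_names.append(out)
    return merged_names, (a, b)


def oov_rate(inventory, test_names):
    """Fraction of test subword tokens that are missing from the inventory."""
    tokens = [u for units in test_names for u in units]
    if not tokens:
        return 0.0
    return sum(u not in inventory for u in tokens) / len(tokens)


# Toy data: romanized syllables standing in for Thai name syllables.
train = [["som", "chai"], ["som", "sak"], ["som", "chai", "ya"]]
merged, pair = merge_most_frequent_pair(train)
inventory = {u for units in merged for u in units}
test = [["som", "sak"], ["wi", "chai"]]
rate = oov_rate(inventory, test)
```

Repeating the merge step yields units of several sizes. A class-based language model would then be trained over classes of such position-labeled units instead of over the units themselves, which this sketch does not attempt.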
Bibliographic reference. Saykhum, Kwanchiva / Boonpiam, Vataya / Thatphithakkul, Nattanun / Wutiwiwatchai, Chai / Natthee, Cholwich (2008): "Thai named-entity recognition using class-based language modeling on multiple-sized subword units", In INTERSPEECH-2008, 1586-1589.