ISCA Archive Interspeech 2008

Thai named-entity recognition using class-based language modeling on multiple-sized subword units

Kwanchiva Saykhum, Vataya Boonpiam, Nattanun Thatphithakkul, Chai Wutiwiwatchai, Cholwich Natthee

This article presents early work on speech recognition of Thai named entities, which pose a crucial out-of-vocabulary word problem in broadcast news transcription. Motivated by an analysis of Thai name structure, a statistical class-based language model is applied to multiple-sized subword units with a constraint on subword positions. The subwords can be defined automatically from their statistics. The proposed model is evaluated on Thai person name recognition in broadcast news data. With a subword inventory built from a very large training set of Thai names, only 0.7% of the subwords in the test set are out of vocabulary. The best configured system, which incorporates both syllable merging and subword clustering algorithms, achieves approximately 40% syllable accuracy, with 25% of names fully discovered.
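
For readers unfamiliar with class-based language modeling, the standard bigram factorization is sketched below. This is the generic textbook form, not a formula taken from the paper: the abstract does not state the n-gram order or how classes are defined, so the bigram order and the symbols $u_i$ (the $i$-th subword unit of a name) and $c(u_i)$ (the class assigned to that unit) are assumptions made for illustration only.

```latex
% Generic class-based bigram over subword units (illustrative assumption):
% u_i    = i-th subword unit in a name
% c(u_i) = class of that subword unit
P(u_i \mid u_{i-1}) \approx P\bigl(u_i \mid c(u_i)\bigr)\, P\bigl(c(u_i) \mid c(u_{i-1})\bigr)
```

Under this reading, a subword seen rarely in training can still receive a reasonable probability through its class, which is one plausible reason such a model suits open-ended name vocabularies.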


doi: 10.21437/Interspeech.2008-263

Cite as: Saykhum, K., Boonpiam, V., Thatphithakkul, N., Wutiwiwatchai, C., Natthee, C. (2008) Thai named-entity recognition using class-based language modeling on multiple-sized subword units. Proc. Interspeech 2008, 1586-1589, doi: 10.21437/Interspeech.2008-263

@inproceedings{saykhum08_interspeech,
  author={Kwanchiva Saykhum and Vataya Boonpiam and Nattanun Thatphithakkul and Chai Wutiwiwatchai and Cholwich Natthee},
  title={{Thai named-entity recognition using class-based language modeling on multiple-sized subword units}},
  year=2008,
  booktitle={Proc. Interspeech 2008},
  pages={1586--1589},
  doi={10.21437/Interspeech.2008-263}
}