The amount of transcribed Thai broadcast news text available for training a language model is still very limited compared with that of other major languages. Since constructing a broadcast news corpus is very costly and time-consuming, newspaper text is often used to increase the size of the training text. This paper proposes a language model topic and style adaptation approach for a Thai broadcast news ASR system, using broadcast news and newspaper text. A rule-based speaking style classification method, based on the presence of certain specific words, is applied to classify the training text. Various kinds of language models adapted to topics and styles are studied and shown to reduce both test set perplexity and recognition error rate. The results also show that written-style text from newspapers can be employed to alleviate the sparseness of the broadcast news corpus, while spoken-style text from the broadcast news corpus is still essential for building a reliable language model.
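The rule-based style classification mentioned in the abstract can be illustrated with a minimal sketch. It assumes that sentences are already word-segmented and that a sentence counts as spoken style when it contains at least one marker word. The marker list `SPOKEN_MARKERS` and the function names are hypothetical and are not taken from the paper, whose actual word list and rules are not given here.

```python
# Hypothetical marker words (Thai politeness particles) used only for
# illustration; the paper's actual word list is not given in the abstract.
SPOKEN_MARKERS = {"ครับ", "ค่ะ", "นะคะ"}


def classify_style(tokens, markers=SPOKEN_MARKERS):
    """Label a tokenized sentence 'spoken' if any marker word occurs in it,
    otherwise 'written'. Assumes the input is already word-segmented."""
    return "spoken" if any(tok in markers for tok in tokens) else "written"


def split_by_style(sentences, markers=SPOKEN_MARKERS):
    """Partition tokenized sentences into spoken-style and written-style sets,
    e.g. to train separate language models before combining them."""
    buckets = {"spoken": [], "written": []}
    for tokens in sentences:
        buckets[classify_style(tokens, markers)].append(tokens)
    return buckets
```

Under these assumptions, the two resulting subsets could each be used to train a style-specific language model, which is one plausible way to realize the style adaptation the abstract describes.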
Bibliographic reference. Jongtaveesataporn, Markpong / Furui, Sadaoki (2010): "Topic and style-adapted language modeling for Thai broadcast news ASR", In INTERSPEECH-2010, 1828-1831.