International Symposium on Chinese Spoken Language Processing (ISCSLP 2002)

Taipei, Taiwan
August 23-24, 2002

Improving Language Modeling by Combining Heterogeneous Corpora

Zheng-Yu Zhou (1), Jian-Feng Gao (2), Eric Chang (2)

(1) Fudan University, Shanghai, China
(2) Microsoft Research Asia, Beijing, China

In statistical language modeling, simply adding more training data (e.g., text collected from websites) does not always improve a language model, because the data may be ill-suited to the application or may contain errors. This paper presents a method for combining multiple heterogeneous corpora to improve the resulting language models, called the compressed context-dependent interpolation scheme. The basic idea is not only to filter for good data but also to balance it against the rest of the training data, giving greater emphasis to data that better matches real usage scenarios or better balances the overall training set. Our experiments show improved accuracy in phone-character conversion.
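
The idea of weighting corpora rather than pooling them can be illustrated with a minimal sketch. The abstract does not specify the compressed context-dependent scheme, so the code below is not the authors' method: it shows only the simpler, context-independent baseline, in which one unigram model is trained per corpus and the interpolation weights are estimated by EM on held-out text that resembles the target application. All function names and the toy data are hypothetical.

```python
from collections import Counter
import math


def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def em_weights(models, heldout, iters=50):
    """Estimate one interpolation weight per model by EM on held-out tokens.

    Assumption: weights are context-independent (a single weight per corpus),
    unlike the context-dependent scheme named in the abstract.
    """
    k = len(models)
    weights = [1.0 / k] * k
    for _ in range(iters):
        expected = [0.0] * k
        for w in heldout:
            # E-step: posterior probability that each model generated w.
            parts = [weights[i] * models[i][w] for i in range(k)]
            z = sum(parts)
            for i in range(k):
                expected[i] += parts[i] / z
        # M-step: new weight is the average posterior over held-out tokens.
        weights = [e / len(heldout) for e in expected]
    return weights


def interpolate(models, weights, vocab):
    """Linear interpolation: P(w) = sum_i weight_i * P_i(w)."""
    return {w: sum(lam * m[w] for lam, m in zip(weights, models)) for w in vocab}


def perplexity(model, tokens):
    """Per-token perplexity of a unigram model on a token sequence."""
    log_sum = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_sum / len(tokens))


# Toy corpora (hypothetical): a small in-domain corpus and a mismatched web corpus.
clean = "a b a b a c".split()
web = "c c d d c d".split()
heldout = "a b a c a b".split()  # stands in for real usage data

vocab = sorted(set(clean) | set(web) | set(heldout))
models = [unigram_model(clean, vocab), unigram_model(web, vocab)]

weights = em_weights(models, heldout)
mixed = interpolate(models, weights, vocab)
uniform = interpolate(models, [0.5, 0.5], vocab)
```

In this toy setting the corpus that better matches the held-out text receives the larger weight, and the EM-weighted mixture has held-out perplexity no higher than the equal-weight mixture. A context-dependent scheme would presumably let the weights vary with the n-gram history, at the cost of many more parameters.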


Bibliographic reference. Zhou, Zheng-Yu / Gao, Jian-Feng / Chang, Eric (2002): "Improving language modeling by combining heterogeneous corpora", in ISCSLP 2002, paper 77.