Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

How to Choose Training Set for Language Modeling

Hong Zhang, Bo Xu, Taiyi Huang

National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China

This paper investigates the problem of choosing the training set for language modeling in a large vocabulary continuous speech recognition system. Our investigation shows that language style is more important than domain in language modeling. As long as the language style is kept similar, extending the domain is not harmful. On the contrary, under this condition, expanding the size of the training set improves the quality of the language model. Diversity of language styles in the training set, however, degrades the language model. An analysis of the correlation between CER and the evaluation measures of the language model indicates that perplexity correlates strongly with CER when the domain and the language style are the same and the full model is used without cutoffs. Otherwise this correlation is weakened. Another evaluation measure in our investigation, the N-gram hitting rate, performs similarly to perplexity. For the back-off trigram model, the bigram hitting rate correlates more strongly with CER than the trigram hitting rate, which is relevant to reducing the size of the language model.
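As a rough illustration of the N-gram hitting rate mentioned above, the sketch below assumes it means the fraction of N-grams in a test text that also occur in the training text. That definition, the function names `ngrams` and `hit_rate`, and the toy sentences are assumptions made for illustration; they are not taken from the paper.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hit_rate(train_sentences, test_sentences, n):
    """Fraction of test n-grams that were seen in the training sentences.

    Assumed definition: an n-gram "hits" if it occurs at least once
    in the training data, so no back-off would be needed for it.
    """
    seen = set()
    for sent in train_sentences:
        seen.update(ngrams(sent, n))

    total = 0
    hits = 0
    for sent in test_sentences:
        for gram in ngrams(sent, n):
            total += 1
            if gram in seen:
                hits += 1
    return hits / total if total else 0.0


# Toy data, invented purely for this example.
train = [["we", "train", "the", "model"], ["the", "model", "works"]]
test = [["we", "train", "the", "system"]]

bigram_rate = hit_rate(train, test, 2)   # 2 of 3 test bigrams seen
trigram_rate = hit_rate(train, test, 3)  # 1 of 2 test trigrams seen
```

Under this assumed definition, a lower rate means the model has to back off to shorter histories more often on the test text.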


Bibliographic reference.  Zhang, Hong / Xu, Bo / Huang, Taiyi (2000): "How to choose training set for language modeling", In ICSLP-2000, vol.2, 523-526.