ISCA Archive ICSLP 2000

N-gram distribution based language model adaptation

Jianfeng Gao, Mingjing Li, Kai-Fu Lee

This paper presents two techniques for language model (LM) adaptation. The first aims to build a more general LM. We propose distribution-based pruning of n-gram LMs, in which we prune n-grams that are likely to be infrequent in a new document. Experimental results show that the distribution-based pruning method performs up to 9% better (in word perplexity reduction) than conventional cutoff methods. Moreover, the pruning method yields a more general n-gram backoff model, despite domain, style, or temporal bias in the training data.
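The abstract does not spell out the pruning criterion, so the following is only a hypothetical sketch of the contrast it draws: a conventional count cutoff drops n-grams by raw frequency, while a distribution-based rule could instead keep an n-gram only if it appears in enough distinct documents to be likely in a new one. All function names and thresholds here are assumptions, not the paper's method.

```python
def count_cutoff_prune(ngram_counts, cutoff=1):
    """Conventional pruning: drop n-grams whose total count <= cutoff."""
    return {g: c for g, c in ngram_counts.items() if c > cutoff}

def distribution_based_prune(ngram_doc_counts, num_docs, min_doc_prob=0.05):
    """Hypothetical distribution-based pruning: keep an n-gram only if it
    occurs in a large enough fraction of training documents, i.e. it is
    likely to recur in a new document (illustrative criterion only)."""
    return {g: d for g, d in ngram_doc_counts.items()
            if d / num_docs >= min_doc_prob}
```

Note how the two rules can disagree: an n-gram with a high total count concentrated in one document survives the count cutoff but not the distribution-based rule.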

The second aims to build a more task-specific LM. We propose an n-gram distribution adaptation method for LM training. Given a large set of out-of-task training data, called the training set, and a small set of task-specific training data, called the seed set, we adapt the LM towards the task by adjusting the n-gram distribution in the training set to match that in the seed set. Experimental results show non-trivial improvements over conventional methods.
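The abstract does not give the exact adjustment formula, but one simple adaptation scheme in this spirit is to interpolate the relative n-gram frequencies of the training and seed sets. The sketch below assumes linear interpolation with a mixing weight `alpha`; the paper's actual adjustment may differ.

```python
def adapt_counts(train_counts, seed_counts, alpha=0.5):
    """Hypothetical adaptation: interpolate the relative n-gram frequencies
    of a large out-of-task training set and a small task-specific seed set.
    Returns an adapted probability for every n-gram seen in either set."""
    t_total = sum(train_counts.values())
    s_total = sum(seed_counts.values())
    adapted = {}
    for g in set(train_counts) | set(seed_counts):
        p_train = train_counts.get(g, 0) / t_total
        p_seed = seed_counts.get(g, 0) / s_total
        adapted[g] = (1 - alpha) * p_train + alpha * p_seed
    return adapted
```

With `alpha = 0` the result is the unadapted training distribution; with `alpha = 1` it collapses to the seed distribution, so `alpha` controls how strongly the LM is pulled towards the task.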

doi: 10.21437/ICSLP.2000-123

Cite as: Gao, J., Li, M., Lee, K.-F. (2000) N-gram distribution based language model adaptation. Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000), vol. 1, 497-500, doi: 10.21437/ICSLP.2000-123

@inproceedings{gao00_icslp,
  author={Jianfeng Gao and Mingjing Li and Kai-Fu Lee},
  title={{N-gram distribution based language model adaptation}},
  booktitle={Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000)},
  year={2000},
  volume={1},
  pages={497--500},
  doi={10.21437/ICSLP.2000-123}
}