ISCA Archive ISCSLP 2008

PLSA Based Topic Mixture Language Modeling Approach

Shuan-Hu Bai, Hai-Zhou Li

In this paper, we propose a method that extends the use of latent topics to higher-order n-gram models. In training, the parameters of the higher-order n-gram models are estimated from discounted average counts derived by applying probabilistic latent semantic analysis (PLSA) models to the n-gram counts of the training corpus. In decoding, a simple yet efficient topic prediction method is introduced to predict the topic of a new document. The proposed topic mixture language model (TMLM) has two advantages over previous methods: 1) it can build topic mixture n-gram LMs (n > 1), and 2) it does not require a large general baseline LM. The experimental results show that TMLMs, even with a smaller number of topics, outperform LMs built with both the standard n-gram approach and unsupervised adaptation approaches in terms of perplexity reduction.

Index Terms— language modeling, topic mixture language model, PLSA
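The core idea of a topic mixture LM can be sketched as interpolating topic-conditional n-gram probabilities with document-level topic weights, i.e. P(w | h, d) = Σ_z P(z | d) · P_z(w | h). The toy counts, the lack of smoothing, and the hard-coded topic weights below are illustrative assumptions only, not the paper's actual estimation or topic prediction procedure.

```python
# Minimal sketch of a topic mixture bigram LM:
#   P(w | h, d) = sum_z P(z | d) * P_z(w | h)
# Topic-conditional counts and topic weights here are toy values.

# Toy topic-conditional bigram counts: topic -> (history, word) -> count.
topic_bigram_counts = {
    "sports":  {("the", "game"): 8, ("the", "market"): 1, ("game", "ended"): 5},
    "finance": {("the", "game"): 1, ("the", "market"): 9, ("market", "fell"): 6},
}

def topic_bigram_prob(topic, history, word):
    """Maximum-likelihood P_z(w | h) from the toy counts (no smoothing)."""
    counts = topic_bigram_counts[topic]
    total = sum(c for (h, _), c in counts.items() if h == history)
    if total == 0:
        return 0.0
    return counts.get((history, word), 0) / total

def mixture_prob(topic_weights, history, word):
    """Topic mixture probability: sum over topics of P(z|d) * P_z(w|h)."""
    return sum(w_z * topic_bigram_prob(z, history, word)
               for z, w_z in topic_weights.items())

# Suppose topic prediction on a new document yielded these weights P(z | d).
weights = {"sports": 0.8, "finance": 0.2}
p = mixture_prob(weights, "the", "game")  # blends the two topic bigram models
```

In the paper's setting, the topic-conditional models would be higher-order n-grams estimated from PLSA-derived discounted average counts, and P(z | d) would come from the proposed topic prediction step rather than being fixed by hand.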


Cite as: Bai, S.-H., Li, H.-Z. (2008) PLSA Based Topic Mixture Language Modeling Approach. Proc. International Symposium on Chinese Spoken Language Processing, 185–188.

@inproceedings{bai08_iscslp,
  author={Shuan-Hu Bai and Hai-Zhou Li},
  title={{PLSA Based Topic Mixture Language Modeling Approach}},
  year=2008,
  booktitle={Proc. International Symposium on Chinese Spoken Language Processing},
  pages={185--188}
}