ISCA Archive Interspeech 2008

Bayesian latent topic clustering model

Meng-Sung Wu, Jen-Tzung Chien

Document modeling is important for document retrieval and categorization. Probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) are popular document modeling paradigms in which word/document correlations are inferred through latent topics. In PLSA and LDA, unseen words and unseen documents are not explicitly represented at the same time, which constrains model generalization. This paper presents the Bayesian latent topic clustering (BLTC) model for document representation. Posterior distributions, formed by combining Dirichlet priors with multinomial distributions, are calculated not only at the document level but also at the word level. This addresses the modeling of unseen words and documents. An efficient variational inference method based on Gibbs sampling is presented to calculate the posterior probability of complex variables. In experiments on TREC and Reuters-21578, the proposed BLTC performs better than PLSA and LDA in model perplexity and classification accuracy.
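As a rough illustration of two ideas the abstract relies on, the sketch below shows how a Dirichlet prior combined with multinomial counts gives nonzero probability to a word never seen in training, and how perplexity is computed on held-out text. It is not the BLTC model or the authors' inference procedure: it has no latent topics, and the function names, the toy vocabulary, and the value `alpha=0.1` are all assumptions made for the example.

```python
import math
from collections import Counter


def dirichlet_posterior_mean(counts, vocab, alpha=0.1):
    """Posterior mean of a multinomial under a symmetric Dirichlet(alpha) prior.

    Each word receives alpha pseudo-counts, so a word with zero observed
    count still gets nonzero probability. (alpha is an assumed value.)
    """
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}


def perplexity(probs, tokens):
    """Perplexity: exp of the negative mean log-likelihood per token."""
    log_lik = sum(math.log(probs[w]) for w in tokens)
    return math.exp(-log_lik / len(tokens))


# Toy data (hypothetical): "document" never appears in the training text.
vocab = ["topic", "model", "word", "document"]
train = ["topic", "model", "topic", "word"]

probs = dirichlet_posterior_mean(Counter(train), vocab, alpha=0.1)
held_out = ["document", "topic"]
ppl = perplexity(probs, held_out)
```

Without the prior, the unseen word "document" would have probability zero and the held-out perplexity would be undefined; lower perplexity indicates a better fit to held-out text.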

doi: 10.21437/Interspeech.2008-566

Cite as: Wu, M.-S., Chien, J.-T. (2008) Bayesian latent topic clustering model. Proc. Interspeech 2008, 2162-2165, doi: 10.21437/Interspeech.2008-566

  author={Meng-Sung Wu and Jen-Tzung Chien},
  title={{Bayesian latent topic clustering model}},
  booktitle={Proc. Interspeech 2008},