9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Bayesian Latent Topic Clustering Model

Meng-Sung Wu, Jen-Tzung Chien

National Cheng Kung University, Taiwan

Document modeling is important for document retrieval and categorization. Probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) are popular document models in which word/document correlations are inferred through latent topics. Neither PLSA nor LDA explicitly represents unseen words and unseen documents at the same time, which constrains model generalization. This paper presents the Bayesian latent topic clustering (BLTC) model for document representation. Posterior distributions, formed by combining Dirichlet priors with multinomial distributions, are calculated not only at the document level but also at the word level. This addresses the modeling of both unseen words and unseen documents. An efficient variational inference method based on Gibbs sampling is presented to calculate the posterior probabilities of the complex variables. In experiments on TREC and Reuters-21578, the proposed BLTC outperforms PLSA and LDA in both model perplexity and classification accuracy.
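As an illustration of two ingredients named in the abstract, the following minimal sketch shows how a Dirichlet prior combined with multinomial counts assigns nonzero probability to an unseen word, and how perplexity is computed on held-out text. It is not the BLTC model or its inference procedure. The function names, the three-word vocabulary, the counts, and the prior value 0.5 are all hypothetical and chosen only for the example.

```python
import math

def dirichlet_posterior_mean(counts, alpha):
    """Posterior mean of multinomial parameters under a Dirichlet prior.

    With prior Dirichlet(alpha) and observed counts n, the posterior is
    Dirichlet(alpha + n), whose mean is (alpha_k + n_k) / sum(alpha + n).
    """
    total = sum(counts) + sum(alpha)
    return [(c + a) / total for c, a in zip(counts, alpha)]

def perplexity(word_ids, probs):
    """Perplexity: exp of the negative mean log-likelihood per word."""
    log_lik = sum(math.log(probs[w]) for w in word_ids)
    return math.exp(-log_lik / len(word_ids))

# Hypothetical toy data: a 3-word vocabulary where word 2 never occurs.
counts = [3, 1, 0]
alpha = [0.5, 0.5, 0.5]   # assumed symmetric Dirichlet prior
theta = dirichlet_posterior_mean(counts, alpha)

# Held-out text containing the unseen word 2; the prior keeps its
# probability above zero, so the perplexity stays finite.
held_out = [0, 2, 1]
ppl = perplexity(held_out, theta)
```

Without the prior, the maximum-likelihood estimate would give word 2 probability zero and the held-out perplexity would be undefined. Lower perplexity indicates that a model predicts held-out text better.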


Bibliographic reference.  Wu, Meng-Sung / Chien, Jen-Tzung (2008): "Bayesian latent topic clustering model", In INTERSPEECH-2008, 2162-2165.