8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Language Model Adaptation Using Latent Dirichlet Allocation and an Efficient Topic Inference Algorithm

Aaron Heidel (1), Hung-an Chang (2), Lin-shan Lee (1)

(1) National Taiwan University, Taiwan
(2) MIT, USA

We present an effort to perform topic mixture-based language model adaptation using latent Dirichlet allocation (LDA). We use probabilistic latent semantic analysis (PLSA) to automatically cluster a heterogeneous training corpus, and train an LDA model using the resultant topic-document assignments. Using this LDA model, we then construct topic-specific corpora at the utterance level for interpolation with a background language model during language model adaptation. We also present a novel iterative algorithm for LDA topic inference. Very encouraging results were obtained in preliminary experiments with broadcast news in Mandarin Chinese.
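The adaptation scheme the abstract outlines, inferring a per-document topic mixture under a trained LDA model and interpolating the resulting topic-weighted probabilities with a background language model, can be sketched at the unigram level as below. This is a generic illustration, not the paper's own inference algorithm: it uses a simple EM-style fixed-point update for the topic proportions (the function names, the `alpha`/`lam` parameters, and the MAP-style update are assumptions for the sketch), whereas the paper proposes a novel, more efficient iterative inference procedure.

```python
def infer_topic_mixture(word_counts, beta, alpha=1.0, iters=100):
    """Estimate topic proportions theta for one document under a fixed
    LDA model via a simple EM-style fixed-point iteration (illustrative;
    not the paper's algorithm).

    word_counts: {word: count} for the document
    beta[k]:     {word: P(word | topic k)} for each of K topics
    alpha:       symmetric Dirichlet prior (>= 1 for a MAP-style update)
    """
    K = len(beta)
    theta = [1.0 / K] * K  # start from a uniform mixture
    for _ in range(iters):
        stats = [alpha - 1.0] * K  # prior pseudo-counts
        for w, n in word_counts.items():
            # responsibility of each topic for word w under current theta
            p = [theta[k] * beta[k].get(w, 1e-12) for k in range(K)]
            z = sum(p)
            for k in range(K):
                stats[k] += n * p[k] / z
        total = sum(stats)
        theta = [s / total for s in stats]
    return theta

def adapted_prob(w, theta, beta, p_bg, lam=0.5):
    """Linearly interpolate the topic-mixture unigram probability with a
    background LM probability p_bg (interpolation weight lam assumed)."""
    p_topic = sum(theta[k] * beta[k].get(w, 0.0) for k in range(len(beta)))
    return lam * p_topic + (1.0 - lam) * p_bg.get(w, 0.0)
```

For example, a document whose counts concentrate on one topic's vocabulary should yield a mixture dominated by that topic, and the adapted probability of its topical words rises accordingly relative to the background model alone. A real system would apply the same interpolation to higher-order n-gram probabilities, not just unigrams.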


Bibliographic reference. Heidel, Aaron / Chang, Hung-an / Lee, Lin-shan (2007): "Language model adaptation using latent Dirichlet allocation and an efficient topic inference algorithm", in INTERSPEECH-2007, 2361-2364.