Sixth European Conference on Speech Communication and Technology

Budapest, Hungary
September 5-9, 1999

Topic-Based Language Models Using EM

Daniel Gildea, Thomas Hofmann

University of California, Berkeley, and International Computer Science Institute, Berkeley, CA, USA

In this paper, we propose a novel statistical language model to capture topic-related long-range dependencies. Topics are modeled in a latent variable framework, within which we derive an EM algorithm that performs a topic factor decomposition on a segmented training corpus. The topic model is combined with a standard language model for on-line word prediction. Perplexity results indicate an improvement over previously proposed topic models, but this improvement has unfortunately not translated into a lower word error rate.
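The abstract does not state the model's equations, so the following is only a minimal sketch of what an EM-based topic factor decomposition can look like. It assumes a mixture of the form P(w | d) = sum over t of P(w | t) P(t | d), where d indexes the segments of the training corpus; this form, the function name `em_topic_decomposition`, and its parameters are illustrative assumptions, not details taken from the paper. The combination with a standard language model is not shown.

```python
import random


def em_topic_decomposition(counts, num_topics, iterations=50, seed=0):
    """Fit P(w | t) and P(t | d) by EM (hypothetical helper, not the authors' code).

    counts[d][w] is the number of times word w occurs in training segment d.
    Assumed model: P(w | d) = sum_t P(w | t) * P(t | d).
    """
    rng = random.Random(seed)
    num_docs, vocab_size = len(counts), len(counts[0])

    def random_distribution(n):
        # Strictly positive random values, normalized to sum to 1.
        row = [rng.random() + 0.1 for _ in range(n)]
        total = sum(row)
        return [x / total for x in row]

    p_w_t = [random_distribution(vocab_size) for _ in range(num_topics)]  # P(w | t)
    p_t_d = [random_distribution(num_topics) for _ in range(num_docs)]    # P(t | d)

    for _ in range(iterations):
        acc_w_t = [[0.0] * vocab_size for _ in range(num_topics)]
        acc_t_d = [[0.0] * num_topics for _ in range(num_docs)]

        # E-step: posterior P(t | d, w), weighted by the observed counts.
        for d in range(num_docs):
            for w in range(vocab_size):
                n = counts[d][w]
                if n == 0:
                    continue
                joint = [p_w_t[t][w] * p_t_d[d][t] for t in range(num_topics)]
                norm = sum(joint)
                for t in range(num_topics):
                    expected = n * joint[t] / norm
                    acc_w_t[t][w] += expected
                    acc_t_d[d][t] += expected

        # M-step: renormalize the expected counts into distributions.
        for t in range(num_topics):
            total = sum(acc_w_t[t])
            if total > 0:
                p_w_t[t] = [x / total for x in acc_w_t[t]]
        for d in range(num_docs):
            total = sum(acc_t_d[d])
            if total > 0:
                p_t_d[d] = [x / total for x in acc_t_d[d]]

    return p_w_t, p_t_d


# Toy corpus (made up for illustration): four segments over a four-word vocabulary.
toy_counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 6]]
word_given_topic, topic_given_segment = em_topic_decomposition(toy_counts, num_topics=2)
```

For on-line word prediction, a model of this kind would additionally need to estimate the topic weights P(t | d) for the current history as words arrive, which the sketch leaves out.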

Bibliographic reference. Gildea, Daniel / Hofmann, Thomas (1999): "Topic-based language models using EM", in EUROSPEECH'99, 2167-2170.