Topic model is an important approach for language model (LM) adaptation and has attracted research interest for a long time. Latent Dirichlet Allocation (LDA), which assumes generative Dirichlet distribution with bag-of-word features for hidden topics, has been widely used as the state-of-the-art topic model. Inspired by recent development of a new paradigm of distributed paragraph representation called paragraph vector, a new topic model based on paragraph vector is proposed in this work. During training, each paragraph is mapped to a unique vector in continuous space. Then unsupervised clustering is performed to construct topic clusters. Topic-specific LM is then built based on clustering results. During adaptation, topic posterior is first estimated using the paragraph vector based topic model and new adapted LMs are constructed by interpolating the existing topic-specific models using topic posteriors. The proposed topic model is applied for N-gram LM adaptation and evaluated on Amazon Product Review Corpus for perplexity and a Chinese LVCSR task for CER evaluation. Results show that the proposed approach yields 11.1% relative perplexity reduction and 1.4% relative CER reduction over N-gram baseline, outperforming LDA based method proposed by previous work.
Bibliographic reference. Jin, Wengong / He, Tianxing / Qian, Yanmin / Yu, Kai (2015): "Paragraph vector based topic model for language model adaptation", In INTERSPEECH-2015, 3516-3520.