11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Integration of Cache-Based Model and Topic Dependent Class Model with Soft Clustering and Soft Voting

Welly Naptali, Masatoshi Tsuchiya, Seiichi Nakagawa

Toyohashi University of Technology, Japan

A topic dependent class (TDC) language model (LM) is a topic-based LM that uses a semantic extraction method to reveal latent topic information from the relations between nouns. We have previously shown that TDC models outperform several state-of-the-art baseline models. This paper makes two separate contributions. First, we further improve the TDC model by incorporating a cache-based LM through unigram scaling. Experiments on the Wall Street Journal (WSJ) and Japanese newspaper (Mainichi Shimbun) corpora show that this combination significantly improves the model in terms of perplexity. Second, a stand-alone TDC model suffers because the training corpus available to each topic shrinks as the number of topics increases. We solve this problem by performing soft clustering in the training phase and soft voting in the test phase. Experimental results on the WSJ corpus show that the resulting TDC model outperforms the baseline even without interpolation with a word-based n-gram model.
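As an illustration of the unigram scaling mentioned in the abstract, the sketch below uses the form of the technique that is common in the LM adaptation literature: each word's base probability is multiplied by the ratio of its cache unigram probability to its background unigram probability, raised to a power beta, and the result is renormalized. The abstract does not give the paper's exact formulation, so the function names, the add-alpha cache smoothing, the value of beta, and the toy vocabulary are all assumptions made for illustration only.

```python
from collections import Counter


def cache_unigram(history, vocab, alpha=0.1):
    """Add-alpha smoothed unigram distribution over recently seen words.

    The smoothing scheme is an assumption; it only keeps unseen words
    from receiving zero probability.
    """
    counts = Counter(w for w in history if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def unigram_scale(base_probs, background, cache, beta=0.5):
    """Rescale P(w | h) by (P_cache(w) / P_background(w)) ** beta, then renormalize.

    base_probs stands in for the next-word distribution of any base model
    (here it would be the TDC model); beta = 0 leaves it unchanged.
    """
    scaled = {
        w: p * (cache[w] / background[w]) ** beta
        for w, p in base_probs.items()
    }
    z = sum(scaled.values())
    return {w: p / z for w, p in scaled.items()}


# Toy example with hypothetical numbers.
vocab = ["stock", "market", "rain", "the"]
background = {"stock": 0.2, "market": 0.2, "rain": 0.2, "the": 0.4}
base = {"stock": 0.25, "market": 0.25, "rain": 0.25, "the": 0.25}
history = ["stock", "market", "stock"]

cache = cache_unigram(history, vocab)
adapted = unigram_scale(base, background, cache, beta=0.5)
```

In this toy setting, words that occurred in the recent history ("stock", "market") gain probability mass at the expense of words that did not, which is the intended effect of combining a cache with a base model.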


Bibliographic reference. Naptali, Welly / Tsuchiya, Masatoshi / Nakagawa, Seiichi (2010): "Integration of cache-based model and topic dependent class model with soft clustering and soft voting", in INTERSPEECH-2010, 2430-2433.