5th European Conference on Speech Communication and Technology

Rhodes, Greece
September 22-25, 1997

Adaptive Topic-Dependent Language Modelling Using Word-Based Varigrams

Sven C. Martin, Jörg Liermann, Hermann Ney

Lehrstuhl für Informatik VI, RWTH Aachen, University of Technology, Aachen, Germany

This paper presents two extensions of the standard interpolated word trigram and cache model: the extension of the trigram model by useful word m-grams with m > 3, which yields a varigram model, and the addition of topic-specific trigram models. We give the criteria for selecting useful m-grams and for partitioning the training corpus into topic-specific subcorpora. We apply both extensions, separately and in combination, to corpora of 4 and 39 million words taken from the Wall Street Journal Corpus and show that perplexity is reduced by up to 19% on the larger corpus. We also performed recognition experiments.
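
For orientation, the baseline named in the abstract (a static model linearly interpolated with a cache) and the perplexity measure used to evaluate it can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names, the interpolation weight `lam`, the unigram form of the cache, and the uniform stand-in for the trigram model are all hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def interpolated_prob(word, static_prob, cache, lam=0.9):
    """Linearly interpolate a static model probability with a unigram cache.

    `lam` is an assumed interpolation weight, not a value from the paper.
    While the cache is empty, the static probability is used on its own.
    """
    total = sum(cache.values())
    if total == 0:
        return static_prob
    cache_prob = cache[word] / total
    return lam * static_prob + (1.0 - lam) * cache_prob


def perplexity(words, static_model, lam=0.9):
    """Perplexity of `words`: exp of the negative mean log probability.

    `static_model(history, word)` is a hypothetical callable returning
    p(word | two preceding words); any trigram model could be plugged in.
    """
    cache = Counter()
    log_sum = 0.0
    for i, word in enumerate(words):
        history = tuple(words[max(0, i - 2):i])
        p = interpolated_prob(word, static_model(history, word), cache, lam)
        log_sum += math.log(p)
        cache[word] += 1  # the cache adapts to the text seen so far
    return math.exp(-log_sum / len(words))


def uniform_model(history, word, vocab_size=4):
    """Toy stand-in for a trigram model: uniform over an assumed vocabulary."""
    return 1.0 / vocab_size


if __name__ == "__main__":
    text = ["a", "a", "a", "a"]
    print(perplexity(text, uniform_model, lam=1.0))  # static model only
    print(perplexity(text, uniform_model, lam=0.5))  # static model plus cache
```

On repetitive text the cache raises the probability of recently seen words, so the interpolated model yields a lower perplexity than the static model alone; the paper's varigram and topic-specific extensions target the static component of such a combination.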

