Sixth European Conference on Speech Communication and Technology

Budapest, Hungary
September 5-9, 1999

Assessment of Smoothing Methods and Complex Stochastic Language Modeling

Sven Martin, Christoph Hamacher, Jörg Liermann, Frank Wessel, Hermann Ney

Lehrstuhl für Informatik VI, RWTH Aachen - University of Technology, Aachen, Germany

This paper studies the overall effect of language modeling on perplexity and word error rate, starting from a trigram model with a standard smoothing method and extending to complex state-of-the-art language models:
(1) We compare different smoothing methods, namely linear vs. absolute discounting, interpolation vs. backing-off, and back-off functions based on relative frequencies vs. singleton events.
(2) We show the effect of complex language model techniques by adding distance trigrams as well as word classes and word phrases that are selected automatically with a maximum likelihood criterion (i.e. minimum perplexity).
(3) We show the overall gain of the combined application of the above techniques, as opposed to their separate assessment in past publications.
(4) We give perplexity and word error rate results on the North American Business corpus (NAB) with a training text of about 240 million words and on the German Verbmobil corpus.
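To make the compared smoothing variants concrete, the following is a minimal sketch of one of them, absolute discounting with interpolation, shown for a bigram model together with the perplexity measure used to evaluate it. All function names and the constant discount value `d` are illustrative choices, not taken from the paper, and a real system would operate on trigrams over millions of words:

```python
import math
from collections import Counter

def train_bigram_abs_interp(tokens, d=0.5):
    """Bigram model with absolute discounting and interpolation (sketch):
        p(w|v) = max(N(v,w) - d, 0) / N(v)  +  d * B(v) / N(v) * p_uni(w)
    where N(v) is the count of v as a history, N(v,w) the bigram count,
    B(v) the number of distinct successor types of v, and p_uni the
    unigram distribution used as the (interpolated) back-off function."""
    unigrams = Counter(tokens)
    hist = Counter(tokens[:-1])                      # N(v): v as a history
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))  # N(v,w)
    followers = Counter(v for v, _ in bigrams)       # B(v): distinct successors
    total = len(tokens)

    def prob(v, w):
        p_uni = unigrams[w] / total
        if hist[v] == 0:                 # unseen history: fall back to unigram
            return p_uni
        discounted = max(bigrams[(v, w)] - d, 0.0) / hist[v]
        backoff_mass = d * followers[v] / hist[v]    # mass freed by discounting
        return discounted + backoff_mass * p_uni
    return prob

def perplexity(prob, tokens):
    """PP = exp(-(1/N) * sum_i log p(w_i | w_{i-1}))."""
    logp = sum(math.log(prob(v, w)) for v, w in zip(tokens[:-1], tokens[1:]))
    return math.exp(-logp / (len(tokens) - 1))
```

By construction the discounted probabilities and the redistributed back-off mass sum to one over the vocabulary for every seen history, which is the property that distinguishes interpolation and backing-off schemes from ad-hoc mixing.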


Bibliographic reference. Martin, Sven / Hamacher, Christoph / Liermann, Jörg / Wessel, Frank / Ney, Hermann (1999): "Assessment of smoothing methods and complex stochastic language modeling", in EUROSPEECH'99, 1939-1942.