Sixth European Conference on Speech Communication and Technology

Budapest, Hungary
September 5-9, 1999

Augmenting Words with Linguistic Information for N-gram Language Models

Lucian Galescu (1), Eric K. Ringger (2)

(1) Department of Computer Science, University of Rochester, NY, USA
(2) NLP Group, Microsoft Research, Redmond, WA, USA

This work explores the use of rich lexical information in language modelling. We reformulate the task of a language model: instead of predicting the next word given its history, the model simultaneously predicts both the word and a tag encoding various types of lexical information. Using part-of-speech tags and syntactic/semantic feature tags obtained with a set of NLP tools developed at Microsoft Research, we obtained a reduction in perplexity relative to the baseline phrase trigram model in a set of preliminary tests performed on part of the WSJ corpus.
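
The reformulation described above can be illustrated with a minimal sketch in which each word is fused with its tag into a single token and an ordinary trigram model is trained over those augmented tokens. Everything in it is an assumption made for illustration and is not taken from the paper: the `word/TAG` token format, the function names `augment`, `train_trigram`, and `perplexity`, the toy sentences and tags, and the add-one smoothing. The paper's phrase model, its tag sets, and its smoothing method are not reproduced here.

```python
import math
from collections import Counter

def augment(words, tags):
    """Fuse each word with its tag into one token, e.g. 'bank/NOUN' (hypothetical format)."""
    return [f"{w}/{t}" for w, t in zip(words, tags)]

def train_trigram(sentences):
    """Count trigrams and their two-token contexts over padded token sequences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        vocab.update(sent + ["</s>"])  # tokens the model can predict
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            ctx[tuple(toks[i - 2:i])] += 1
    return tri, ctx, len(vocab)

def perplexity(sentences, model):
    """Perplexity over augmented tokens with add-one smoothing (an illustrative choice)."""
    tri, ctx, v = model
    log_sum, n = 0.0, 0
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            p = (tri[tuple(toks[i - 2:i + 1])] + 1) / (ctx[tuple(toks[i - 2:i])] + v)
            log_sum += math.log(p)
            n += 1
    return math.exp(-log_sum / n)

# Toy data: the same word form receives different tags in different contexts.
train = [
    augment(["the", "bank", "closed"], ["DET", "NOUN", "VERB"]),
    augment(["we", "bank", "here"], ["PRON", "VERB", "ADV"]),
]
model = train_trigram(train)
print(perplexity(train, model))
```

Note that this sketch scores the joint word-and-tag sequence, so its perplexity is not directly comparable to that of a word-only model; a fair comparison would need the word probability obtained by summing over the possible tags.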

Keywords: speech recognition, statistical language modelling, n-gram models, phrase models, augmented-word models, POS tags, semantic/syntactic tags, NLPWin, WSJ corpus.


Bibliographic reference. Galescu, Lucian / Ringger, Eric K. (1999): "Augmenting words with linguistic information for n-gram language models", in EUROSPEECH'99, 2171-2174.