Sixth European Conference on Speech Communication and Technology

Budapest, Hungary
September 5-9, 1999

Linguistic Features for Whole Sentence Maximum Entropy Language Models

Xiaojin Zhu, Stanley F. Chen, Ronald Rosenfeld

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Whole sentence maximum entropy models directly model the probability of a sentence using features, which are arbitrary computable properties of the sentence. We investigate whether linguistic features that capture the underlying linguistic structure of a sentence can improve modeling. We use a shallow parser to parse sentences into linguistic constituents in two corpora: one is the original training corpus, and the other is an artificial corpus generated from an initial trigram model. We define three sets of candidate linguistic features based on these constituents, and compute the prevalence of each feature in the two corpora. We select the features whose frequencies differ significantly between the two. These features correspond to phenomena poorly modeled by traditional trigrams, and reveal interesting linguistic deficiencies of the initial model. We found 6798 such linguistic features in the Switchboard domain and achieved small improvements in perplexity and speech recognition accuracy with these features.
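The selection step described above (compare how often each candidate feature occurs in the real corpus and in the trigram-generated corpus, then keep the features whose frequencies differ significantly) can be sketched as follows. This is only an illustrative sketch: the abstract does not state which significance test was used, so the pooled two-proportion z statistic, the threshold of 3.0, the function names, and the toy counts are all assumptions made for this example.

```python
import math


def two_proportion_z(count_a, n_a, count_b, n_b):
    """Pooled two-proportion z statistic (an assumed test, not from the paper)."""
    p_a = count_a / n_a
    p_b = count_b / n_b
    pooled = (count_a + count_b) / (n_a + n_b)
    variance = pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b)
    if variance == 0.0:
        # Feature occurs in no sentence or in every sentence of both corpora.
        return 0.0
    return (p_a - p_b) / math.sqrt(variance)


def select_features(real_counts, n_real, synthetic_counts, n_synthetic,
                    threshold=3.0):
    """Keep features whose sentence-level frequency differs between corpora.

    real_counts / synthetic_counts map a feature name to the number of
    sentences in which it occurs; n_real / n_synthetic are corpus sizes
    in sentences. The threshold value is a hypothetical choice.
    """
    selected = {}
    for feature in set(real_counts) | set(synthetic_counts):
        z = two_proportion_z(real_counts.get(feature, 0), n_real,
                             synthetic_counts.get(feature, 0), n_synthetic)
        if abs(z) >= threshold:
            selected[feature] = z
    return selected


# Toy counts invented for illustration; the keys stand in for
# constituent-sequence features produced by a shallow parser.
real = {"NP VP": 400, "NP NP VP": 30, "VP VP": 50}
synthetic = {"NP VP": 390, "NP NP VP": 90, "VP VP": 52}
chosen = select_features(real, 1000, synthetic, 1000)
print(sorted(chosen))
```

In this toy data only "NP NP VP" passes the threshold, and its negative statistic indicates that the trigram-generated corpus produces that pattern more often than the real corpus does, which is the kind of discrepancy a whole sentence model could then correct with a dedicated feature.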


Bibliographic reference. Zhu, Xiaojin / Chen, Stanley F. / Rosenfeld, Ronald (1999): "Linguistic features for whole sentence maximum entropy language models", in EUROSPEECH'99, 1807-1810.