9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Bag-of-Word Normalized N-Gram Models

Abhinav Sethy, Bhuvana Ramabhadran

IBM T.J. Watson Research Center, USA

The Bag-of-Word (BOW) model uses a fixed-length vector of word counts to represent text. Although the model disregards word order, it has been shown to capture long-range word-word correlations and topic information successfully. In contrast, n-gram models have been shown to be an effective way to capture short-term dependencies by modeling text as a Markovian sequence. In this paper, we propose a probabilistic framework that combines BOW models with n-gram models. In the proposed framework, we normalize the n-gram model to build a model for word sequences given the corresponding bag-of-words representation. By combining the two models, the proposed approach captures latent topic information as well as local Markovian dependencies in text. Using the proposed model, we achieved a 10% reduction in perplexity and a 2% relative reduction in word error rate (WER) over a state-of-the-art baseline for English broadcast news transcription.
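The abstract does not give the exact formulation, so the following is only a minimal sketch of one way to read the normalization step: the n-gram probability of a word sequence is divided by the summed n-gram probability of every distinct ordering of the same bag of words, which yields a distribution over orderings given the bag. The function names, the add-alpha smoothed bigram model, and the brute-force enumeration of orderings are all assumptions made for illustration and are not taken from the paper. Enumeration is only feasible for very short sentences.

```python
from collections import Counter
from itertools import permutations
import math


def train_bigram(sentences, vocab, alpha=1.0):
    """Hypothetical helper: add-alpha smoothed bigram model.

    Returns a function prob(prev, word) giving P(word | prev).
    """
    bigrams = Counter()
    context = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            context[prev] += 1
    v = len(vocab) + 1  # vocabulary plus the end-of-sentence token

    def prob(prev, word):
        return (bigrams[(prev, word)] + alpha) / (context[prev] + alpha * v)

    return prob


def sequence_prob(prob, words):
    """Bigram probability of a full word sequence, including boundaries."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    return math.prod(prob(p, w) for p, w in zip(tokens, tokens[1:]))


def bow_normalized_prob(prob, words):
    """Assumed reading of the normalization: P(W | bag(W)).

    The n-gram score of W is divided by the total n-gram score of all
    distinct orderings of the same bag of words (brute force).
    """
    orderings = set(permutations(words))
    total = sum(sequence_prob(prob, o) for o in orderings)
    return sequence_prob(prob, words) / total


# Toy data, invented purely for illustration.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = {w for s in corpus for w in s}
bigram = train_bigram(corpus, vocab)
print(bow_normalized_prob(bigram, ["the", "cat", "sat"]))
```

Under this reading, a full sequence model could multiply the normalized term by a separate probability of the bag itself, for example from a topic model. That combination step is likewise an assumption here, not a detail stated in the abstract.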


Bibliographic reference. Sethy, Abhinav / Ramabhadran, Bhuvana (2008): "Bag-of-word normalized n-gram models", in INTERSPEECH-2008, 1594-1597.