ISCA Archive Interspeech 2008

Bag-of-word normalized n-gram models

Abhinav Sethy, Bhuvana Ramabhadran

The bag-of-words (BOW) model uses a fixed-length vector of word counts to represent text. Although the model disregards word sequence information, it has been shown to capture long-range word-word correlations and topic information successfully. In contrast, n-gram models have been shown to be an effective way to capture short-term dependencies by modeling text as a Markovian sequence. In this paper, we propose a probabilistic framework that combines BOW models with n-gram models. In the proposed framework, we normalize the n-gram model to build a model for word sequences given the corresponding bag-of-words representation. By combining the two models, the proposed approach captures latent topic information as well as local Markovian dependencies in text. Using the proposed model, we achieved a 10% reduction in perplexity and a 2% reduction in WER (relative) over a state-of-the-art baseline for transcribing broadcast news in English.
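The abstract does not give the model equations, so the following is only a minimal sketch of one plausible reading of "normalizing the n-gram model" given a bag of words: the n-gram probability of a word sequence is divided by the total n-gram probability of all distinct orderings of the same bag, which yields a conditional distribution P(W | bag(W)). The toy corpus, the add-alpha bigram model, and the function names `train_bigram`, `sequence_prob`, and `bow_normalized_prob` are assumptions made for illustration and are not taken from the paper. A full model would presumably multiply this term by a separate BOW probability P(bag(W)), which is not sketched here.

```python
from collections import Counter
from itertools import permutations
import math


def train_bigram(corpus, alpha=1.0):
    """Return an add-alpha smoothed bigram probability function (toy model)."""
    vocab = {w for sent in corpus for w in sent} | {"</s>"}
    context_counts = Counter()
    bigram_counts = Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    vocab_size = len(vocab)

    def prob(prev, cur):
        # Smoothed estimate of P(cur | prev).
        return (bigram_counts[(prev, cur)] + alpha) / (
            context_counts[prev] + alpha * vocab_size
        )

    return prob


def sequence_prob(prob, words):
    """Bigram probability of a full word sequence, including boundaries."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    return math.prod(prob(p, c) for p, c in zip(tokens, tokens[1:]))


def bow_normalized_prob(prob, words):
    """P(W | bag(W)): n-gram probability renormalized over all distinct
    orderings of the same bag of words (brute force, toy sizes only)."""
    orderings = set(permutations(words))
    total = sum(sequence_prob(prob, o) for o in orderings)
    return sequence_prob(prob, words) / total


# Hypothetical toy corpus, for illustration only.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
bigram = train_bigram(corpus)
print(bow_normalized_prob(bigram, ["the", "cat", "sat"]))
print(bow_normalized_prob(bigram, ["sat", "cat", "the"]))
```

Enumerating every ordering grows factorially with sentence length, so this brute-force normalizer is only practical for very short sequences. It is meant to show what the conditional distribution looks like, not how it would be computed at scale.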

doi: 10.21437/Interspeech.2008-265

Cite as: Sethy, A., Ramabhadran, B. (2008) Bag-of-word normalized n-gram models. Proc. Interspeech 2008, 1594-1597, doi: 10.21437/Interspeech.2008-265

  author={Abhinav Sethy and Bhuvana Ramabhadran},
  title={{Bag-of-word normalized n-gram models}},
  booktitle={Proc. Interspeech 2008},