ISCA Archive Eurospeech 1999

Variable-length sequence language model for large vocabulary continuous dictation machine

I. Zitouni, J. F. Mari, K. Smadli, Jean-Paul Haton

In natural language, some sequences of words are very frequent. A classical language model, such as an n-gram model, does not adequately take such sequences into account, because it underestimates their probabilities. A better approach is to model word sequences as if they were individual dictionary elements. Sequences are treated as additional entries of the word lexicon, on which language models are computed. In this paper, we present two methods for automatically determining frequent phrases in unlabeled corpora of written sentences. These methods are based on information-theoretic criteria that ensure high statistical consistency. Our models reach a local optimum, since they minimize perplexity. The first procedure relies only on the n-gram language model to extract word sequences. The second is based on a class n-gram model trained on 233 classes extracted from the eight grammatical classes of French. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20,000 words and a corpus of 43 million words extracted from the "Le Monde" newspaper. Our models reduce perplexity by more than 20% compared with n-gram (n ≤ 3) and multigram models. In terms of recognition rate, our models outperform n-gram and multigram models.
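As a rough illustration of treating frequent word sequences as extra lexicon entries, the following minimal sketch greedily merges adjacent word pairs into single tokens. It is not the authors' procedure: the abstract describes selection by perplexity minimization, whereas this sketch swaps in pointwise mutual information (PMI) as a simpler stand-in criterion. The function names, the underscore joiner, and the `min_count` and `max_merges` parameters are all assumptions made for the example.

```python
import math
from collections import Counter


def best_pair(sentences, min_count=2):
    """Return the adjacent pair with the highest positive PMI, or None."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(p for s in sentences for p in zip(s, s[1:]))
    total = sum(unigrams.values())
    best, best_score = None, 0.0
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs: their estimates are unreliable
        # Approximate PMI = log(P(a,b) / (P(a) P(b))) from relative
        # frequencies, using the token count as the common normalizer.
        score = math.log(count * total / (unigrams[a] * unigrams[b]))
        if score > best_score:
            best, best_score = (a, b), score
    return best


def merge_pair(sentence, pair):
    """Rewrite one sentence, joining each occurrence of `pair` into one token."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) == pair:
            out.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out


def extract_sequences(sentences, max_merges=10, min_count=2):
    """Greedily merge pairs; return the rewritten corpus and the new entries."""
    added = []
    for _ in range(max_merges):
        pair = best_pair(sentences, min_count)
        if pair is None:
            break  # no remaining pair is frequent and cohesive enough
        sentences = [merge_pair(s, pair) for s in sentences]
        added.append("_".join(pair))
    return sentences, added
```

Because merged tokens re-enter the corpus, repeated merges can produce sequences longer than two words, which is what makes the resulting units variable-length. An n-gram model would then be estimated on the rewritten corpus, with each merged token counted as one lexicon entry.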


doi: 10.21437/Eurospeech.1999-364

Cite as: Zitouni, I., Mari, J.F., Smadli, K., Haton, J.-P. (1999) Variable-length sequence language model for large vocabulary continuous dictation machine. Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999), 1811-1814, doi: 10.21437/Eurospeech.1999-364

@inproceedings{zitouni99_eurospeech,
  author={I. Zitouni and J. F. Mari and K. Smadli and Jean-Paul Haton},
  title={{Variable-length sequence language model for large vocabulary continuous dictation machine}},
  year=1999,
  booktitle={Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999)},
  pages={1811--1814},
  doi={10.21437/Eurospeech.1999-364}
}