11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Similar N-Gram Language Model

Christian Gillot, Christophe Cerisara, David Langlois, Jean-Paul Haton

LORIA, France

This paper describes an extension of the n-gram language model: the similar n-gram language model. The classical model of order n estimates the probability P(s) of a string s from occurrence statistics of the last n words of the string in the corpus, whereas the proposed model additionally uses all strings s' whose Levenshtein distance to s is smaller than a given threshold. The similarity between s and each string s' is estimated using co-occurrence statistics. The new P(s) is approximated by smoothing all the similar n-gram probabilities with a regression technique. When the similar n-gram language model is linearly interpolated with the n-gram model, a slight but statistically significant decrease in word error rate is obtained on a state-of-the-art automatic speech recognition system.
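
As an illustration only, the following minimal Python sketch shows two of the ingredients named in the abstract: selecting the corpus n-grams s' whose word-level Levenshtein distance to s is below a threshold, and linearly interpolating two probability estimates. The function names, the toy corpus, the threshold value and the interpolation weight `lam` are assumptions made for this example and are not taken from the paper. The co-occurrence-based similarity and the regression smoothing are not reproduced here.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences (here: tuples of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similar_ngrams(s, counts, threshold):
    """Corpus n-grams whose distance to s is strictly below the threshold."""
    return [g for g in counts if levenshtein(s, g) < threshold]


def interpolate(p_similar, p_ngram, lam):
    """Linear interpolation of two probability estimates (lam is hypothetical)."""
    return lam * p_similar + (1.0 - lam) * p_ngram


# Toy corpus and trigram counts, for illustration only.
corpus = "the cat sat on the mat the cat lay on the mat".split()
n = 3
counts = Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))

# Trigrams within one word edit of the query (threshold 2 is an assumption).
neighbours = similar_ngrams(("the", "cat", "sat"), counts, threshold=2)
```

In this sketch every observed n-gram is compared against the query, which is only practical for a toy corpus; a real system would need an indexed or pruned search over candidate strings.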


Bibliographic reference. Gillot, Christian / Cerisara, Christophe / Langlois, David / Haton, Jean-Paul (2010): "Similar n-gram language model", In INTERSPEECH-2010, 1824-1827.