INTERSPEECH 2010

This paper describes an extension of the n-gram language model: the similar n-gram language model. The estimation of the probability P(s) of a string s by the classical model of order n is computed using statistics of occurrences of the last n words of the string in the corpus, whereas the proposed model further uses all the strings s' for which the Levenshtein distance to s is smaller than a given threshold. The similarity between s and each string s' is estimated using co-occurrence statistics. The new P(s) is approximated by smoothing all the similar n-gram probabilities with a regression technique. A slight but statistically significant decrease in the word error rate is obtained on a state-of-the-art automatic speech recognition system when the similar n-gram language model is interpolated linearly with the n-gram model.
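
As a rough illustration of two steps named in the abstract, the sketch below selects the candidate strings s' whose Levenshtein distance to s is below a threshold and linearly interpolates two probability estimates. It is not the authors' implementation: the function names, the choice of a word-level (rather than character-level) distance, and the interpolation weight `lam` are assumptions made for this example, and the co-occurrence-based similarity and the regression smoothing are not reproduced.

```python
from typing import Iterable, List, Sequence, Tuple

Ngram = Tuple[str, ...]


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two word sequences (assumed word-level)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similar_ngrams(s: Ngram, candidates: Iterable[Ngram],
                   threshold: int) -> List[Ngram]:
    """Keep candidates whose distance to s is strictly below the threshold."""
    return [c for c in candidates if levenshtein(s, c) < threshold]


def interpolate(p_ngram: float, p_similar: float, lam: float) -> float:
    """Linear interpolation; lam is a hypothetical weight in [0, 1]."""
    return lam * p_similar + (1.0 - lam) * p_ngram
```

For example, with `s = ("the", "cat", "sat")` and a threshold of 2, the candidate `("the", "dog", "sat")` is kept (one substitution), while `("a", "dog", "ran")` is discarded (three substitutions).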
Bibliographic reference. Gillot, Christian / Cerisara, Christophe / Langlois, David / Haton, Jean-Paul (2010): "Similar n-gram language model", In INTERSPEECH-2010, 1824-1827.