9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

IRSTLM: An Open Source Toolkit for Handling Large Scale Language Models

Marcello Federico, Nicola Bertoldi, Mauro Cettolo

FBK-irst, Italy

Research in speech recognition and machine translation is boosting the use of large scale n-gram language models. We present an open source toolkit that makes it possible to efficiently handle language models with billions of n-grams on conventional machines. The IRSTLM toolkit supports distributing n-gram collection and smoothing over a computer cluster, compressing language models through probability quantization, and lazy-loading huge language models from disk. IRSTLM has so far been successfully deployed with the Moses toolkit for statistical machine translation and with the FBK-irst speech recognition system. The efficiency of the toolkit is reported on a speech transcription task of Italian political speeches using a language model of 1.1 billion four-grams.
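To illustrate the probability-quantization idea mentioned in the abstract, the following is a minimal sketch of compressing log-probabilities into 8-bit codes via equal-population binning. This is a hypothetical illustration under simple assumptions (function names, the binning scheme, and the 256-level codebook size are choices made here for clarity), not a reproduction of IRSTLM's actual quantizer.

```python
import numpy as np


def quantize_logprobs(logprobs, levels=256):
    """Quantize log-probabilities into `levels` bins (equal-population binning).

    Returns a codebook of bin centroids and one uint8 code per input value,
    so each probability is stored in a single byte plus a shared codebook.
    Hypothetical sketch; IRSTLM's actual quantization scheme may differ.
    """
    logprobs = np.asarray(logprobs, dtype=np.float64)
    order = np.argsort(logprobs)              # indices of values, ascending
    bins = np.array_split(order, levels)      # roughly equal-sized bins
    codebook = np.zeros(len(bins))
    codes = np.empty(len(logprobs), dtype=np.uint8)
    for i, idx in enumerate(bins):
        if len(idx) == 0:                     # more levels than values
            codebook[i] = codebook[i - 1] if i > 0 else 0.0
            continue
        codebook[i] = logprobs[idx].mean()    # bin centroid
        codes[idx] = i
    return codebook, codes


def dequantize(codebook, codes):
    """Recover approximate log-probabilities from codes via codebook lookup."""
    return codebook[codes]
```

With 256 levels each n-gram probability occupies one byte instead of a 4- or 8-byte float, at the cost of a small, bounded rounding error in the recovered log-probability.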


Bibliographic reference. Federico, Marcello / Bertoldi, Nicola / Cettolo, Mauro (2008): "IRSTLM: An Open Source Toolkit for Handling Large Scale Language Models". In INTERSPEECH-2008, 1618-1621.