Language models play an important role in many natural language processing applications, in particular large vocabulary speech recognition and statistical machine translation. For a long time, back-off n-gram language models were considered the state of the art when large amounts of training data are available. Recently, so-called continuous space methods, or neural network language models, have been shown to systematically outperform these models, and they are becoming increasingly popular. This article describes an open-source toolkit that implements these models efficiently, including support for GPU cards. The modular architecture makes it easy to work with different data formats and to support various alternative models. Using data selection, resampling techniques and highly optimized code, training on more than five billion words takes less than 24 hours. The resulting models achieve reductions in perplexity of almost 20%. This toolkit has been successfully applied to various languages for large vocabulary speech recognition and statistical machine translation. By making this toolkit available, we hope that many more researchers will be able to work on this very promising technique and thereby quickly advance the field.
Bibliographic reference. Schwenk, Holger (2013): "CSLM — a modular open-source continuous space language modeling toolkit", In INTERSPEECH-2013, 1198-1202.