12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Unary Data Structures for Language Models

Jeffrey Sorensen (1), Cyril Allauzen (2)

(1) Google Inc., USA
(2) Google Research, USA

Language models are important components of speech recognition and machine translation systems. Trained on billions of words and consisting of billions of parameters, language models are often the single largest component of these systems. Many techniques have been proposed to reduce their storage requirements. A technique based on pointer-free compact storage of ordinal trees achieves compression competitive with the best proposed systems while retaining the full finite-state structure, and without using computationally expensive block compression schemes or lossy quantization techniques.
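
The abstract does not spell out the encoding itself. Purely as an illustration, one well-known pointer-free representation of ordinal trees is the level-order unary degree sequence (LOUDS), in which each node's child count is written in unary (one 1-bit per child, then a 0-bit) while visiting nodes in breadth-first order. The sketch below assumes that scheme; the function names, the toy tree, and the naive linear-time rank/select helpers are assumptions made for this example and are not taken from the paper. Practical implementations would replace the helpers with constant-time rank/select indexes.

```python
from collections import deque

def louds_encode(children, root=0):
    """Return (bits, order): the LOUDS bit list and the nodes in level order."""
    bits = [1, 0]            # super-root whose only child is the real root
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        kids = children.get(node, [])
        bits.extend([1] * len(kids))  # degree in unary
        bits.append(0)                # terminator for this node
        queue.extend(kids)
    return bits, order

def select(bits, value, k):
    """Position of the k-th (1-based) occurrence of value; naive O(n) scan."""
    seen = 0
    for pos, b in enumerate(bits):
        if b == value:
            seen += 1
            if seen == k:
                return pos
    raise IndexError("fewer than k occurrences")

def rank(bits, value, pos):
    """Number of occurrences of value in bits[0..pos], inclusive."""
    return sum(1 for b in bits[:pos + 1] if b == value)

def child_indices(bits, i):
    """Level-order indices of the children of the node with level-order index i."""
    start = select(bits, 0, i + 1) + 1   # first bit of node i's unary block
    end = select(bits, 0, i + 2)         # the 0-bit that terminates the block
    if start == end:
        return []
    first = rank(bits, 1, start) - 1     # each 1-bit stands for one node
    return list(range(first, first + (end - start)))

def parent_index(bits, i):
    """Level-order index of the parent of node i, or None for the root."""
    if i == 0:
        return None
    pos = select(bits, 1, i + 1)         # the 1-bit that represents node i
    return rank(bits, 0, pos) - 1

# Hypothetical toy tree: node -> ordered list of children.
tree = {0: [1, 2, 3], 1: [4, 5], 3: [6]}
bits, order = louds_encode(tree)
```

For a tree with n nodes this encoding uses 2n + 1 bits for the shape and stores no pointers; parent and child navigation reduce to rank and select queries on the bit vector.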


Bibliographic reference. Sorensen, Jeffrey / Allauzen, Cyril (2011): "Unary data structures for language models", In INTERSPEECH-2011, 1425-1428.