8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Morfessor and VariKN Machine Learning Tools for Speech and Language Technology

Vesa Siivola, Mathias Creutz, Mikko Kurimo

Helsinki University of Technology, Finland

This paper introduces two recent open-source software packages developed for unsupervised natural language modeling. The Morfessor program automatically segments words into morpheme-like units without relying on rule-based morphological analyzers. The VariKN toolkit trains language models, producing a compact set of high-order n-grams using state-of-the-art Kneser-Ney smoothing. As an example, this paper shows how to construct a language model for speech recognition in multiple languages using only a minimal amount of linguistic resources. Morfessor and VariKN also have other applications in text understanding, information retrieval, and machine translation. Unsupervised machine learning techniques are particularly well suited to developing systems for less-resourced languages, because they do not depend on manually designed morphological or syntactic analyzers or on annotated data.

Bibliographic reference. Siivola, Vesa / Creutz, Mathias / Kurimo, Mikko (2007): "Morfessor and VariKN machine learning tools for speech and language technology", in INTERSPEECH-2007, 1549-1552.