8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

LSA-Based Language Model Adaptation for Highly Inflected Languages

Tanel Alumäe, Toomas Kirt

Tallinn University of Technology, Estonia

This paper presents a language model topic adaptation framework for highly inflected languages. In such languages, sub-word units are used as the basic units for language modeling. Since such units carry little semantic information, they are not well suited to topic adaptation. We propose to lemmatize the corpus of training documents before constructing a latent topic model. To adapt the language model, we use a few lemmatized training sentences to find a set of documents that are semantically close to the current document. Fast marginal adaptation of a sub-word trigram language model is used for adapting the background model. Experiments on a set of Estonian test texts show that the proposed approach gives a 19% decrease in language model perplexity. A statistically significant decrease in perplexity is already observed when only two sentences are used for adaptation. We also show that the model employing lemmatization gives consistently better results than the unlemmatized model.
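
The abstract names fast marginal adaptation but does not give its formula. As a rough illustration only, the sketch below uses the commonly cited unigram-rescaling form: each background probability P_bg(w | h) is multiplied by (P_adapt(w) / P_bg(w))^beta and the result is renormalized over the vocabulary. The function name, the dictionary-based representation, the default beta = 0.5, and the toy numbers in the usage note are all assumptions made for this example, not details from the paper. How the paper maps lemma-level topic estimates onto sub-word units is not stated in the abstract and is not modeled here.

```python
from typing import Dict


def fast_marginal_adaptation(
    background_cond: Dict[str, float],
    background_unigram: Dict[str, float],
    adapted_unigram: Dict[str, float],
    beta: float = 0.5,
) -> Dict[str, float]:
    """Adapt P_bg(w | h) for one fixed history h (hypothetical helper).

    background_cond:    P_bg(w | h) for every word w in the vocabulary
    background_unigram: P_bg(w), the background unigram distribution
    adapted_unigram:    P_adapt(w), a topic-adapted unigram distribution
    beta:               tuning exponent (assumed value, not from the paper)
    """
    # Scale each conditional probability by the unigram ratio raised to beta.
    scaled = {
        w: p * (adapted_unigram[w] / background_unigram[w]) ** beta
        for w, p in background_cond.items()
    }
    # Renormalize so the adapted distribution sums to one for this history.
    z = sum(scaled.values())
    return {w: p / z for w, p in scaled.items()}
```

With toy inputs such as `background_cond = {"a": 0.5, "b": 0.3, "c": 0.2}`, `background_unigram = {"a": 0.4, "b": 0.4, "c": 0.2}`, and `adapted_unigram = {"a": 0.2, "b": 0.6, "c": 0.2}`, the word "b", which the adapted unigram favors, gains probability mass and "a" loses it, while the result still sums to one.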

Bibliographic reference. Alumäe, Tanel / Kirt, Toomas (2007): "LSA-based language model adaptation for highly inflected languages", in INTERSPEECH-2007, 2357-2360.