ITRW on
Adaptation Methods for Speech Recognition

August 29-30, 2001
Sophia Antipolis, France

A Framework for Language Model Adaptation for Highly-Inflected Slovenian Language

Mirjam Sepesy Maucec, Zdravko Kacic, Bogomir Horvat

Faculty of Electrical Engineering and Computer Science, University of Maribor, Slovenia

This paper describes a new framework for constructing topic-adapted language models for large-vocabulary speech recognition of the highly inflected Slovenian language. Two important difficulties caused by the high inflectionality of Slovenian are discussed: the out-of-vocabulary rate and feature extraction for topic detection. To use the most popular language models (N-grams) and well-known classifiers (TFIDF, naive Bayes) effectively, we define different basic units at different stages of language model construction. Basic language models use smaller lexical units: words are decomposed into stems and endings. In contrast, classifiers require larger units that carry semantic information: words with the same meaning but different grammatical forms are mapped into a set of equivalence classes. The proposed techniques for selecting basic units are language independent and can be applied to other languages in which words are formed with many different inflectional affixes. Experimental results of adaptation, obtained on a corpus of documents from Večer, the second-largest Slovenian newspaper, show an additional 5% improvement in perplexity over basic morphological models.
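
The two kinds of basic units described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the ending inventory `ENDINGS`, the minimum stem length `MIN_STEM`, the longest-match splitting rule, and the use of the stem as the equivalence-class key are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Toy ending inventory (an assumption for illustration, not the paper's list).
ENDINGS = ("ami", "om", "a", "e", "i", "o", "u")
MIN_STEM = 3  # assumed minimum stem length, to avoid over-splitting short words


def split_word(word):
    """Return (stem, ending) by stripping the longest matching ending."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM:
            return word[: -len(ending)], ending
    return word, ""  # no known ending: keep the word whole


def lm_units(words):
    """Smaller units for the basic language model: stems and endings as separate tokens."""
    units = []
    for w in words:
        stem, ending = split_word(w)
        units.append(stem)
        if ending:
            units.append("+" + ending)  # "+" marks an ending token
    return units


def topic_features(words):
    """Larger units for a classifier: inflected forms counted under one class key.

    Using the stem as the class key is an assumption of this sketch.
    """
    return Counter(split_word(w)[0] for w in words)


# Inflected forms of "knjiga" (book) and "mesto" (town).
words = ["knjiga", "knjige", "knjigo", "mesto", "mesta"]
print(lm_units(words))
print(topic_features(words))
```

In this sketch the language-model view keeps the vocabulary small because each stem and each ending is stored once, while the classifier view pools all inflected forms of a word into a single feature count.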


Bibliographic reference.  Sepesy Maucec, Mirjam / Kacic, Zdravko / Horvat, Bogomir (2001): "A framework for language model adaptation for highly-inflected Slovenian language", In Adaptation-2001, 211-214.