EUROSPEECH 2001 Scandinavia
A new framework is proposed for constructing corpus-based, topic-adapted language models for large-vocabulary speech recognition of the highly inflected Slovenian language. The proposed techniques can be applied to other Slavic languages, in which words are formed by many different inflectional affixes. This article describes an attempt to overcome two important difficulties of highly inflected languages: a high out-of-vocabulary rate and the problem of topic detection. The first problem is solved by decomposing words into stems and endings, and topic detection is improved by a novel approach to feature extraction based on soft comparison of words. Experiments on Vecer, the second-largest Slovenian newspaper news corpus, show an average decrease in perplexity of 17% compared with a general word-based model.
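The two ideas named in the abstract, stem-ending decomposition and soft comparison of words, can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the decomposition rules or the fuzzy comparison function. The ending list `ENDINGS`, the helper names `split_word` and `soft_match`, and the shared-prefix similarity measure are all assumptions chosen for illustration.

```python
# Hypothetical list of inflectional endings, for illustration only;
# it is not taken from the paper.
ENDINGS = ("ega", "emu", "ih", "im", "a", "e", "i", "o", "u")


def split_word(word, endings=ENDINGS, min_stem=2):
    """Split a word into (stem, ending) using the longest matching ending.

    Returns (word, "") when no ending matches or the stem would be
    shorter than min_stem characters.
    """
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[: -len(ending)], ending
    return word, ""


def soft_match(a, b):
    """Return a similarity in [0, 1] instead of a hard equal/unequal decision.

    Assumed measure: length of the shared prefix divided by the length
    of the longer word. Inflected forms of one lemma tend to share a
    long prefix, so they score close to 1.
    """
    prefix = 0
    for x, y in zip(a, b):
        if x != y:
            break
        prefix += 1
    return prefix / max(len(a), len(b), 1)


if __name__ == "__main__":
    print(split_word("novega"))
    print(soft_match("mesto", "mesta"))
```

Under these assumptions, two inflected forms of the same word contribute to the same topic feature in proportion to their similarity, rather than being counted as unrelated vocabulary items.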
Bibliographic reference. Maucec, Mirjam Sepesy / Kacic, Zdravko (2001): "Topic detection for language model adaptation of highly-inflected languages by using a fuzzy comparison function", In EUROSPEECH-2001, 243-246.