5th International Conference on Spoken Language Processing
Stochastic language models based on word n-grams require a huge amount of training material, especially for large-vocabulary systems. N-grams based on classes need much less training material and achieve higher coverage. Building classes on the basis of linguistic characteristics (part of speech, POS) has the advantage that new words can be assigned to a class easily. Until now, class sets for POS-based language models have usually been defined by linguistic experts. In this paper we present an approach in which, for a given number of classes, a class set is generated automatically such that the entropy of the language model is minimized. We perform experiments on German medical reports comprising about 1.2 million words of text and a vocabulary of 24,000 words. Using our approach, we generate an exemplary class set of 196 optimized POS classes. Comparing the optimized POS-based language model with a language model based on 196 normally defined classes, we obtain an improvement of up to 10% in test set perplexity.
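As an illustration of the quantity being compared, the following is a minimal sketch of a class-based bigram model and its test set perplexity. It is not the method of the paper. It assumes the standard factorization P(w_i | w_{i-1}) = P(w_i | c_i) * P(c_i | c_{i-1}), add-one smoothing on the class bigrams, and that every test word also occurs in the training data. The function names and the `word2class` mapping are hypothetical.

```python
import math
from collections import Counter


def train_class_bigram(sentences, word2class):
    """Collect the counts needed for P(c_i | c_{i-1}) and P(w_i | c_i)."""
    class_bigrams = Counter()   # (previous class, class) counts
    class_context = Counter()   # how often a class occurs as predecessor
    class_counts = Counter()    # how often a class occurs
    word_counts = Counter()     # how often a word occurs
    for sentence in sentences:
        prev = "<s>"            # assumed sentence-start symbol
        for word in sentence:
            cls = word2class[word]
            class_bigrams[(prev, cls)] += 1
            class_context[prev] += 1
            class_counts[cls] += 1
            word_counts[word] += 1
            prev = cls
    return class_bigrams, class_context, class_counts, word_counts


def perplexity(sentences, word2class, model, num_classes):
    """Perplexity 2**H, where H is the average negative log2 probability per word."""
    class_bigrams, class_context, class_counts, word_counts = model
    log_prob = 0.0
    n_words = 0
    for sentence in sentences:
        prev = "<s>"
        for word in sentence:
            cls = word2class[word]
            # Assumption: add-one smoothing over the class inventory.
            p_class = (class_bigrams[(prev, cls)] + 1) / (
                class_context[prev] + num_classes
            )
            # Assumption: the word was seen in training, so this is non-zero.
            p_word = word_counts[word] / class_counts[cls]
            log_prob += math.log2(p_class * p_word)
            n_words += 1
            prev = cls
    return 2 ** (-log_prob / n_words)
```

Two candidate class sets with the same number of classes could then be compared by training one model per `word2class` mapping and evaluating both on the same held-out text; the mapping with the lower perplexity is the better one under this measure.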
Bibliographic reference. Witschel, Petra (1998): "Optimized POS-based language models for large vocabulary speech recognition", In ICSLP-1998, paper 0471.