14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

Feature-Rich Sub-Lexical Language Models Using a Maximum Entropy Approach for German LVCSR

M. Ali Basha Shaik, Amr El-Desoky Mousa, Ralf Schlüter, Hermann Ney

RWTH Aachen University, Germany

German is a morphologically rich language with a high degree of word inflection, derivation and compounding. This leads to high out-of-vocabulary (OOV) rates and poor language model (LM) probabilities in large vocabulary continuous speech recognition (LVCSR) systems. One of the main challenges in German LVCSR is the recognition of OOV words. For this purpose, data-driven morphemes are used to provide higher lexical coverage. In addition, the probability estimates of a sub-lexical LM can be further improved using feature-rich LMs such as maximum entropy (MaxEnt) and class-based LMs. In this work, for a sub-lexical German LVCSR task, we investigate the use of multiple morpheme-level features as classes for building class-based LMs that are estimated using the state-of-the-art MaxEnt approach. Thus, the benefits of both MaxEnt LMs and traditional class-based LMs are effectively combined. Furthermore, we experiment with maximum a posteriori (MAP) adaptation of the MaxEnt class-based LMs. We show consistent reductions in both the OOV recognition error rate and the word error rate (WER) on a German LVCSR task from the Quaero project, compared to the traditional class-based LM and the N-gram morpheme-based LM.
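
For orientation, the two model families named in the abstract are commonly written as follows. This is a generic sketch in assumed notation (history h, sub-lexical unit w, feature functions f_i with weights λ_i, class map c); the abstract does not give the paper's exact feature set or factorization, so none of these symbols should be read as the authors' own.

```latex
% Conditional MaxEnt LM (generic form; notation assumed, not taken from the paper)
p_{\Lambda}(w \mid h) =
  \frac{\exp\big(\sum_i \lambda_i f_i(h, w)\big)}
       {\sum_{w'} \exp\big(\sum_i \lambda_i f_i(h, w')\big)}

% Traditional class-based factorization with an assumed class map c(.)
p(w \mid h) \approx p\big(c(w) \mid c(h)\big)\, p\big(w \mid c(w)\big)
```

In a MaxEnt class-based LM, class information can enter through the feature functions f_i rather than only through a fixed factorization, which is one way the two approaches can be combined.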

Bibliographic reference. Shaik, M. Ali Basha / Mousa, Amr El-Desoky / Schlüter, Ralf / Ney, Hermann (2013): "Feature-rich sub-lexical language models using a maximum entropy approach for German LVCSR", in INTERSPEECH-2013, 3404-3408.