EUROSPEECH 2001 Scandinavia
7th European Conference on Speech Communication and Technology

Aalborg, Denmark
September 3-7, 2001


Statistical Language Model Based On a Hierarchical Approach: MCnv

Imed Zitouni, Kamel Smaili, Jean-Paul Haton

LORIA, France

In this paper, we propose a new language model based on dependent word sequences organized in a multi-level hierarchy. We call this model MCnv, where n is the maximum number of words in a sequence and v is the maximum number of levels in the hierarchy. The originality of the model lies in its capacity to take into account dependent variable-length sequences for very large vocabularies. To discover the variable-length sequences and build the hierarchy, we use a set of 233 syntactic classes extracted from the 8 French elementary grammatical classes. The MCnv model learns hierarchical word patterns and uses them to re-evaluate and filter the n-best utterance hypotheses output by our speech recognizer MAUD. The model was trained on a corpus of 43 million words extracted from a French newspaper and uses a vocabulary of 20,000 words. Tests were conducted on 300 sentences. The model achieved a 17% decrease in perplexity compared to an interpolated class trigram model, and rescoring the original n-best hypotheses improved accuracy by 5%.
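The two evaluation steps mentioned in the abstract, perplexity comparison and n-best rescoring, can be illustrated with a minimal Python sketch. This is not the MCnv model or the MAUD recognizer: the function names, the log-linear score combination, the `lm_weight` parameter, and the toy scores are all assumptions made for illustration only.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))


def relative_reduction(baseline, new):
    """Relative decrease of `new` with respect to `baseline`."""
    return (baseline - new) / baseline


def rescore(nbest, lm_logprob, lm_weight=1.0):
    """Reorder n-best hypotheses by recognizer score plus weighted LM score.

    nbest: list of (words, recognizer_log_score) pairs (hypothetical format).
    lm_logprob: callable mapping a word list to a log probability
                (a stand-in for any language model).
    Returns the word lists sorted best first.
    """
    scored = [
        (score + lm_weight * lm_logprob(words), words)
        for words, score in nbest
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [words for _, words in scored]


# Toy usage with made-up scores (not results from the paper).
def toy_lm(words):
    # Hypothetical LM that prefers hypotheses containing "chat".
    return -1.0 if "chat" in words else -3.0


nbest = [
    (["le", "chas", "dort"], -9.5),   # best according to the recognizer alone
    (["le", "chat", "dort"], -10.0),
]
best = rescore(nbest, toy_lm)[0]
```

In this toy case the recognizer alone ranks the first hypothesis highest, while the combined score promotes the second one. A perplexity drop from 100 to 83, for example, would correspond to the kind of 17% relative decrease reported in the abstract; those two values are invented here purely to show the arithmetic.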


Bibliographic reference. Zitouni, Imed / Smaili, Kamel / Haton, Jean-Paul (2001): "Statistical language model based on a hierarchical approach: MCnv", in EUROSPEECH-2001, 29-32.