Sixth European Conference on Speech Communication and Technology

Budapest, Hungary
September 5-9, 1999

Language Modeling Based on Automatic Word Concatenations

Christel Beaujard, Michèle Jardino

LIMSI-CNRS Université Paris-Sud, Orsay, France

This paper describes an automatic process that builds variable-length compound words, without fixing their maximum length, according to the contexts in which they are observed in a training text. Four criteria have been studied: the bigram frequency, a normalized measure based on mutual information, and the left and right conditional probabilities. This work was performed on a database recorded at LIMSI, consisting of rail travel information requests. The corresponding language models have been evaluated in terms of perplexity and speech recognition error rates with the LIMSI speech recognizer, and compared with a baseline word bigram model. The best results are obtained when the model is built with words concatenated according to the left conditional probability.
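
As an illustration only, the kind of procedure summarized above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the underscore separator, the `threshold`, `min_count` and `max_merges` parameters, and the greedy one-pair-per-pass strategy are all hypothetical. The mutual information score is plain pointwise mutual information, since the normalization used in the paper is not given here. The two conditional probabilities carry neutral names, because which of them corresponds to the "left" and which to the "right" criterion is not specified in this summary.

```python
from collections import Counter
import math


def bigram_scores(tokens):
    """Score every adjacent pair (a, b) with four illustrative criteria."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = {
            "frequency": count,
            # Assumption: unnormalized pointwise mutual information.
            "pmi": math.log(p_ab / (p_a * p_b)),
            # Estimate of P(b | a): how often a is followed by b.
            "p_next_given_prev": count / unigrams[a],
            # Estimate of P(a | b): how often b is preceded by a.
            "p_prev_given_next": count / unigrams[b],
        }
    return scores


def concatenate(tokens, pair, sep="_"):
    """Replace each left-to-right occurrence of `pair` by one compound token."""
    a, b = pair
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + sep + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def build_compounds(tokens, criterion, threshold, min_count=2, max_merges=100):
    """Greedily merge the best-scoring pair until none reaches the threshold.

    Compounds can be merged again on later passes, so their length is not
    bounded in advance (only the number of passes is, via max_merges).
    """
    for _ in range(max_merges):
        scores = bigram_scores(tokens)
        candidates = [
            (s[criterion], pair)
            for pair, s in scores.items()
            if s["frequency"] >= min_count and s[criterion] >= threshold
        ]
        if not candidates:
            break
        _, best = max(candidates)  # ties are broken by the pair itself
        tokens = concatenate(tokens, best)
    return tokens


if __name__ == "__main__":
    text = "i want a ticket to paris i want a ticket to lyon".split()
    print(build_compounds(text, "p_next_given_prev", threshold=1.0))
```

On this toy input, with the `p_next_given_prev` criterion and a threshold of 1.0, the repeated phrase is merged step by step into the single token `i_want_a_ticket_to`, while `paris` and `lyon` remain separate words. A bigram model would then be trained on the rewritten text, which is how compound words can extend the context seen by the model.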

