9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Automatic Estimation of Language Model Parameters for Unseen Words Using Morpho-Syntactic Contextual Information

Ciro Martins (1), António Teixeira (1), João Neto (2)

(1) Universidade de Aveiro, Portugal; (2) INESC-ID/IST, Portugal

Various information sources naturally contain new words that appear on a daily basis and are not present in the vocabulary of the speech recognition system, yet are important for applications such as closed-captioning or information dissemination. To be recognized, those words need to be included in the vocabulary and the language model (LM) parameters updated. In this context, we propose a new method that allows new words to be included in the vocabulary even when no well-suited training data is available, as is the case for archived documents, and without the need for LM retraining. It uses morpho-syntactic information about an in-domain corpus and part-of-speech word classes to define a new LM unigram distribution associated with the updated vocabulary. Experiments were carried out for a European Portuguese Broadcast News transcription system. Results showed a relative reduction of 4% in word error rate, with 78% of the occurrences of those newly included words being correctly recognized.
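The general idea of assigning unigram probabilities to unseen words from part-of-speech classes can be illustrated with a minimal sketch. This is not the formulation used in the paper, which the abstract does not specify. It assumes, purely for illustration, that each new word receives the average unigram probability of the known words sharing its part-of-speech class, after which the distribution is renormalized. The function name extend_unigrams, the tag labels, and the fallback to the global average are all hypothetical.

```python
from collections import defaultdict


def extend_unigrams(base_probs, pos_of, new_words):
    """Return a unigram distribution over the vocabulary plus new_words.

    base_probs: dict mapping known word -> unigram probability.
    pos_of:     dict mapping every word (known and new) -> POS class.
    new_words:  iterable of words absent from base_probs.

    Assumption (illustrative only): a new word gets the mean probability
    of the known words in its POS class.
    """
    class_mass = defaultdict(float)  # total probability per POS class
    class_size = defaultdict(int)    # number of known words per POS class
    for word, prob in base_probs.items():
        tag = pos_of[word]
        class_mass[tag] += prob
        class_size[tag] += 1

    # Fallback for a POS class with no known members: global mean.
    global_mean = sum(base_probs.values()) / len(base_probs)

    extended = dict(base_probs)
    for word in new_words:
        tag = pos_of[word]
        if class_size[tag] > 0:
            extended[word] = class_mass[tag] / class_size[tag]
        else:
            extended[word] = global_mean

    # Renormalize so the extended distribution sums to one.
    total = sum(extended.values())
    return {word: prob / total for word, prob in extended.items()}
```

Because only the unigram level is touched, a scheme of this kind needs no retraining on new text, which matches the constraint stated in the abstract; how the higher-order n-grams are handled is not covered by this sketch.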

Bibliographic reference. Martins, Ciro / Teixeira, António / Neto, João (2008): "Automatic estimation of language model parameters for unseen words using morpho-syntactic contextual information", in INTERSPEECH-2008, 1602-1605.