ITRW on
Adaptation Methods for Speech Recognition

August 29-30, 2001
Sophia Antipolis, France

Lexicon Adaptation for Broadcast News Transcription

Nicola Bertoldi and Marcello Federico

ITC-irst - Centro per la Ricerca Scientifica e Tecnologica, Povo, Trento, Italy

This paper presents a technique for dynamically extending the language model lexicon of an Italian broadcast news transcription system. New words are selected day by day from contemporary news available on the Internet, according to a strategy that tries to minimize the out-of-vocabulary rate of the language model. Phonetic transcriptions of new words are generated automatically with a software tool developed in-house. Experiments performed with the ITC-irst 62K-word baseline system show that using approximate phonetic transcriptions for less frequent words does not affect recognition performance. Lexicon extensions of up to 122K words were evaluated on 19 news programs, spanning over one month, for a total of 6 hours of speech. The best lexicon extension strategy reduced the out-of-vocabulary rate by 61.8% relative, from 1.57% to 0.60%, and the word error rate by 2.16% relative, from 25.03% to 24.49%.
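
The improvements quoted above are relative reductions, which can be checked from the absolute figures. The following is a minimal sketch of that arithmetic, together with a token-level out-of-vocabulary rate and a simple frequency-based word selection. The function names, the whitespace-token representation, and the frequency-ranking rule are assumptions made for illustration only; the abstract does not describe the selection strategy in this detail.

```python
from collections import Counter


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction of a rate, in percent of its starting value."""
    return 100.0 * (before - after) / before


def oov_rate(tokens, lexicon) -> float:
    """Percentage of running words (tokens) that are not in the lexicon."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    known = set(lexicon)
    missing = sum(1 for token in tokens if token not in known)
    return 100.0 * missing / len(tokens)


def select_new_words(counts: Counter, lexicon, budget: int):
    """Hypothetical selection rule (an assumption, not the paper's strategy):
    pick the `budget` most frequent words of recent news text that the
    lexicon does not yet contain."""
    known = set(lexicon)
    candidates = [(word, n) for word, n in counts.items() if word not in known]
    # Sort by descending frequency, then alphabetically for a stable result.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in candidates[:budget]]


if __name__ == "__main__":
    # Figures taken from the abstract.
    print(round(relative_reduction(1.57, 0.60), 1))    # out-of-vocabulary rate
    print(round(relative_reduction(25.03, 24.49), 2))  # word error rate
```

Under these assumptions, extending the lexicon with the selected words and recomputing `oov_rate` on the same day's transcripts would show how much of the out-of-vocabulary mass the new words cover.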

Bibliographic reference. Bertoldi, Nicola / Federico, Marcello (2001): "Lexicon adaptation for broadcast news transcription", In Adaptation-2001, 187-190.