8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Vocabulary Selection for a Broadcast News Transcription System Using a Morpho-Syntactic Approach

Ciro Martins (1), António J. S. Teixeira (1), João Neto (2)

(1) Universidade de Aveiro, Portugal
(2) L2F INESC-ID/IST, Portugal

Although the vocabularies of ASR systems are designed to achieve high coverage for the expected domain, out-of-vocabulary (OOV) words cannot be avoided. In particular, for daily and real-time transcription of Broadcast News (BN) data in highly inflected languages, rapid vocabulary growth leads to high OOV word rates. To overcome this problem, we present a new morpho-syntactic approach that dynamically selects the target vocabulary for this domain by trading off OOV word rate against vocabulary size.

We evaluate this approach against the common selection strategy based on word frequency. Experiments were carried out on a European Portuguese BN transcription system. Results computed on seven news shows yield a relative reduction in OOV word rate of 37.8% compared with the baseline system and 5.5% compared with the common word-frequency approach.
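To make the quantities in the evaluation concrete, the following minimal sketch shows how an OOV word rate, a frequency-based vocabulary selection (the common strategy used for comparison), and a relative reduction are typically computed. It is an illustration under stated assumptions, not the authors' implementation: the function names, the whitespace tokenization, and the toy sentences are all hypothetical, and the morpho-syntactic selection itself is not reproduced here.

```python
from collections import Counter


def select_by_frequency(training_tokens, vocab_size):
    """Keep the vocab_size most frequent word types from the training text.

    This is the generic word-frequency strategy, not the morpho-syntactic one.
    """
    counts = Counter(training_tokens)
    return {word for word, _ in counts.most_common(vocab_size)}


def oov_rate(test_tokens, vocabulary):
    """Fraction of running words in the test text that are not in the vocabulary."""
    if not test_tokens:
        return 0.0
    misses = sum(1 for word in test_tokens if word not in vocabulary)
    return misses / len(test_tokens)


def relative_reduction(baseline_rate, new_rate):
    """Relative reduction of new_rate with respect to baseline_rate."""
    return (baseline_rate - new_rate) / baseline_rate


# Hypothetical toy data, tokenized on whitespace (an assumption for illustration).
train = "o governo anunciou o plano e o governo aprovou o plano".split()
test = "o governo aprovou a proposta".split()

small_vocab = select_by_frequency(train, 3)   # smaller vocabulary, higher OOV rate
large_vocab = select_by_frequency(train, 6)   # larger vocabulary, lower OOV rate

small_rate = oov_rate(test, small_vocab)
large_rate = oov_rate(test, large_vocab)
gain = relative_reduction(small_rate, large_rate)

print(f"OOV rate with 3 words: {small_rate:.2f}")
print(f"OOV rate with 6 words: {large_rate:.2f}")
print(f"Relative reduction: {gain:.1%}")
```

The toy run illustrates the trade-off mentioned in the abstract: enlarging the vocabulary lowers the OOV word rate, so a selection method is judged by how low an OOV rate it reaches for a given vocabulary size.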

Bibliographic reference. Martins, Ciro / Teixeira, António J. S. / Neto, João (2007): "Vocabulary selection for a broadcast news transcription system using a morpho-syntactic approach", in INTERSPEECH-2007, 2369-2372.