8th International Conference on Spoken Language Processing

Jeju Island, Korea
October 4-8, 2004

Very Large Vocabulary Speech Recognition System for Automatic Transcription of Czech Broadcast Programs

Jan Nouza, Dana Nejedlova, Jindrich Zdansky, Jan Kolorenc

Technical University of Liberec, Czech Republic

This paper describes the first speech recognition system capable of transcribing a wide range of spoken broadcast programs in the Czech language with an out-of-vocabulary (OOV) rate below 3 per cent. To achieve that level we had to a) create an optimized 200k word vocabulary with multiple text and pronunciation forms, b) extract an appropriate language model from a 300M word text corpus and c) develop our own decoder specially designed for a lexicon of that size. The system was tested on various types of broadcast programs with the following results: the Czech part of the European COST278 database of TV news (71.5 % accuracy rate on complete news streams, 82.7 % on their clean parts), radio news (80.2 %), read commentaries (78.6 %), broadcast debates (74.3 %) and recordings of speeches given by state presidents (85.8 %).
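
For readers unfamiliar with the OOV figure quoted above, the following is a minimal sketch of how such a rate is commonly computed: the share of running words in a reference transcript that are missing from the recognizer's vocabulary. The function name `oov_rate` and the toy word lists are assumptions made for illustration; they are not taken from the paper, and the authors' exact counting rules (for example, the handling of multiple text forms) are not stated in the abstract.

```python
def oov_rate(transcript_words, vocabulary):
    """Return the fraction of running words that are not in the vocabulary.

    Hypothetical helper for illustration; not the authors' tooling.
    """
    if not transcript_words:
        raise ValueError("transcript is empty")
    vocab = set(vocabulary)  # set membership keeps each lookup O(1)
    misses = sum(1 for word in transcript_words if word not in vocab)
    return misses / len(transcript_words)


# Toy data invented for this example only.
vocabulary = {"dobrý", "den", "zprávy", "dnes"}
transcript = ["dobrý", "den", "dnes", "zprávy", "počasí"]

rate = oov_rate(transcript, vocabulary)
print(f"OOV rate: {rate:.1%}")
```

Counting over running words (tokens) rather than distinct words (types) is the usual convention, since it reflects how often the recognizer meets a word it cannot possibly output.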

Bibliographic reference. Nouza, Jan / Nejedlova, Dana / Zdansky, Jindrich / Kolorenc, Jan (2004): "Very large vocabulary speech recognition system for automatic transcription of Czech broadcast programs", In INTERSPEECH-2004, 409-412.