8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003


Large Vocabulary ASR for Spontaneous Czech in the MALACH Project

Josef Psutka (1), Pavel Ircing (1), J.V. Psutka (1), Vlasta Radova (1), William J. Byrne (2), Jan Hajic (3), Jiri Mirovsky (3), Samuel Gustman (4)

(1) University of West Bohemia in Pilsen, Czech Republic
(2) Johns Hopkins University, USA
(3) Charles University, Czech Republic
(4) Survivors of the Shoah Visual History Foundation, USA

This paper describes large vocabulary continuous speech recognition (LVCSR) research into the automatic transcription of spontaneous Czech speech in the MALACH (Multilingual Access to Large Spoken Archives) project. The project aims to improve access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) by advancing the state of the art in automated speech recognition. We describe a baseline ASR system and discuss the language modeling problems that arise from the nature of Czech, a highly inflectional language that also exhibits diglossia between its written and spontaneous forms. The difficulties of this task are compounded by heavily accented, emotional, and disfluent speech, along with frequent switching between languages. To overcome the limited amount of relevant language model data, we use statistical techniques to select an appropriate training corpus from a large unstructured text collection, resulting in significant reductions in word error rate.


Bibliographic reference. Psutka, Josef / Ircing, Pavel / Psutka, J.V. / Radova, Vlasta / Byrne, William J. / Hajic, Jan / Mirovsky, Jiri / Gustman, Samuel (2003): "Large vocabulary ASR for spontaneous Czech in the MALACH project", in EUROSPEECH-2003, 1821-1824.