This paper presents our latest efforts toward large vocabulary speech recognition systems for five Eastern European languages, namely Russian, Bulgarian, Czech, Croatian and Polish, using the Rapid Language Adaptation Toolkit (RLAT). We investigated the possibility of crawling large quantities of text material from the Internet, which is very cheap but requires text post-processing steps because of the varying text quality. The goal of this study is to determine the best strategy for language model optimization on the given domain in a short time period with minimal human effort. Our results show that we can build an initial automatic speech recognition (ASR) system for these five languages in only ten days using RLAT. On the multilingual GlobalPhone speech corpus, we achieved a Word Error Rate (WER) of 16.9% for Bulgarian, 23.5% for Czech, 20.4% for Polish, 32.8% for Croatian and 36.2% for Russian.
Bibliographic reference. Vu, Ngoc Thang / Schlippe, Tim / Kraus, Franziska / Schultz, Tanja (2010): "Rapid bootstrapping of five eastern european languages using the rapid language adaptation toolkit", In INTERSPEECH-2010, 865-868.