15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Euronews: A Multilingual Benchmark for ASR and LID

Roberto Gretter

FBK, Italy

In this paper we present the first recognition experiments on a multilingual speech corpus designed for Automatic Speech Recognition (ASR) and Language IDentification (LID) purposes. The data come from the Euronews portal and were acquired both from the Web and from TV. The corpus includes data in 10 languages (Arabic, English, French, German, Italian, Polish, Portuguese, Russian, Spanish and Turkish). For each language, the corpus is composed of about 100 hours of speech for training (60 for Polish) and about 4 hours, manually transcribed, for testing. Training data include the audio, some reference text, the ASR output and their alignment. Ten baselines were prepared, one for each language, using only the training data, and their performance was evaluated on a subset of the test data. A LID system was also implemented, capable of recognizing words belonging to different languages in a continuous stream. Part of the corpus is freely available, for research purposes only, within the multilingual ASR benchmark for IWSLT 2014.


Bibliographic reference. Gretter, Roberto (2014): "Euronews: a multilingual benchmark for ASR and LID", in INTERSPEECH-2014, 1603-1607.