ISCA Archive Interspeech 2021

Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization

Gonçal V. Garcés Díaz-Munío, Joan-Albert Silvestre-Cerdà, Javier Jorge, Adrià Giménez Pastor, Javier Iranzo-Sánchez, Pau Baquero-Arnal, Nahuel Roselló, Alejandro Pérez-González-de-Martos, Jorge Civera, Albert Sanchis, Alfons Juan

We introduce Europarl-ASR, a large speech and text corpus of parliamentary debates including 1 300 hours of transcribed speeches and 70 million tokens of text in English extracted from European Parliament sessions. The training set is labelled with the Parliament’s non-fully-verbatim official transcripts, time-aligned. As verbatimness is critical for acoustic model training, we also provide automatically noise-filtered and automatically verbatimized transcripts of all speeches based on speech data filtering and verbatimization techniques. Additionally, 18 hours of transcribed speeches were manually verbatimized to build reliable speaker-dependent and speaker-independent development/test sets for streaming ASR benchmarking. The availability of manual non-verbatim and verbatim transcripts for dev/test speeches makes this corpus useful for the assessment of automatic filtering and verbatimization techniques. This paper describes the corpus and its creation, and provides off-line and streaming ASR baselines for both the speaker-dependent and speaker-independent tasks using the three training transcription sets. The corpus is publicly released under an open licence.


doi: 10.21437/Interspeech.2021-1905

Cite as: Díaz-Munío, G.V.G., Silvestre-Cerdà, J.-A., Jorge, J., Pastor, A.G., Iranzo-Sánchez, J., Baquero-Arnal, P., Roselló, N., Pérez-González-de-Martos, A., Civera, J., Sanchis, A., Juan, A. (2021) Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization. Proc. Interspeech 2021, 3695-3699, doi: 10.21437/Interspeech.2021-1905

@inproceedings{diazmunio21_interspeech,
  author={Gonçal V. Garcés Díaz-Munío and Joan-Albert Silvestre-Cerdà and Javier Jorge and Adrià Giménez Pastor and Javier Iranzo-Sánchez and Pau Baquero-Arnal and Nahuel Roselló and Alejandro Pérez-González-de-Martos and Jorge Civera and Albert Sanchis and Alfons Juan},
  title={{Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3695--3699},
  doi={10.21437/Interspeech.2021-1905}
}