This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge.
The primary system (p-streaming_1500ms_nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, an LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different sources, using the MLLP's transLectures-UPV toolkit (TLK) and TensorFlow, while the LMs were trained using SRILM (n-gram), CUED-RNNLM (LSTM) and Fairseq (Transformer), with up to 102G tokens. This system achieved 11.6% and 16.0% WER on the test-2018 and test-2020 sets, respectively. As it is streaming-enabled, it could be deployed in production environments for automatic captioning of live media streams, with a theoretical delay of 1.5 seconds.
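As a rough illustration of what a linear combination of language models means, the sketch below mixes the probabilities that three models assign to the same next word. The function name, the probabilities, and the interpolation weights are all hypothetical values chosen for the example; the abstract does not report the weights used in the submitted systems.

```python
import math


def interpolate_lm_probs(probs, weights):
    """Return sum_i weights[i] * probs[i], the linearly interpolated probability.

    probs   -- p_i(w | h) from each LM for the same word w and history h
    weights -- non-negative interpolation weights that must sum to 1
    """
    if len(probs) != len(weights):
        raise ValueError("need exactly one weight per language model")
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * p for w, p in zip(weights, probs))


# Hypothetical probabilities from the n-gram, LSTM and Transformer LMs.
p_ngram, p_lstm, p_transformer = 0.010, 0.020, 0.040
# Hypothetical interpolation weights (not taken from the paper).
weights = (0.2, 0.3, 0.5)

p_mix = interpolate_lm_probs((p_ngram, p_lstm, p_transformer), weights)
log_p_mix = math.log(p_mix)  # decoders usually work with log-probabilities
print(f"interpolated probability: {p_mix:.3f}, log: {log_p_mix:.3f}")
```

In practice such weights are typically tuned on held-out data, for instance to minimise perplexity; that tuning step is not shown here.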
Along with the primary system, we also submitted three contrastive systems. Of these, we highlight the system c2-streaming_600ms_t, which follows the same configuration as the primary one but uses a smaller context window of 0.6 seconds and only a Transformer LM. It scored 12.3% and 16.9% WER, respectively, on the same test sets, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). That is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small relative WER degradation of 6%.
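The relative degradation quoted above can be checked directly from the reported WER figures. The following sketch computes it for both test sets; the helper name is ours, and only the four WER values come from the abstract.

```python
def relative_wer_increase(baseline_wer, contrast_wer):
    """Relative WER change of the contrastive system with respect to the baseline."""
    return (contrast_wer - baseline_wer) / baseline_wer


# (primary WER, c2-streaming_600ms_t WER) in percent, as reported in the abstract.
results = {
    "test-2018": (11.6, 12.3),
    "test-2020": (16.0, 16.9),
}

for test_set, (primary, contrastive) in results.items():
    rel = relative_wer_increase(primary, contrastive)
    print(f"{test_set}: {100 * rel:.1f}% relative WER increase")
```

Both sets come out at roughly 6% relative (about 6.0% on test-2018 and 5.6% on test-2020), which is consistent with the figure stated in the abstract.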
Cite as: Jorge, J., Giménez, A., Baquero-Arnal, P., Iranzo-Sánchez, J., Pérez, A., Garcés Díaz-Munío, G.V., Silvestre-Cerdà, J.A., Civera, J., Sanchis, A., Juan, A. (2021) MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge. Proc. IberSPEECH 2021, 118-122, doi: 10.21437/IberSPEECH.2021-25
@inproceedings{jorge21_iberspeech,
  author={Javier Jorge and Adrià Giménez and Pau Baquero-Arnal and Javier Iranzo-Sánchez and Alejandro Pérez and Gonçal V. {Garcés Díaz-Munío} and Joan Albert Silvestre-Cerdà and Jorge Civera and Albert Sanchis and Alfons Juan},
  title={{MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge}},
  year=2021,
  booktitle={Proc. IberSPEECH 2021},
  pages={118--122},
  doi={10.21437/IberSPEECH.2021-25}
}