ISCA Archive IberSPEECH 2022

ViVoLAB System Description for the S2TC IberSPEECH-RTVE 2022 challenge

Antonio Miguel, Alfonso Ortega, Eduardo Lleida

In this paper we describe the ViVoLAB system for the IberSPEECH-RTVE 2022 Speech to Text Transcription Challenge. The system is a combination of several subsystems that together perform the full subtitle editing process, from the raw audio to the creation of aligned, transcribed subtitle partitions. The subsystems include a phonetic recognizer, a phonetic subword recognizer, a speaker-aware subtitle partitioner, a sequence-to-sequence translation model working with orthographic tokens to produce the desired transcription, and an optional diarization step using the previously estimated segments. Additionally, we use recurrent-network language models to improve results in the steps that involve search algorithms, such as the subword decoder and the sequence-to-sequence model. The technologies involved include self-supervised models such as WavLM to process the raw waveform, as well as convolutional, recurrent, and transformer layers. As a general design pattern, we allow all the subsystems to access previous outputs or inner information, although choosing effective communication mechanisms has been difficult given the size of the datasets and the long training times. The best solution found is described and evaluated on reference test sets from the 2018 and 2020 IberSPEECH-RTVE S2TC evaluations.
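The abstract describes a cascade of subsystems in which each stage may read the outputs of earlier stages. The following is a minimal sketch of that design pattern only; every stage name and its toy behavior is a hypothetical placeholder, not the paper's actual implementation (the real subsystems are neural models such as a WavLM front end, a subword decoder, and a sequence-to-sequence transcriber).

```python
# Hedged sketch of a staged subtitle pipeline with shared state.
# All names and behaviors here are invented for illustration.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Context:
    """Shared state: each stage may consult what earlier stages wrote,
    mirroring the design pattern of letting subsystems access previous
    outputs or inner information."""
    audio: List[float]
    outputs: dict = field(default_factory=dict)


def run_pipeline(audio: List[float],
                 stages: List[Callable[[Context], None]]) -> Context:
    """Run the stages in order; each writes into ctx.outputs."""
    ctx = Context(audio=audio)
    for stage in stages:
        stage(ctx)
    return ctx


# Toy placeholder stages (dummy outputs, invented for illustration).
def phonetic_recognizer(ctx: Context) -> None:
    ctx.outputs["phones"] = ["s", "o", "l"]

def subword_recognizer(ctx: Context) -> None:
    # May consult the phonetic output produced by the previous stage.
    ctx.outputs["subwords"] = ["so", "l"]

def subtitle_partitioner(ctx: Context) -> None:
    ctx.outputs["segments"] = [(0.0, 1.0)]  # (start, end) in seconds

def seq2seq_transcriber(ctx: Context) -> None:
    ctx.outputs["text"] = "".join(ctx.outputs["subwords"])


if __name__ == "__main__":
    result = run_pipeline(
        [0.0] * 16000,
        [phonetic_recognizer, subword_recognizer,
         subtitle_partitioner, seq2seq_transcriber],
    )
    print(result.outputs["text"])  # prints "sol"
```

The optional diarization step described in the abstract would fit this pattern as one more stage reading `ctx.outputs["segments"]`.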

Cite as: Miguel, A., Ortega, A., Lleida, E. (2022) ViVoLAB System Description for the S2TC IberSPEECH-RTVE 2022 challenge. Proc. IberSPEECH 2022, 284

@inproceedings{miguel22_iberspeech,
  author={Antonio Miguel and Alfonso Ortega and Eduardo Lleida},
  title={{ViVoLAB System Description for the S2TC IberSPEECH-RTVE 2022 challenge}},
  year=2022,
  booktitle={Proc. IberSPEECH 2022},
  pages={284}
}