ISCA Archive IberSPEECH 2022

GTTS Systems for the Albayzin 2022 Speech and Text Alignment Challenge

Germán Bordel, Luis Javier Rodriguez-Fuentes, Mikel Peñagarikano, Amparo Varona

This paper describes the most relevant features of the alignment approach used by our research group (GTTS) for the Albayzin 2022 Text and Speech Alignment Challenge: Alignment of respoken subtitles (TaSAC-ST). It also presents and analyzes the results obtained by our primary and contrastive systems, focusing on the variability observed in the RTVE broadcasts used for this evaluation. The task is to provide hypothesized start and end times for each subtitle to be aligned. To that end, our systems decode the audio at the phonetic level using acoustic models trained on external (non-RTVE) data, then align the recognized sequence of phones with the phonetic transcription of the corresponding text and transfer the timestamps of the recognized phones to the aligned text. The alignment error for each subtitle is computed as the sum of the absolute values of the start and end alignment errors (with respect to a manually supervised ground truth). The median of the alignment errors (MAE) for each broadcast is reported to compare system performance. Our primary system yielded MAEs between 0.20 and 0.36 seconds on the development set, and between 0.22 and 1.30 seconds on the test set, with average MAEs of 0.295 and 0.395 seconds, respectively.
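
The error metric described in the abstract can be illustrated with a minimal Python sketch. It is not the challenge's official scoring tool: the function names (`subtitle_error`, `broadcast_mae`), the representation of each subtitle as a `(start, end)` pair in seconds, and the toy timestamps are all assumptions made for illustration.

```python
from statistics import median


def subtitle_error(hyp, ref):
    """Alignment error of one subtitle: |start error| + |end error|, in seconds.

    `hyp` and `ref` are (start, end) pairs; this layout is an assumption.
    """
    hyp_start, hyp_end = hyp
    ref_start, ref_end = ref
    return abs(hyp_start - ref_start) + abs(hyp_end - ref_end)


def broadcast_mae(hyp_times, ref_times):
    """Median of the per-subtitle alignment errors (MAE) for one broadcast."""
    if len(hyp_times) != len(ref_times):
        raise ValueError("hypothesis and reference must cover the same subtitles")
    if not hyp_times:
        raise ValueError("at least one subtitle is required")
    return median(subtitle_error(h, r) for h, r in zip(hyp_times, ref_times))


# Toy data (invented): three subtitles from a single hypothetical broadcast.
hyp = [(0.0, 2.0), (2.5, 4.0), (5.0, 7.0)]
ref = [(0.25, 2.0), (2.5, 4.5), (4.0, 7.0)]

# Per-subtitle errors are 0.25, 0.5 and 1.0 seconds, so the median is 0.5.
print(broadcast_mae(hyp, ref))  # 0.5
```

Because the median is used rather than the mean, a few badly misaligned subtitles (such as the third one in the toy data) have limited influence on the score reported for a broadcast.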

doi: 10.21437/IberSPEECH.2022-58

Cite as: Bordel, G., Rodriguez-Fuentes, L.J., Peñagarikano, M., Varona, A. (2022) GTTS Systems for the Albayzin 2022 Speech and Text Alignment Challenge. Proc. IberSPEECH 2022, 285-289, doi: 10.21437/IberSPEECH.2022-58

  author={Germán Bordel and Luis Javier Rodriguez-Fuentes and Mikel Peñagarikano and Amparo Varona},
  title={{GTTS Systems for the Albayzin 2022 Speech and Text Alignment Challenge}},
  booktitle={Proc. IberSPEECH 2022},