This paper describes the system developed by Biometric Vox for the Albayzin Speech-To-Text Challenge, organized as part of the IberSPEECH 2020 conference. The system first applies speaker diarization to split the audio into speaker-homogeneous segments, then uses this segmentation to compute speaker-dependent fMLLR-transformed features. These speaker-adapted features are the input to a DNN acoustic model, which is adapted to the target domain using a semi-supervised self-training procedure. Finally, an RNN language model is used to rescore the lattices and produce the final transcription. Our system achieves 22% WER on the test portion of the RTVE2018 database and 30.26% WER on the 2020 evaluation set.
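For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate (WER) such as the figures above is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for illustration; the challenge's official scoring may apply text normalization that is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokens are obtained by whitespace splitting, with no
    case folding or punctuation handling.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("el gato come pescado", "el pato come"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.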
Cite as: Font, R., Grau, T. (2021) The Biometric Vox System for the Albayzin-RTVE 2020 Speech-to-Text Challenge. Proc. IberSPEECH 2021, 99-103, doi: 10.21437/IberSPEECH.2021-21
@inproceedings{font21_iberspeech, author={Roberto Font and Teresa Grau}, title={{The Biometric Vox System for the Albayzin-RTVE 2020 Speech-to-Text Challenge}}, year=2021, booktitle={Proc. IberSPEECH 2021}, pages={99--103}, doi={10.21437/IberSPEECH.2021-21} }