Exploring Open-Source Deep Learning ASR for Speech-to-Text TV program transcription

Juan M. Perero-Codosero, Javier Antón-Martín, Daniel Tapias Merino, Eduardo López-Gonzalo, Luis A. Hernández-Gómez


Deep Neural Networks (DNNs) are a fundamental part of current ASR. The state of the art is represented by “hybrid” models, in which acoustic models (AM) are designed using neural networks. However, there is increasing interest in developing end-to-end Deep Learning solutions, where a neural network is trained to predict character/grapheme or sub-word sequences that can be converted directly to words. Although several promising results have been reported for end-to-end ASR systems, it is still not clear whether they are capable of unseating hybrid systems. In this contribution, we evaluate open-source state-of-the-art hybrid and end-to-end Deep Learning ASR under the IberSpeech-RTVE Speech to Text Transcription Challenge. The hybrid ASR is based on Kaldi, and Wav2Letter is the end-to-end framework. Experiments were carried out using 6 hours of the dev1 and dev2 partitions. The lowest WER on the reference TV show (LM-20171107) was 22.23% for the hybrid system (lowercase format without punctuation). The major limitation for Wav2Letter was the high computational demand of training (between 6 hours and 1 day per epoch, depending on the training set). This forced us to stop the training process to meet the Challenge deadline. However, we believe that with more training time it will provide results competitive with the hybrid system.
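The WER figure above is reported on lowercase text without punctuation. As an illustration only, the following minimal Python sketch shows how such a score can be computed: both transcripts are normalized and the word-level edit distance is divided by the number of reference words. The function names `normalize` and `wer` are hypothetical, and the normalization rule (replacing every Unicode punctuation character with a space) is an assumption; the Challenge's official scoring tool may differ.

```python
import unicodedata


def normalize(text):
    """Lowercase the text, replace punctuation with spaces, and split into words.

    Assumption: "punctuation" means any character whose Unicode category
    starts with "P" (this covers Spanish marks such as the inverted question mark).
    """
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text.lower()
    )
    return cleaned.split()


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `wer("¿Hola, qué tal?", "hola que tal")` counts one substitution ("qué" vs. "que") over three reference words. Note that accents are kept, so a missing accent is scored as an error under this assumed normalization.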


DOI: 10.21437/IberSPEECH.2018-55

Cite as: Perero-Codosero, J.M., Antón-Martín, J., Tapias Merino, D., López-Gonzalo, E., Hernández-Gómez, L.A. (2018) Exploring Open-Source Deep Learning ASR for Speech-to-Text TV program transcription. Proc. IberSPEECH 2018, 262-266, DOI: 10.21437/IberSPEECH.2018-55.


@inproceedings{Perero-Codosero2018,
  author={Juan M. Perero-Codosero and Javier Antón-Martín and Daniel {Tapias Merino} and Eduardo López-Gonzalo and Luis A. Hernández-Gómez},
  title={{Exploring Open-Source Deep Learning ASR for Speech-to-Text TV program transcription}},
  year=2018,
  booktitle={Proc. IberSPEECH 2018},
  pages={262--266},
  doi={10.21437/IberSPEECH.2018-55},
  url={http://dx.doi.org/10.21437/IberSPEECH.2018-55}
}