End-to-End Speech Translation with the Transformer

Laura Cross Vila, Carlos Escolano, José A. R. Fonollosa, Marta R. Costa-Jussà


Speech Translation has traditionally been addressed by concatenating two tasks: Speech Recognition and Machine Translation. The main drawback of this approach is that errors compound across the two stages. Recently, neural approaches to Speech Recognition and Machine Translation have made it possible to tackle the task with an End-to-End Speech Translation architecture. In this paper, we propose using the Transformer architecture, which is based solely on attention mechanisms, to build the End-to-End Speech Translation system. As a contrastive architecture, we use the same Transformer to build the Speech Recognition and Machine Translation systems and perform Speech Translation by concatenating them. Results on the Spanish-to-English IWSLT benchmark task show that the end-to-end architecture outperforms the concatenated systems by half a BLEU point.


DOI: 10.21437/IberSPEECH.2018-13

Cite as: Cross Vila, L., Escolano, C., Fonollosa, J.A.R., R. Costa-Jussà, M. (2018) End-to-End Speech Translation with the Transformer. Proc. IberSPEECH 2018, 60-63, DOI: 10.21437/IberSPEECH.2018-13.


@inproceedings{CrossVila2018,
  author={Laura {Cross Vila} and Carlos Escolano and José A. R. Fonollosa and Marta {R. Costa-Jussà}},
  title={{End-to-End Speech Translation with the Transformer}},
  year=2018,
  booktitle={Proc. IberSPEECH 2018},
  pages={60--63},
  doi={10.21437/IberSPEECH.2018-13},
  url={http://dx.doi.org/10.21437/IberSPEECH.2018-13}
}