ISCA Archive Interspeech 2021

ASR Posterior-Based Loss for Multi-Task End-to-End Speech Translation

Yuka Ko, Katsuhito Sudoh, Sakriani Sakti, Satoshi Nakamura

End-to-end speech translation (ST) translates source language speech directly into the target language, without the intermediate automatic speech recognition (ASR) output used in a cascading approach. End-to-end ST has the advantage of avoiding error propagation from intermediate ASR results, but its performance still lags behind that of the cascading approach. A recent effort to improve performance is multi-task learning with an auxiliary ASR task. However, previous multi-task learning for end-to-end ST uses cross-entropy (CE) loss against one-hot references in the ASR task and does not consider ASR confusion. In this study, we propose a novel end-to-end ST training method using an ASR loss against ASR posterior distributions given by a pre-trained model, which we call ASR posterior-based loss. The proposed method is expected to account for possible ASR confusion caused by competing hypotheses with similar pronunciations. In our Fisher Spanish-to-English translation experiments, the proposed method achieved better BLEU scores than the baseline using standard CE loss with label smoothing.
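The core idea of the proposed loss can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes the auxiliary ASR loss is a cross-entropy against the pre-trained model's soft posterior distribution rather than a one-hot reference, and the function names and the interpolation weight `lam` are hypothetical.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def posterior_ce(student_logits, teacher_posterior):
    """Cross-entropy of the ASR sub-task output against a soft teacher
    posterior (instead of a one-hot reference), averaged over positions."""
    log_p = np.log(softmax(student_logits))
    return -(teacher_posterior * log_p).sum(axis=-1).mean()

def multitask_loss(st_loss, asr_logits, asr_teacher_posterior, lam=0.3):
    # Hypothetical weighted combination of the ST loss and the
    # ASR posterior-based loss; lam is not a value from the paper.
    return (1.0 - lam) * st_loss + lam * posterior_ce(asr_logits, asr_teacher_posterior)
```

Because the teacher posterior spreads probability mass over acoustically confusable tokens, the auxiliary loss penalizes the model less for plausible recognition alternatives than a one-hot CE target would.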

doi: 10.21437/Interspeech.2021-1105

Cite as: Ko, Y., Sudoh, K., Sakti, S., Nakamura, S. (2021) ASR Posterior-Based Loss for Multi-Task End-to-End Speech Translation. Proc. Interspeech 2021, 2272-2276, doi: 10.21437/Interspeech.2021-1105

@inproceedings{ko21_interspeech,
  author={Yuka Ko and Katsuhito Sudoh and Sakriani Sakti and Satoshi Nakamura},
  title={{ASR Posterior-Based Loss for Multi-Task End-to-End Speech Translation}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={2272--2276},
  doi={10.21437/Interspeech.2021-1105}
}