Toward Expressive Speech Translation: A Unified Sequence-to-Sequence LSTMs Approach for Translating Words and Emphasis

Quoc Truong Do, Sakriani Sakti, Satoshi Nakamura


Emphasis is an important piece of paralinguistic information that is used to express different intentions and attitudes or to convey emotion. Recent works have tried to translate emphasis by developing additional emphasis estimation and translation components apart from an existing speech-to-speech translation (S2ST) system. Although these approaches can preserve emphasis, they introduce more complexity to the translation pipeline. The emphasis translation component has to wait for the target language sentence and word alignments derived from a machine translation system, resulting in a significant translation delay. In this paper, we propose an approach that jointly trains and predicts words and emphasis in a unified architecture based on sequence-to-sequence models. The proposed model not only speeds up the translation pipeline but also allows us to perform joint training. Our experiments on the emphasis and word translation tasks show that the model achieves performance comparable to previous approaches on both tasks while eliminating complex dependencies.


DOI: 10.21437/Interspeech.2017-896

Cite as: Do, Q.T., Sakti, S., Nakamura, S. (2017) Toward Expressive Speech Translation: A Unified Sequence-to-Sequence LSTMs Approach for Translating Words and Emphasis. Proc. Interspeech 2017, 2640-2644, DOI: 10.21437/Interspeech.2017-896.


@inproceedings{Do2017,
  author={Quoc Truong Do and Sakriani Sakti and Satoshi Nakamura},
  title={Toward Expressive Speech Translation: A Unified Sequence-to-Sequence LSTMs Approach for Translating Words and Emphasis},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2640--2644},
  doi={10.21437/Interspeech.2017-896},
  url={http://dx.doi.org/10.21437/Interspeech.2017-896}
}