ISCA Archive Interspeech 2021

Deliberation-Based Multi-Pass Speech Synthesis

Qingyun Dou, Xixin Wu, Moquan Wan, Yiting Lu, Mark J.F. Gales

Sequence-to-sequence (seq2seq) models have achieved state-of-the-art performance in a wide range of tasks, including Neural Machine Translation (NMT) and Text-To-Speech (TTS). These models are usually trained with teacher forcing, where the reference back-history is used to predict the next token. This makes training efficient, but limits performance, because at inference time the free-running back-history must be used instead. To address this problem, deliberation-based multi-pass seq2seq has been used in NMT. Here, the output sequence is generated in multiple passes, each conditioned on the initial input and the free-running output of the previous pass. This paper investigates and compares deliberation-based multi-pass seq2seq for TTS and NMT. For NMT, the simplest form of the multi-pass approach, where the free-running first-pass output is combined with the initial input, improves performance. However, applying this scheme to TTS is challenging: the multi-pass model tends to converge to the standard single-pass model, ignoring the previous output. To tackle this issue, a guided attention loss is added, enabling the system to make more extensive use of the free-running output. Experimental results confirm the above analysis and demonstrate that the proposed TTS model outperforms a strong baseline.
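
To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch, based only on the description above and not on the authors' implementation: a second-pass ("deliberation") decoder step that attends both to the encoded initial input and to the free-running first-pass output, together with a guided attention loss of the kind introduced by Tachibana et al. (2018) that penalises attention mass far from the diagonal, discouraging the model from ignoring the first pass. Module structure, tensor shapes, and the sigma value are illustrative assumptions.

import torch
import torch.nn as nn

class SecondPassDecoderStep(nn.Module):
    """One decoding step of a sketched deliberation (second-pass) decoder."""
    def __init__(self, dim: int):
        super().__init__()
        # Two attention mechanisms: one over the encoder output,
        # one over the free-running first-pass output.
        self.attn_enc = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.attn_first = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, query, enc_out, first_pass_out):
        # query:          (batch, 1,    dim)  current decoder state
        # enc_out:        (batch, T_in, dim)  encoded initial input
        # first_pass_out: (batch, T_1,  dim)  free-running first-pass output
        ctx_enc, _ = self.attn_enc(query, enc_out, enc_out)
        ctx_first, attn_w = self.attn_first(query, first_pass_out, first_pass_out)
        # Combine both contexts; attn_w is regularised by the guided loss below.
        return self.proj(torch.cat([ctx_enc, ctx_first], dim=-1)), attn_w

def guided_attention_loss(attn, sigma: float = 0.2):
    # attn: (batch, T_out, T_1) attention over the first-pass output.
    # The penalty is near zero on the diagonal n/T_out ~ t/T_1 and grows
    # away from it, so minimising it pushes the second pass to track
    # (i.e. actually read) the first-pass sequence.
    _, t_out, t_in = attn.shape
    n = torch.arange(t_out, device=attn.device).float() / max(t_out - 1, 1)
    t = torch.arange(t_in, device=attn.device).float() / max(t_in - 1, 1)
    w = 1.0 - torch.exp(-((n[:, None] - t[None, :]) ** 2) / (2 * sigma ** 2))
    return (attn * w[None]).mean()

In a full system, this guided term would presumably be added to the usual TTS regression loss with a tunable weight; the weight, like everything else here, is a hyperparameter choice the abstract does not specify.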


doi: 10.21437/Interspeech.2021-1405

Cite as: Dou, Q., Wu, X., Wan, M., Lu, Y., Gales, M.J.F. (2021) Deliberation-Based Multi-Pass Speech Synthesis. Proc. Interspeech 2021, 136-140, doi: 10.21437/Interspeech.2021-1405

@inproceedings{dou21_interspeech,
  author={Qingyun Dou and Xixin Wu and Moquan Wan and Yiting Lu and Mark J.F. Gales},
  title={{Deliberation-Based Multi-Pass Speech Synthesis}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={136--140},
  doi={10.21437/Interspeech.2021-1405}
}