Joint Training Framework for Text-to-Speech and Voice Conversion Using Multi-Source Tacotron and WaveNet

Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, Junichi Yamagishi


We investigated the training of a shared model for both text-to-speech (TTS) and voice conversion (VC) tasks. We propose using an extended Tacotron architecture, a multi-source sequence-to-sequence model with a dual attention mechanism, as the shared model for both the TTS and VC tasks. The model performs either task depending on the type of input: an end-to-end speech synthesis task is conducted when the model is given text as the input, while a sequence-to-sequence voice conversion task is conducted when it is given the speech of a source speaker as the input. Waveform signals are generated by WaveNet, which is conditioned on a predicted mel-spectrogram. We propose jointly training the shared model as a decoder for a target speaker that supports multiple sources. Listening experiments show that our proposed multi-source encoder-decoder model can efficiently achieve both the TTS and VC tasks.
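As a rough illustration of the dual attention idea, a decoder step can attend separately to a text encoder and a speech encoder, then fuse the two context vectors. The sketch below is a minimal NumPy version under assumed simplifications (dot-product attention and concatenation as the fusion method; the paper's actual attention type and fusion may differ):

```python
import numpy as np

def attention(query, keys, values):
    """Dot-product attention over one encoder's timesteps."""
    scores = keys @ query                    # (T,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over time
    return weights @ values                  # context vector, shape (D,)

def dual_attention_step(query, text_keys, text_vals, speech_keys, speech_vals):
    """One decoder step attending to both sources; the two contexts are
    concatenated before the decoder projection (an assumed fusion scheme)."""
    c_text = attention(query, text_keys, text_vals)       # context from text encoder
    c_speech = attention(query, speech_keys, speech_vals) # context from speech encoder
    return np.concatenate([c_text, c_speech])

# Toy dimensions: text sequence of 5 tokens, source-speech sequence of 12 frames.
rng = np.random.default_rng(0)
D = 8
query = rng.standard_normal(D)
text_keys, text_vals = rng.standard_normal((5, D)), rng.standard_normal((5, D))
speech_keys, speech_vals = rng.standard_normal((12, D)), rng.standard_normal((12, D))
context = dual_attention_step(query, text_keys, text_vals, speech_keys, speech_vals)
print(context.shape)  # (16,)
```

In a TTS-only step the speech-encoder branch would be inactive, and vice versa for VC; the shared decoder consumes the fused context either way.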


DOI: 10.21437/Interspeech.2019-1357

Cite as: Zhang, M., Wang, X., Fang, F., Li, H., Yamagishi, J. (2019) Joint Training Framework for Text-to-Speech and Voice Conversion Using Multi-Source Tacotron and WaveNet. Proc. Interspeech 2019, 1298-1302, DOI: 10.21437/Interspeech.2019-1357.


@inproceedings{Zhang2019,
  author={Mingyang Zhang and Xin Wang and Fuming Fang and Haizhou Li and Junichi Yamagishi},
  title={{Joint Training Framework for Text-to-Speech and Voice Conversion Using Multi-Source Tacotron and WaveNet}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1298--1302},
  doi={10.21437/Interspeech.2019-1357},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1357}
}