We present a multispeaker, multilingual text-to-speech (TTS) synthesis
model based on Tacotron that is able to produce high-quality speech
in multiple languages. Moreover, the model is able to transfer voices
across languages, e.g., synthesizing fluent Spanish speech using an English
speaker’s voice, without training on any bilingual or parallel
examples. Such transfer works even between distantly related languages, e.g.,
English and Mandarin.
Critical to achieving this
result are: (1) using a phonemic input representation to encourage sharing
of model capacity across languages, and (2) incorporating an adversarial
loss term to encourage the model to disentangle its representation
of speaker identity (which is perfectly correlated with language in
the training data) from the speech content. Further scaling up the
model by training on multiple speakers of each language, and incorporating
an autoencoding input to help stabilize attention during training,
results in a model that can consistently synthesize intelligible
speech for all training speakers in every language seen during training,
in either a native or a foreign accent.
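The adversarial loss in (2) is commonly realized with a gradient reversal layer, as in domain-adversarial training: a speaker classifier is trained to recover the speaker identity from the text encoding, while reversed gradients push the encoder to remove that information. The PyTorch sketch below illustrates the idea only; the names `GradReverse`, `SpeakerAdversary`, `encoder_dim`, and `lambd` are illustrative assumptions, not the authors' code, and the time-pooling here is a simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing into the encoder; no gradient for lambd.
        return -ctx.lambd * grad_output, None


class SpeakerAdversary(nn.Module):
    """Speaker classifier attached to the text encoder output through gradient
    reversal: the classifier learns to predict the speaker, while the encoder
    is trained (via the reversed gradients) to make that prediction impossible,
    disentangling speaker identity from the text/content representation."""

    def __init__(self, encoder_dim: int, num_speakers: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(encoder_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_speakers),
        )

    def forward(self, encoder_outputs: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
        # encoder_outputs: (batch, time, encoder_dim); pooled over time here
        # for simplicity (variants apply the classifier per encoder timestep).
        pooled = encoder_outputs.mean(dim=1)
        reversed_feats = GradReverse.apply(pooled, self.lambd)
        logits = self.classifier(reversed_feats)
        return F.cross_entropy(logits, speaker_ids)
```

In such a setup the adversarial term is simply added, with a small weight, to the synthesizer's spectrogram reconstruction loss (e.g. `loss = recon_loss + w * adv_loss`), while the speaker embedding is supplied to the decoder side, so only the text encoding is pushed to be speaker-independent.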
Cite as: Zhang, Y., Weiss, R.J., Zen, H., Wu, Y., Chen, Z., Skerry-Ryan, R., Jia, Y., Rosenberg, A., Ramabhadran, B. (2019) Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning. Proc. Interspeech 2019, 2080–2084, DOI: 10.21437/Interspeech.2019-2668.
@inproceedings{Zhang2019,
  author={Yu Zhang and Ron J. Weiss and Heiga Zen and Yonghui Wu and Zhifeng Chen and R.J. Skerry-Ryan and Ye Jia and Andrew Rosenberg and Bhuvana Ramabhadran},
  title={{Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={2080--2084},
  doi={10.21437/Interspeech.2019-2668},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2668}
}