Latent linguistic embedding for cross-lingual text-to-speech and voice conversion

Hieu-Thi Luong, Junichi Yamagishi


Because the recently proposed voice cloning system NAUTILUS is capable of cloning unseen voices using untranscribed speech, we investigate the feasibility of using it to develop a unified cross-lingual TTS/VC system. Cross-lingual speech generation is the scenario in which speech utterances are generated with the voices of target speakers in a language they did not originally speak. Such a system does not simply clone the voice of the target speaker; it essentially creates a new voice that can be considered better than the original under a specific framing. By using a well-trained English latent linguistic embedding to create a cross-lingual TTS and VC system for several German, Finnish, and Mandarin speakers included in the Voice Conversion Challenge 2020, we show that our method not only achieves cross-lingual VC with high speaker similarity but can also be used seamlessly for cross-lingual TTS without any extra steps. However, the subjective evaluations of perceived naturalness seemed to vary between target speakers, which is one aspect for future improvement.


DOI: 10.21437/VCC_BC.2020-22

Cite as: Luong, H., Yamagishi, J. (2020) Latent linguistic embedding for cross-lingual text-to-speech and voice conversion. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 150-154, DOI: 10.21437/VCC_BC.2020-22.


@inproceedings{Luong2020,
  author={Hieu-Thi Luong and Junichi Yamagishi},
  title={{Latent linguistic embedding for cross-lingual text-to-speech and voice conversion}},
  year=2020,
  booktitle={Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020},
  pages={150--154},
  doi={10.21437/VCC_BC.2020-22},
  url={http://dx.doi.org/10.21437/VCC_BC.2020-22}
}