ISCA Archive Interspeech 2023

E2E-S2S-VC: End-To-End Sequence-To-Sequence Voice Conversion

Takuma Okamoto, Tomoki Toda, Hisashi Kawai

This paper proposes end-to-end (E2E) non-autoregressive sequence-to-sequence (S2S) voice conversion (VC) models that extend two E2E text-to-speech models, VITS and JETS. In the proposed E2E-S2S-VC models, VITS-VC and JETS-VC, the input text sequences of VITS and JETS are replaced by the source speaker's acoustic feature sequences, and the E2E models (including HiFi-GAN waveform synthesizers) are trained using monotonic alignment search (MAS) without external aligners. To train MAS successfully for VC, the proposed models apply a reduction factor only to the encoder. In the proposed models, a single neural network directly converts the voice of a source speaker to that of a target speaker in an S2S manner, so the duration and prosody can also be converted directly between the source and target speech. Experiments using 1,000 parallel utterances of Japanese male and female speakers demonstrate that the proposed JETS-VC outperforms cascade non-autoregressive S2S VC models.
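The monotonic alignment search mentioned above is, at its core, a Viterbi-style dynamic program that finds the most likely monotonic alignment between encoder states and target frames. The following is a minimal NumPy sketch of that idea only; the function name, the log-likelihood input, and the simplified boundary handling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def monotonic_alignment_search(log_probs):
    """Viterbi-style MAS sketch (hypothetical helper, not the paper's code).

    log_probs[t, s] is the log-likelihood of aligning target frame t
    to encoder state s. Returns one encoder-state index per frame,
    constrained to be monotonically non-decreasing, starting at state 0
    and ending at the last state.
    """
    T, S = log_probs.shape
    # Q[t, s]: best cumulative log-likelihood of any monotonic path
    # that reaches state s at frame t.
    Q = np.full((T, S), -np.inf)
    Q[0, 0] = log_probs[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = Q[t - 1, s]                          # repeat same state
            move = Q[t - 1, s - 1] if s > 0 else -np.inf  # advance one state
            Q[t, s] = log_probs[t, s] + max(stay, move)
    # Backtrack from the final state to recover the alignment path.
    alignment = np.zeros(T, dtype=int)
    alignment[-1] = S - 1
    for t in range(T - 2, -1, -1):
        s = alignment[t + 1]
        if s > 0 and Q[t, s - 1] > Q[t, s]:
            alignment[t] = s - 1
        else:
            alignment[t] = s
    return alignment
```

For example, with a 3-frame, 2-state log-likelihood matrix favoring the diagonal, the search assigns the first frame to state 0 and the remaining frames to state 1, i.e. a strictly monotonic path.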

doi: 10.21437/Interspeech.2023-2518

Cite as: Okamoto, T., Toda, T., Kawai, H. (2023) E2E-S2S-VC: End-To-End Sequence-To-Sequence Voice Conversion. Proc. INTERSPEECH 2023, 2043-2047, doi: 10.21437/Interspeech.2023-2518

@inproceedings{okamoto23_interspeech,
  author={Takuma Okamoto and Tomoki Toda and Hisashi Kawai},
  title={{E2E-S2S-VC: End-To-End Sequence-To-Sequence Voice Conversion}},
  year={2023},
  booktitle={Proc. INTERSPEECH 2023},
  pages={2043--2047},
  doi={10.21437/Interspeech.2023-2518}
}