The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS

Wen-Chin Huang, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda


This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020. We consider a simple approach to voice conversion (VC): first transcribe the input speech with an automatic speech recognition (ASR) model, then use the transcription to generate speech in the target speaker's voice with a text-to-speech (TTS) model. We revisit this method under a seq2seq framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit, and the many well-configured pretrained models provided by the community. Official evaluation results show that our system comes out on top among the participating systems in terms of conversion similarity, demonstrating the promising ability of seq2seq models to convert speaker identity. The implementation is made open source at https://github.com/espnet/espnet/tree/master/egs/vcc20.
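The two-stage cascade described in the abstract can be sketched as follows. This is a minimal illustration of the control flow only: the `recognize` and `synthesize` functions (and the speaker label used below) are hypothetical placeholders, not the actual ESPnet interfaces used in the released recipe, where these roles are filled by pretrained seq2seq ASR and TTS models.

```python
def recognize(source_speech: list) -> str:
    """Hypothetical ASR stand-in: map source-speaker audio to text."""
    # A real system would run a seq2seq ASR model here.
    return "transcription of the source utterance"


def synthesize(text: str, target_speaker: str) -> list:
    """Hypothetical TTS stand-in: generate target-speaker audio from text."""
    # A real system would run a seq2seq TTS model (plus a vocoder) here.
    return [0.0] * 16000  # placeholder waveform samples


def convert(source_speech: list, target_speaker: str) -> list:
    """Voice conversion as a cascade of ASR and TTS."""
    text = recognize(source_speech)          # step 1: transcribe the input speech
    return synthesize(text, target_speaker)  # step 2: resynthesize in the target voice
```

Because the intermediate representation is text, the converted speech inherits the linguistic content of the source utterance while the speaker identity is determined entirely by the TTS model, which is consistent with the strong conversion-similarity results reported above.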


DOI: 10.21437/VCC_BC.2020-24

Cite as: Huang, W., Hayashi, T., Watanabe, S., Toda, T. (2020) The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 160-164, DOI: 10.21437/VCC_BC.2020-24.


@inproceedings{Huang2020,
  author={Wen-Chin Huang and Tomoki Hayashi and Shinji Watanabe and Tomoki Toda},
  title={{The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS}},
  year=2020,
  booktitle={Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020},
  pages={160--164},
  doi={10.21437/VCC_BC.2020-24},
  url={http://dx.doi.org/10.21437/VCC_BC.2020-24}
}