Adversarially Trained End-to-End Korean Singing Voice Synthesis System

Juheon Lee, Hyeong-Seok Choi, Chang-Bin Jeon, Junghyun Koo, Kyogu Lee


In this paper, we propose an end-to-end Korean singing voice synthesis system that generates singing from lyrics and a symbolic melody using three novel approaches: 1) phonetic enhancement masking, 2) local conditioning of text and pitch to the super-resolution network, and 3) conditional adversarial training. The proposed system consists of two main modules: a mel-synthesis network that generates a mel-spectrogram from the given input information, and a super-resolution network that upsamples the generated mel-spectrogram into a linear-spectrogram. In the mel-synthesis network, phonetic enhancement masking is applied to generate implicit formant masks solely from the input text, which enables more accurate phonetic control of the singing voice. In addition, we show that the two other proposed methods (local conditioning of text and pitch, and conditional adversarial training) are crucial for realistic generation of the human singing voice in the super-resolution process. Finally, both quantitative and qualitative evaluations are conducted, confirming the validity of all proposed methods.
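
The two-module data flow described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the tiny dimensions, the closed-form placeholder "networks", and the choice to apply the text-only mask by element-wise multiplication are all assumptions made for illustration, since the abstract does not specify them. Adversarial training is omitted.

```python
import math

N_MELS = 4     # toy number of mel bins (hypothetical)
N_LINEAR = 8   # toy number of linear-frequency bins (hypothetical)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def text_mask(t):
    """Placeholder mask computed from the text feature only, in (0, 1)."""
    return [sigmoid(t - k) for k in range(N_MELS)]


def mel_synthesis(text_feats, pitch_feats):
    """Stage 1: one mel frame per time step from text and pitch."""
    frames = []
    for t, p in zip(text_feats, pitch_feats):
        # Placeholder decoder output that depends on both text and pitch.
        decoded = [abs(math.sin(t + p + k)) for k in range(N_MELS)]
        # Assumption: the text-only mask is applied element-wise.
        frames.append([d * m for d, m in zip(decoded, text_mask(t))])
    return frames


def super_resolution(mel, text_feats, pitch_feats):
    """Stage 2: upsample each mel frame to N_LINEAR bins.

    Text and pitch enter again as a per-frame (local) conditioning term.
    """
    out = []
    for frame, t, p in zip(mel, text_feats, pitch_feats):
        cond = 0.1 * sigmoid(t) + 0.1 * sigmoid(p)
        linear = []
        for j in range(N_LINEAR):
            # Linear interpolation along the frequency axis.
            pos = j * (N_MELS - 1) / (N_LINEAR - 1)
            lo = int(pos)
            hi = min(lo + 1, N_MELS - 1)
            frac = pos - lo
            linear.append((1 - frac) * frame[lo] + frac * frame[hi] + cond)
        out.append(linear)
    return out


# Three time steps of made-up text and pitch features.
text = [0.1, 0.5, 0.9]
pitch = [60.0, 62.0, 64.0]
mel = mel_synthesis(text, pitch)
linear = super_resolution(mel, text, pitch)
print(len(mel), len(mel[0]), len(linear), len(linear[0]))
```

The point of the sketch is only the shape of the pipeline: the mask sees text alone, the mel frames see text and pitch, and the upsampling stage receives text and pitch a second time alongside the mel frames.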


DOI: 10.21437/Interspeech.2019-1722

Cite as: Lee, J., Choi, H., Jeon, C., Koo, J., Lee, K. (2019) Adversarially Trained End-to-End Korean Singing Voice Synthesis System. Proc. Interspeech 2019, 2588-2592, DOI: 10.21437/Interspeech.2019-1722.


@inproceedings{Lee2019,
  author={Juheon Lee and Hyeong-Seok Choi and Chang-Bin Jeon and Junghyun Koo and Kyogu Lee},
  title={{Adversarially Trained End-to-End Korean Singing Voice Synthesis System}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2588--2592},
  doi={10.21437/Interspeech.2019-1722},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1722}
}