Robustness of Statistical Voice Conversion Based on Direct Waveform Modification Against Background Sounds

Yusuke Kurita, Kazuhiro Kobayashi, Kazuya Takeda, Tomoki Toda


This paper investigates the robustness of statistical voice conversion (VC) in noisy environments. To develop VC applications such as augmented vocal production and augmented speech production, it is necessary to handle noisy input speech, because background sounds such as external noise and accompanying sounds usually exist in real environments. We investigate the impact of background sounds on conversion performance in singing voice conversion, focusing on two main VC frameworks: 1) vocoder-based VC and 2) vocoder-free VC based on direct waveform modification. A subjective evaluation of converted singing voice quality under noisy conditions reveals that vocoder-free VC is more robust against background sounds than vocoder-based VC. We also analyze the robustness of statistical VC and show that the kurtosis ratio of power spectral components before and after conversion is useful as an objective metric for evaluating it without using any target reference signals.
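
The kurtosis ratio mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact definition used in the paper, so everything below is an assumption for illustration: the function names (`power_spectrogram`, `kurtosis`, `kurtosis_ratio`), the Hann-windowed STFT with a 1024-sample frame and 256-sample hop, the pooling of all time-frequency bins into one sample, and the use of non-excess kurtosis (fourth central moment over squared variance). The ratio is taken as the value after conversion divided by the value before conversion, so no target reference signal is needed.

```python
import numpy as np


def power_spectrogram(x, frame_len=1024, hop=256):
    """Power spectrogram from a Hann-windowed STFT (analysis settings are assumed)."""
    x = np.asarray(x, dtype=np.float64)
    if len(x) < frame_len:
        # Zero-pad short signals so that at least one frame exists.
        x = np.pad(x, (0, frame_len - len(x)))
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def kurtosis(values):
    """Non-excess kurtosis: fourth central moment divided by squared variance."""
    v = np.ravel(np.asarray(values, dtype=np.float64))
    centered = v - v.mean()
    m2 = np.mean(centered ** 2)
    m4 = np.mean(centered ** 4)
    return m4 / (m2 ** 2)


def kurtosis_ratio(before, after, frame_len=1024, hop=256):
    """Kurtosis of the converted signal's power spectrum over that of the input."""
    k_before = kurtosis(power_spectrogram(before, frame_len, hop))
    k_after = kurtosis(power_spectrogram(after, frame_len, hop))
    return k_after / k_before
```

Under these assumptions, a ratio near 1 means the conversion left the distribution of power spectral components largely unchanged, while a larger ratio means the distribution became more heavy-tailed after conversion. A constant power spectrogram (for example, an all-zero signal) has zero variance, so the sketch does not handle that case.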


DOI: 10.21437/Interspeech.2019-2206

Cite as: Kurita, Y., Kobayashi, K., Takeda, K., Toda, T. (2019) Robustness of Statistical Voice Conversion Based on Direct Waveform Modification Against Background Sounds. Proc. Interspeech 2019, 684-688, DOI: 10.21437/Interspeech.2019-2206.


@inproceedings{Kurita2019,
  author={Yusuke Kurita and Kazuhiro Kobayashi and Kazuya Takeda and Tomoki Toda},
  title={{Robustness of Statistical Voice Conversion Based on Direct Waveform Modification Against Background Sounds}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={684--688},
  doi={10.21437/Interspeech.2019-2206},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2206}
}