Implementation of DNN-based real-time voice conversion and its improvements by audio data augmentation and mask-shaped device

Riku Arakawa, Shinnosuke Takamichi, Hiroshi Saruwatari


Voice conversion (VC) enables us to change speech while preserving the linguistic information and is expected to play a significant role in augmented human communication. Recently, deep neural network (DNN)-based VC has been attracting attention because it can synthesize high-quality speech. However, existing methods typically assume offline processes (i.e., analysis, conversion, and synthesis) and cannot be directly applied to real-time VC. Therefore, we propose an implementation method of DNN-based VC that works online with low latency. We also propose audio data augmentation to improve the speech quality of real-time VC. Finally, we develop a mask-based real-time VC device to improve robustness against background noise. Experimental results demonstrate that 1) the proposed real-time VC runs at a real-time factor of 0.50, 2) the proposed data augmentation improves speech quality, and 3) the proposed mask-based VC device is more robust to noise than a standard microphone-based VC device.
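For readers unfamiliar with the real-time factor reported in the abstract, the following is a minimal sketch of how the metric is commonly defined: processing time divided by the duration of the audio processed, so that a value below 1.0 means the system keeps up with its input. The function name and the example timings are hypothetical and only illustrate the arithmetic; the abstract does not describe the authors' measurement procedure.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by the duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings (not from the paper): 2.5 s of computation
# spent converting 5.0 s of speech.
rtf = real_time_factor(2.5, 5.0)
print(f"real-time factor: {rtf:.2f}")

# A value below 1.0 means conversion finishes faster than the audio plays.
print("keeps up with input:", rtf < 1.0)
```

Under this definition, a real-time factor of 0.50 means that conversion takes half as long as the speech it processes.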


DOI: 10.21437/SSW.2019-17

Cite as: Arakawa, R., Takamichi, S., Saruwatari, H. (2019) Implementation of DNN-based real-time voice conversion and its improvements by audio data augmentation and mask-shaped device. Proc. 10th ISCA Speech Synthesis Workshop, 93-98, DOI: 10.21437/SSW.2019-17.


@inproceedings{Arakawa2019,
  author={Riku Arakawa and Shinnosuke Takamichi and Hiroshi Saruwatari},
  title={{Implementation of DNN-based real-time voice conversion and its improvements by audio data augmentation and mask-shaped device}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  pages={93--98},
  doi={10.21437/SSW.2019-17},
  url={http://dx.doi.org/10.21437/SSW.2019-17}
}