ISCA Archive SSW 2019

Implementation of DNN-based real-time voice conversion and its improvements by audio data augmentation and mask-shaped device

Riku Arakawa, Shinnosuke Takamichi, Hiroshi Saruwatari

Voice conversion (VC) enables us to change speech while preserving the linguistic information and is expected to play a significant role in augmented human communication. Recently, deep neural network (DNN)-based VC has been attracting attention because it can synthesize high-quality speech. However, existing methods typically assume offline processes (i.e., analysis, conversion, and synthesis) and cannot be directly applied to real-time VC. Therefore, we propose an implementation of DNN-based VC that works online with low latency. We also propose audio data augmentation to improve the speech quality of real-time VC. Finally, we develop a mask-based real-time VC device to improve robustness against background noise. Experimental results demonstrate that 1) the proposed real-time VC works with a real-time factor of 0.50, 2) the proposed data augmentation improves speech quality, and 3) the proposed mask-based VC device is more robust to noise than a standard microphone-based VC device.
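
For readers unfamiliar with the metric, the real-time factor (RTF) is conventionally the processing time divided by the duration of the audio processed, so a value below 1.0 means processing keeps up with the input. The following is a minimal sketch of that ratio only; the function names `real_time_factor` and `measure_rtf` are hypothetical, and the abstract does not describe the authors' measurement procedure.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (conventional RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process_fn, samples, sample_rate):
    """Time a hypothetical processing function on a buffer of samples.

    process_fn stands in for any conversion pipeline; it is an assumption
    for illustration, not the system described in the paper.
    """
    start = time.perf_counter()
    process_fn(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)
```

Under this definition, an RTF of 0.50 corresponds to spending half a second of computation per second of input speech, for example `real_time_factor(0.5, 1.0)`.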


doi: 10.21437/SSW.2019-17

Cite as: Arakawa, R., Takamichi, S., Saruwatari, H. (2019) Implementation of DNN-based real-time voice conversion and its improvements by audio data augmentation and mask-shaped device. Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10), 93-98, doi: 10.21437/SSW.2019-17

@inproceedings{arakawa19_ssw,
  author={Riku Arakawa and Shinnosuke Takamichi and Hiroshi Saruwatari},
  title={{Implementation of DNN-based real-time voice conversion and its improvements by audio data augmentation and mask-shaped device}},
  year=2019,
  booktitle={Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10)},
  pages={93--98},
  doi={10.21437/SSW.2019-17}
}