Voice conversion (VC) modifies speech while preserving its linguistic information and is expected to play a significant role in augmented human communication. Recently, deep neural network (DNN)-based VC has attracted attention because it can synthesize high-quality speech. However, existing methods typically assume offline processing (i.e., analysis, conversion, and synthesis) and cannot be directly applied to real-time VC. Therefore, we propose a method for implementing DNN-based VC that runs online with low latency. We also propose audio data augmentation to improve the speech quality of real-time VC. Finally, we develop a mask-based real-time VC device to improve robustness against background noise. Experimental results demonstrate that 1) the proposed real-time VC runs with a real-time factor of 0.50, 2) the proposed data augmentation improves speech quality, and 3) the proposed mask-based VC device is more robust to noise than a standard microphone-based VC device.
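As a reading aid for the real-time factor (RTF) figure above, the following is a minimal sketch that assumes the common definition of RTF: processing time divided by the duration of the audio processed, so that a value below 1 means processing keeps up with the input. The abstract does not describe the authors' measurement setup; the function names, the placeholder `process` callable, and the 5 ms frame length are hypothetical and chosen only for illustration.

```python
import time


def rtf_from_times(processing_s, audio_s):
    """RTF under the assumed definition: processing time / audio duration."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def measure_rtf(process, frames, frame_duration_s):
    """Time a per-frame function over all frames and return the RTF.

    `process` stands in for any frame-wise conversion step (hypothetical);
    `frame_duration_s` is the audio duration covered by one frame.
    """
    start = time.perf_counter()
    for frame in frames:
        process(frame)
    elapsed = time.perf_counter() - start
    return rtf_from_times(elapsed, len(frames) * frame_duration_s)


if __name__ == "__main__":
    # 2.5 s of computation for 5.0 s of audio gives an RTF of 0.5.
    print(rtf_from_times(2.5, 5.0))

    # Dummy per-frame workload on 200 frames of an assumed 5 ms each.
    dummy_frames = [[0.0] * 80 for _ in range(200)]
    print(measure_rtf(lambda f: sum(f), dummy_frames, 0.005))
```

Under this definition, an RTF of 0.50 would mean that each second of input speech takes about half a second to process.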
Cite as: Arakawa, R., Takamichi, S., Saruwatari, H. (2019) Implementation of DNN-based real-time voice conversion and its improvements by audio data augmentation and mask-shaped device. Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10), 93-98, doi: 10.21437/SSW.2019-17
@inproceedings{arakawa19_ssw,
  author={Riku Arakawa and Shinnosuke Takamichi and Hiroshi Saruwatari},
  title={{Implementation of DNN-based real-time voice conversion and its improvements by audio data augmentation and mask-shaped device}},
  year=2019,
  booktitle={Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10)},
  pages={93--98},
  doi={10.21437/SSW.2019-17}
}