An Investigation of DNN-Based Speech Synthesis Using Speaker Codes

Nobukatsu Hojo, Yusuke Ijima, Hideyuki Mizuno


Recent studies have shown that DNN-based speech synthesis can produce more natural synthesized speech than conventional HMM-based speech synthesis. However, it remains an open problem whether the quality of the synthesized speech can be improved by utilizing a multi-speaker speech corpus. To address this problem, this paper proposes DNN-based speech synthesis using speaker codes as a simple method to improve the performance of the conventional speaker-dependent DNN-based method. To model speaker variation in the DNN, an augmented feature (the speaker code) is fed to the hidden layer(s) of the conventional DNN. The proposed method trains the connection weights of the whole DNN using a multi-speaker speech corpus. When synthesizing a speech parameter sequence, a target speaker is chosen from the corpus and the speaker code corresponding to that speaker is fed to the DNN to generate the speaker’s voice. We investigated the relationship between the prediction performance and the architecture of the DNNs by varying the hidden layer to which the speaker codes are fed. Experimental results showed that the proposed model outperformed the conventional speaker-dependent DNN when the model architecture was chosen optimally for the amount of training data available for the selected target speaker.
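
The idea of feeding a speaker code to a chosen hidden layer can be illustrated with a minimal sketch of the forward pass. This is not the authors' implementation: the one-hot form of the speaker code, the layer sizes, the tanh activation, the random untrained weights, and all names (`build_network`, `forward`, `CODE_LAYER`) are assumptions made purely for illustration. Training on a multi-speaker corpus is omitted.

```python
import math
import random


def one_hot(index, size):
    """Hypothetical speaker code: a one-hot vector over the corpus speakers."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases, activation):
    """Apply one fully connected layer followed by an activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def build_network(n_linguistic, n_hidden, n_layers, n_acoustic, n_speakers, code_layer, rng):
    """Build hidden layers; the layer at index `code_layer` takes extra speaker-code inputs."""
    hidden = []
    n_in = n_linguistic
    for k in range(n_layers):
        if k == code_layer:
            n_in += n_speakers  # widen this layer's input by the speaker code
        hidden.append(init_layer(n_in, n_hidden, rng))
        n_in = n_hidden
    output = init_layer(n_in, n_acoustic, rng)
    return hidden, output


def forward(network, linguistic, speaker_code, code_layer):
    """Map one frame of linguistic features to acoustic features for one speaker."""
    hidden, output = network
    h = list(linguistic)
    for k, (weights, biases) in enumerate(hidden):
        if k == code_layer:
            h = h + list(speaker_code)  # concatenate the code at the chosen layer
        h = dense(h, weights, biases, math.tanh)
    weights, biases = output
    return dense(h, weights, biases, lambda z: z)  # linear output layer


# Assumed toy sizes: 3 speakers, code fed to the second hidden layer.
rng = random.Random(0)
N_SPEAKERS = 3
CODE_LAYER = 1
net = build_network(4, 5, 2, 2, N_SPEAKERS, CODE_LAYER, rng)
frame = [0.2, -0.1, 0.7, 0.0]
out_a = forward(net, frame, one_hot(0, N_SPEAKERS), CODE_LAYER)
out_b = forward(net, frame, one_hot(1, N_SPEAKERS), CODE_LAYER)
```

In this sketch all weights are shared across speakers, and only the speaker code changes between `out_a` and `out_b`. Changing `CODE_LAYER` corresponds to the architectural choice the abstract describes as varying the hidden layer that receives the speaker codes.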


DOI: 10.21437/Interspeech.2016-589

Cite as

Hojo, N., Ijima, Y., Mizuno, H. (2016) An Investigation of DNN-Based Speech Synthesis Using Speaker Codes. Proc. Interspeech 2016, 2278-2282.

BibTeX
@inproceedings{Hojo+2016,
author={Nobukatsu Hojo and Yusuke Ijima and Hideyuki Mizuno},
title={An Investigation of DNN-Based Speech Synthesis Using Speaker Codes},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-589},
url={http://dx.doi.org/10.21437/Interspeech.2016-589},
pages={2278--2282}
}