DNN-based Speech Synthesis for Small Data Sets Considering Bidirectional Speech-Text Conversion

Kentaro Sone, Toru Nakashika


In statistical parametric speech synthesis, approaches based on deep neural networks (DNNs) have improved the quality of synthesized speech. However, DNN-based approaches generally require a large amount of training data to synthesize natural speech, and it is not practical to record many hours of speech from a single speaker. To address this problem, this paper presents a novel pre-training method for DNN-based speech synthesis systems trained on small data sets. The method uses a Gaussian-Categorical deep relational model (GCDRM), which represents a joint probability of two visible variables, to describe the joint distribution of acoustic features and linguistic features. During maximum-likelihood training, the model seeks parameters of a deep architecture that account for bidirectional conversion between 1) acoustic features generated from linguistic features and 2) linguistic features re-generated from the acoustic features that the model itself generated. Because the method considers whether the generated acoustic features are recognizable, it can obtain reasonable parameters from small data sets. Experimental results show that DNN-based systems pre-trained with the proposed method outperformed randomly initialized DNN-based systems. The method also outperformed DNN-based systems in a speaker-dependent speech recognition task.
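
The bidirectional idea in the abstract can be illustrated with a small sketch. This is not the GCDRM or the authors' training procedure: it is a simplified, deterministic stand-in that assumes one linear layer in each direction, a squared-error term for synthesis, and a cross-entropy term for recognizing the generated features. All names (`forward`, `backward`, `bidirectional_loss`, `weight`) are hypothetical.

```python
import math


def forward(W, b, onehot):
    """Text -> speech: map a one-hot linguistic vector to acoustic features."""
    return [sum(w * x for w, x in zip(row, onehot)) + bi
            for row, bi in zip(W, b)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def backward(V, c, acoustic):
    """Speech -> text: map acoustic features to class probabilities."""
    logits = [sum(v * a for v, a in zip(row, acoustic)) + ci
              for row, ci in zip(V, c)]
    return softmax(logits)


def bidirectional_loss(W, b, V, c, onehot, target_acoustic, weight=1.0):
    """Synthesis error plus the error of recognizing the generated speech.

    Assumption: a weighted sum of the two terms; the paper instead trains
    a joint probabilistic model by maximum likelihood.
    """
    generated = forward(W, b, onehot)
    # 1) How far the generated acoustic features are from the recording.
    synth_err = sum((g - t) ** 2 for g, t in zip(generated, target_acoustic))
    # 2) How well the original label is recovered from the generated features.
    probs = backward(V, c, generated)
    recog_err = -math.log(probs[onehot.index(1)])
    return synth_err + weight * recog_err


# Toy example: 2 linguistic classes, 2 acoustic dimensions.
W, b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
V, c = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
loss = bidirectional_loss(W, b, V, c, [1, 0], [1.0, 0.0])
```

In this toy setting the synthesis term is zero, so the remaining loss comes entirely from the uninformative recognizer, which is the signal the second term is meant to provide.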


DOI: 10.21437/Interspeech.2018-1460

Cite as: Sone, K., Nakashika, T. (2018) DNN-based Speech Synthesis for Small Data Sets Considering Bidirectional Speech-Text Conversion. Proc. Interspeech 2018, 2519-2523, DOI: 10.21437/Interspeech.2018-1460.


@inproceedings{Sone2018,
  author={Kentaro Sone and Toru Nakashika},
  title={DNN-based Speech Synthesis for Small Data Sets Considering Bidirectional Speech-Text Conversion},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2519--2523},
  doi={10.21437/Interspeech.2018-1460},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1460}
}