Feed-forward deep neural networks (DNNs) based text-to-speech (TTS) synthesis, which employs a multi-layered structure to exploit the statistical correlations between rich contextual information and the corresponding acoustic features, has been shown to outperform a decision tree based, GMM-HMM counterpart. However, the DNN-based TTS training has not taken the whole sequence, i.e., sentence, into account in optimization, hence results in some intrinsic inconsistency between training and testing. In this paper we propose a “sequence generation error” (SGE) minimization criterion for DNN-based TTS training. By incorporating the whole sequence parameter generation directly into the training process, the mismatch between training and testing is eliminated and the original constraints between the static and dynamic features are naturally embedded in the optimization process. Experimental results performed on a speech database of 5 hours show that DNN-based TTS trained with this new SGE minimization criterion can further improve the DNN baseline performance, particularly, in subjective listening tests.
Bibliographic reference. Fan, Yuchen / Qian, Yao / Soong, Frank K. / He, Lei (2015): "Sequence generation error (SGE) minimization based deep neural networks training for text-to-speech synthesis", In INTERSPEECH-2015, 864-868.