16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Sequence Generation Error (SGE) Minimization Based Deep Neural Networks Training for Text-to-Speech Synthesis

Yuchen Fan, Yao Qian, Frank K. Soong, Lei He

Microsoft, China

Feed-forward deep neural network (DNN) based text-to-speech (TTS) synthesis, which employs a multi-layered structure to exploit the statistical correlations between rich contextual information and the corresponding acoustic features, has been shown to outperform its decision-tree-based GMM-HMM counterpart. However, DNN-based TTS training has not taken the whole sequence, i.e., the sentence, into account in optimization, which results in an intrinsic inconsistency between training and testing. In this paper we propose a “sequence generation error” (SGE) minimization criterion for DNN-based TTS training. By incorporating whole-sequence parameter generation directly into the training process, the mismatch between training and testing is eliminated and the original constraints between the static and dynamic features are naturally embedded in the optimization process. Experiments on a 5-hour speech database show that DNN-based TTS trained with the new SGE minimization criterion further improves on the DNN baseline, particularly in subjective listening tests.
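
The idea of measuring the error after parameter generation, rather than frame by frame, can be illustrated with a minimal sketch. The abstract does not give the formulation, so everything below is an assumption made for illustration: a one-dimensional feature, unit variances, a simple centered delta window, and the usual least-squares form of maximum-likelihood parameter generation. The function names (`build_window_matrix`, `generate_trajectory`, `sequence_generation_error`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def build_window_matrix(num_frames):
    """Stack static and delta windows into W with shape (2T, T).

    Assumed delta window: delta[t] = 0.5 * (c[t+1] - c[t-1]),
    with indices clamped at the sequence edges.
    """
    T = num_frames
    static = np.eye(T)
    delta = np.zeros((T, T))
    for t in range(T):
        delta[t, min(t + 1, T - 1)] += 0.5
        delta[t, max(t - 1, 0)] -= 0.5
    return np.vstack([static, delta])


def generate_trajectory(pred, W, precision=None):
    """Generate the static trajectory c from stacked predictions.

    pred holds the predicted static features followed by the predicted
    delta features, shape (2T,). Solves (W^T P W) c = W^T P pred, where
    P is a diagonal precision matrix (unit precision by default).
    """
    if precision is None:
        precision = np.ones(W.shape[0])
    WtP = W.T * precision  # scales column j of W^T by precision[j]
    return np.linalg.solve(WtP @ W, WtP @ pred)


def sequence_generation_error(pred, target_static, W):
    """Mean squared error between generated and natural static trajectories."""
    c = generate_trajectory(pred, W)
    return float(np.mean((c - target_static) ** 2))


# Toy usage: predictions that are exactly consistent with a trajectory
# reproduce that trajectory, so the sequence-level error is zero.
W = build_window_matrix(5)
c_true = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
pred = W @ c_true
error = sequence_generation_error(pred, c_true, W)
```

Under these assumptions the generated trajectory is a linear function of the network outputs, so a sequence-level error of this kind can be differentiated with respect to those outputs and used as a training loss.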


Bibliographic reference. Fan, Yuchen / Qian, Yao / Soong, Frank K. / He, Lei (2015): "Sequence generation error (SGE) minimization based deep neural networks training for text-to-speech synthesis", in INTERSPEECH-2015, 864-868.