State-of-the-art deep neural network (DNN) speech synthesis systems map linguistic input directly to acoustic output, and improved voice quality over the conventional MSD-GMM-HMM synthesis system has been reported. Unlike GMM-HMM systems, DNN-based speech synthesis does not require context clustering, and this was believed to be its main advantage and the main contributor to the performance improvement. Our previous work demonstrated that F0 interpolation, rather than the absence of context clustering, is the actual contributor to the improvement. However, it remains unknown whether the use of unclustered context is a beneficial characteristic of DNN-based synthesis. This paper investigates the issue in detail. Decision-tree-clustered contexts are used as the linguistic input to the DNN and compared with unclustered context input. A novel approach to inputting context clusters is also proposed, in which the decision tree question indicators are used as input instead of the clustered contexts. Experiments showed that the DNN with clustered contexts significantly outperformed the DNN with unclustered contexts, and that the proposed question indicator input obtained the best performance. The investigation reveals a limitation of DNN-based speech synthesis and implies that context clustering is also an important issue for DNN-based speech synthesis when training data is limited.
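The difference between the two clustered input types can be illustrated with a minimal Python sketch. It is one plausible reading of the abstract, not the authors' implementation: the toy tree, the three questions, the context keys (`prev`, `cur`, `next`) and the function names are all hypothetical, and the abstract does not say whether the indicator vector covers every question in the tree (as assumed here) or only those on the path to the leaf.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Context = Dict[str, str]  # hypothetical context: phone identities by position


@dataclass
class Leaf:
    cluster_id: int  # index of the clustered context (tree leaf)


@dataclass
class Node:
    question_id: int                     # index of this question in the input vector
    question: Callable[[Context], bool]  # yes/no question about the context
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def cluster_one_hot(tree: Union[Node, Leaf], ctx: Context, num_clusters: int) -> List[float]:
    """Clustered-context input: one-hot vector of the leaf the context falls into."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(ctx) else node.no
    vec = [0.0] * num_clusters
    vec[node.cluster_id] = 1.0
    return vec


def question_indicators(tree: Union[Node, Leaf], ctx: Context, num_questions: int) -> List[float]:
    """Question-indicator input (assumed form): one binary answer per tree question."""
    vec = [0.0] * num_questions
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, Node):
            vec[node.question_id] = 1.0 if node.question(ctx) else 0.0
            stack.extend([node.yes, node.no])
    return vec


# Toy tree with three hypothetical questions and four leaves.
VOWELS = {"a", "e", "i", "o", "u"}
NASALS = {"m", "n", "ng"}
toy_tree = Node(
    0, lambda c: c["cur"] in VOWELS,
    yes=Node(1, lambda c: c["next"] == "sil", yes=Leaf(0), no=Leaf(1)),
    no=Node(2, lambda c: c["prev"] in NASALS, yes=Leaf(2), no=Leaf(3)),
)

ctx = {"prev": "n", "cur": "a", "next": "sil"}
one_hot = cluster_one_hot(toy_tree, ctx, num_clusters=4)
indicators = question_indicators(toy_tree, ctx, num_questions=3)
```

Under these assumptions the one-hot vector identifies only the leaf reached, whereas the indicator vector also records the answer to the question on the branch not taken (here, whether the previous phone is a nasal), so contexts that fall into the same cluster can still receive different inputs.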
Bibliographic reference. Chen, Bo / Chen, Zhehuai / Xu, Jiachen / Yu, Kai (2015): "An investigation of context clustering for statistical speech synthesis with deep neural network", In INTERSPEECH-2015, 2212-2216.