A novel method is proposed to improve duration prediction for HMM-based speech synthesis. Starting from the decision tree trained by the conventional HTS training method, the duration instances of every leaf node are further clustered into several classes by K-means clustering, and the mapping functions between the context features and the class labels are trained with a conditional random field (CRF). Instead of using the mean of the Gaussian distribution of a leaf node in the decision tree as the predicted duration, the weighted sum of the centroids of these clustered classes is used to predict the phoneme duration. The weights are given by the output probability provided by the CRF for the input context features and by the prior probability obtained from the clustering results. Compared with the conventional HTS method, experiments show that the proposed method significantly reduces RMSE in objective evaluations and achieves better preference scores in subjective evaluations.
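The weighted-sum prediction step described above can be illustrated with a minimal sketch. The abstract does not state how the CRF output probability and the clustering prior are combined into a weight, so this sketch assumes they are multiplied and then renormalized; that combination rule, the function name `predict_duration`, and all numeric values are hypothetical and are not taken from the paper.

```python
def predict_duration(centroids, crf_posteriors, cluster_priors):
    """Weighted sum of the K-means centroids of one leaf node.

    Hypothetical helper, not the authors' implementation.
    centroids:      duration centroid of each cluster in the leaf
    crf_posteriors: CRF output probability of each cluster given the context
    cluster_priors: prior probability of each cluster from the clustering
    """
    if not (len(centroids) == len(crf_posteriors) == len(cluster_priors)):
        raise ValueError("all inputs need exactly one entry per cluster")

    # ASSUMPTION: the weight is the product of CRF probability and prior,
    # renormalized to sum to one. The abstract does not specify this rule.
    raw = [p * q for p, q in zip(crf_posteriors, cluster_priors)]
    total = sum(raw)
    if total <= 0:
        raise ValueError("combined weights must have positive mass")
    weights = [r / total for r in raw]

    # Predicted duration is the weighted sum of the cluster centroids.
    return sum(w * c for w, c in zip(weights, centroids))


# Illustrative values only (durations in frames), not data from the paper.
centroids = [4.0, 8.0, 12.0]
crf_posteriors = [0.2, 0.5, 0.3]
cluster_priors = [0.5, 0.25, 0.25]
duration = predict_duration(centroids, crf_posteriors, cluster_priors)
```

With a single cluster per leaf, the sketch reduces to returning that cluster's centroid, which corresponds to the conventional practice of using one value per leaf node.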
Bibliographic reference. Kang, Yongguo / Li, Jian / Deng, Yan / Wang, Miaomiao (2013): "Multi-centroidal duration generation algorithm for HMM-based TTS", In INTERSPEECH-2013, 1540-1543.