Accurate modeling and prediction of speech-sound durations are important for generating natural synthetic speech. This paper addresses both the feature representation and the training objective in order to improve the phone duration model of a speech synthesis system. On the feature side, we combine the feature representation from a gradient boosting decision tree (GBDT) with a phoneme identity embedding model (realized by jointly training a phoneme embedded vector (PEV) and a word embedded vector (WEV)) as input to a BLSTM that predicts the phone duration. The PEV replaces the one-hot phoneme identity, and the GBDT transforms the traditional contextual features. On the training objective side, we propose a new objective function that takes into account the correlation and consistency between the predicted utterance and the natural utterance. Perceptual tests indicate that the proposed methods can improve the naturalness of the synthetic speech. This gain is attributed to two factors: the proposed feature representations capture more precise contextual features, and the proposed training objective mitigates the over-averaging problem in the generated phone durations.
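The abstract does not state the form of the proposed objective, so the following is only an illustrative sketch of the general idea, not the authors' function. It assumes a loss that adds a Pearson-correlation penalty to the usual mean squared error; the names `pearson_correlation` and `duration_loss` and the weight `alpha` are hypothetical. It shows why a correlation term discourages over-averaged output: a flat prediction at the mean duration has zero correlation with the natural durations and is penalized in full.

```python
import numpy as np

def pearson_correlation(pred, target, eps=1e-8):
    """Pearson correlation between two duration sequences (eps avoids 0/0)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    pc = pred - pred.mean()
    tc = target - target.mean()
    denom = np.sqrt((pc ** 2).sum() * (tc ** 2).sum()) + eps
    return float((pc * tc).sum() / denom)

def duration_loss(pred, target, alpha=0.5):
    """Hypothetical objective: MSE plus alpha * (1 - correlation).

    This is an assumed stand-in for the paper's objective, which is
    not specified in the abstract.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = float(np.mean((pred - target) ** 2))
    corr = pearson_correlation(pred, target)
    return mse + alpha * (1.0 - corr)

# Made-up phone durations (in frames) for one utterance.
natural = np.array([5.0, 12.0, 7.0, 20.0, 9.0])
# An over-averaged prediction: every phone gets the mean duration.
flat = np.full_like(natural, natural.mean())

print(duration_loss(natural, natural))  # close to zero
print(duration_loss(flat, natural))     # MSE plus the full correlation penalty
```

In a real system the same term would be written with a differentiable framework so that it can be back-propagated through the BLSTM; NumPy is used here only to keep the sketch self-contained.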
Cite as: Zheng, Y., Tao, J., Wen, Z., Li, Y., Liu, B. (2017) Investigating Efficient Feature Representation Methods and Training Objective for BLSTM-Based Phone Duration Prediction. Proc. Interspeech 2017, 784-788, doi: 10.21437/Interspeech.2017-1086
@inproceedings{zheng17_interspeech,
  author={Yibin Zheng and Jianhua Tao and Zhengqi Wen and Ya Li and Bin Liu},
  title={{Investigating Efficient Feature Representation Methods and Training Objective for BLSTM-Based Phone Duration Prediction}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={784--788},
  doi={10.21437/Interspeech.2017-1086}
}