Comparison of Modeling Target in LSTM-RNN Duration Model

Bo Chen, Jiahao Lai, Kai Yu


Speech duration is an important component in statistical parametric speech synthesis (SPSS). In an LSTM-RNN based SPSS system, speech duration affects the quality of the synthesized speech in two ways: through the prosody of the speech and through the position features used in the acoustic model. This paper investigates the effects of duration in an LSTM-RNN based SPSS system. The performance of acoustic models with position features at different levels is compared, and duration models with different network architectures are presented. A method is proposed that exploits the prior knowledge that the state durations of a phoneme should sum to the phone duration, and it is shown to perform better in both state duration and phone duration modeling. The results show that the acoustic model with state-level position features performs better in acoustic modeling (especially in voiced/unvoiced classification), which means that state-level duration modeling still has an advantage, and that duration models using this prior knowledge can yield better speech quality.
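
The constraint mentioned in the abstract, that the state durations of a phoneme should sum to the phone duration, can be illustrated with a small sketch. This is not the method proposed in the paper, whose details are not given here. It only shows one simple way to make independently predicted state durations consistent with a phone duration, using proportional rescaling with largest-remainder rounding. The function name `rescale_state_durations` and the example numbers are hypothetical.

```python
def rescale_state_durations(state_frames, phone_frames):
    """Rescale per-state durations so they sum exactly to phone_frames.

    Hypothetical helper for illustration only, not the paper's method.
    state_frames: predicted (possibly fractional) frame counts, one per state.
    phone_frames: integer phone duration in frames.
    """
    total = sum(state_frames)
    if total <= 0:
        raise ValueError("state durations must sum to a positive value")

    # Give each state a share of the phone duration proportional to its prediction.
    scaled = [d * phone_frames / total for d in state_frames]

    # Round down first, then hand the leftover frames to the states
    # with the largest fractional parts (largest-remainder rounding).
    frames = [int(s) for s in scaled]
    leftover = phone_frames - sum(frames)
    by_fraction = sorted(
        range(len(scaled)), key=lambda i: scaled[i] - frames[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        frames[i] += 1
    return frames


# Assumed example: five states per phone and a phone duration of 20 frames.
durations = rescale_state_durations([2.3, 4.1, 6.0, 3.2, 1.9], 20)
print(durations, sum(durations))
```

Post-hoc rescaling of this kind enforces the sum constraint only after prediction. A model that builds the constraint into its modeling target, as the abstract describes, would instead learn state and phone durations jointly.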


DOI: 10.21437/Interspeech.2017-1152

Cite as: Chen, B., Lai, J., Yu, K. (2017) Comparison of Modeling Target in LSTM-RNN Duration Model. Proc. Interspeech 2017, 794-798, DOI: 10.21437/Interspeech.2017-1152.


@inproceedings{Chen2017,
  author={Bo Chen and Jiahao Lai and Kai Yu},
  title={Comparison of Modeling Target in LSTM-RNN Duration Model},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={794--798},
  doi={10.21437/Interspeech.2017-1152},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1152}
}