Discrete Duration Model for Speech Synthesis

Bo Chen, Tianling Bian, Kai Yu


The acoustic model and the duration model are the two major components of statistical parametric speech synthesis (SPSS) systems. Neural-network-based acoustic models make it possible to model phoneme duration at the phone level rather than at the state level, as in conventional hidden Markov model (HMM) based SPSS systems. Since phoneme duration is a countable value, the phone-level duration distribution given the linguistic features is discrete, which means the Gaussian assumption is no longer necessary. This paper investigates the performance of an LSTM-RNN duration model that directly models the probability of the countable duration values given the linguistic features, using cross entropy as the training criterion. Multi-task learning is also examined and compared with the standard LSTM-RNN duration model in objective and subjective measures. The results show that directly modeling the discrete distribution is beneficial, and that the multi-task model achieves better performance in phone-level duration modeling.
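The key change the abstract describes is replacing a Gaussian (mean-squared-error) duration regression with a classification over countable duration values trained by cross entropy. A minimal numpy sketch of such a loss (the function and variable names are illustrative assumptions, not the authors' code; the logits stand in for the LSTM-RNN output layer):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def discrete_duration_loss(logits, durations):
    """Cross entropy over discrete duration values (e.g. in frames).

    logits:    (batch, max_duration) unnormalized scores, one bin per
               possible duration value -- stands in for the network output.
    durations: (batch,) integer target durations.
    """
    probs = softmax(logits)
    # Negative log-probability of the observed duration of each phone.
    return -np.mean(np.log(probs[np.arange(len(durations)), durations]))

# Toy usage: 2 phones, durations of up to 5 frames.
logits = np.array([[0.1, 2.0, 0.3, 0.0, -1.0],
                   [0.0, 0.5, 0.2, 3.0,  0.1]])
targets = np.array([1, 3])
loss = discrete_duration_loss(logits, targets)
```

At synthesis time, the predicted duration can then be taken as the argmax or the expectation of this discrete distribution, rather than the mean of a fitted Gaussian.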


DOI: 10.21437/Interspeech.2017-1144

Cite as: Chen, B., Bian, T., Yu, K. (2017) Discrete Duration Model for Speech Synthesis. Proc. Interspeech 2017, 789-793, DOI: 10.21437/Interspeech.2017-1144.


@inproceedings{Chen2017,
  author={Bo Chen and Tianling Bian and Kai Yu},
  title={Discrete Duration Model for Speech Synthesis},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={789--793},
  doi={10.21437/Interspeech.2017-1144},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1144}
}