An RNN-Based Quantized F0 Model with Multi-Tier Feedback Links for Text-to-Speech Synthesis

Xin Wang, Shinji Takaki, Junichi Yamagishi


A recurrent-neural-network-based F0 model for text-to-speech (TTS) synthesis that generates F0 contours from textual features is proposed. In contrast to related F0 models, the proposed model is designed to learn the temporal correlation of F0 contours at multiple levels. Frame-level correlation is captured by feeding back the F0 output of the previous frame as an additional input to the current frame; correlation over longer time spans is modeled similarly, but using F0 features aggregated over the phoneme and the syllable. Another difference is that the output of the proposed model is not an interpolated continuous-valued F0 contour but a sequence of discrete symbols, comprising quantized F0 levels and a symbol for the unvoiced condition. By using discrete F0 symbols, the proposed model avoids the influence of artificially interpolated F0 curves. Experiments demonstrated that the proposed F0 model, trained with a dropout strategy, generated smooth F0 contours with better perceived quality than those from baseline RNN models.
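To make the output representation concrete, the sketch below quantizes a continuous F0 contour into the kind of discrete symbol sequence the abstract describes: one symbol reserved for unvoiced frames, plus a set of quantized F0 levels. The specific choices here (64 levels, a log-F0 scale, an 80–400 Hz range, symbol 0 for unvoiced) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def quantize_f0(f0, num_levels=64, f0_min=80.0, f0_max=400.0):
    """Map a continuous F0 contour (Hz) to discrete symbols.

    Symbol 0 marks unvoiced frames (f0 <= 0); symbols 1..num_levels are
    uniformly spaced bins on a log-F0 scale. The level count, range, and
    log spacing are illustrative assumptions, not the paper's exact scheme.
    """
    f0 = np.asarray(f0, dtype=float)
    symbols = np.zeros(len(f0), dtype=int)  # 0 = unvoiced symbol
    voiced = f0 > 0
    lo, hi = np.log(f0_min), np.log(f0_max)
    log_f0 = np.log(np.clip(f0[voiced], f0_min, f0_max))
    bins = np.floor((log_f0 - lo) / (hi - lo) * num_levels).astype(int)
    symbols[voiced] = np.clip(bins, 0, num_levels - 1) + 1
    return symbols

def dequantize_f0(symbols, num_levels=64, f0_min=80.0, f0_max=400.0):
    """Invert the mapping: bin centres for voiced symbols, 0 Hz for unvoiced."""
    symbols = np.asarray(symbols)
    lo, hi = np.log(f0_min), np.log(f0_max)
    centres = np.exp(lo + (symbols - 0.5) / num_levels * (hi - lo))
    return np.where(symbols == 0, 0.0, centres)

# A toy contour in Hz; 0 denotes an unvoiced frame.
contour = [0.0, 120.0, 150.0, 0.0, 220.0]
syms = quantize_f0(contour)
```

Note that no interpolation across unvoiced regions is needed: the model can predict the unvoiced symbol directly, which is the property the abstract highlights.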


DOI: 10.21437/Interspeech.2017-246

Cite as: Wang, X., Takaki, S., Yamagishi, J. (2017) An RNN-Based Quantized F0 Model with Multi-Tier Feedback Links for Text-to-Speech Synthesis. Proc. Interspeech 2017, 1059-1063, DOI: 10.21437/Interspeech.2017-246.


@inproceedings{Wang2017,
  author={Xin Wang and Shinji Takaki and Junichi Yamagishi},
  title={An RNN-Based Quantized F0 Model with Multi-Tier Feedback Links for Text-to-Speech Synthesis},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1059--1063},
  doi={10.21437/Interspeech.2017-246},
  url={http://dx.doi.org/10.21437/Interspeech.2017-246}
}