Multi-task WaveNet: A Multi-task Generative Model for Statistical Parametric Speech Synthesis without Fundamental Frequency Conditions

Yu Gu, Yongguo Kang


This paper introduces an improved generative model for statistical parametric speech synthesis (SPSS) based on WaveNet under a multi-task learning framework. Unlike the original WaveNet model, the proposed Multi-task WaveNet treats frame-level acoustic feature prediction as a secondary task, removing the need for the external fundamental frequency (F0) prediction model that the original WaveNet requires. The improved WaveNet can therefore generate high-quality speech waveforms conditioned only on linguistic features. By avoiding the accumulation of pitch prediction errors, Multi-task WaveNet produces more natural and expressive speech, and its inference procedure is simpler than that of the original WaveNet. Experimental results show that the proposed SPSS method outperforms the state-of-the-art approach based on the original WaveNet in both objective and subjective preference tests.
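The abstract describes a multi-task objective: the primary task is waveform generation (typically a categorical distribution over quantized samples in WaveNet-style models), and the secondary task is frame-level acoustic feature prediction from a shared representation. The sketch below illustrates one plausible way such a combined loss could look; the exact loss terms, weighting, and feature choices here are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def multitask_loss(sample_logits, sample_targets, frame_preds, frame_targets, weight=1.0):
    """Hypothetical combined objective: waveform cross-entropy (primary task)
    plus frame-level acoustic-feature MSE (secondary task).
    The specific weighting scheme is an assumption, not from the paper.

    sample_logits : (T, Q) unnormalized scores over Q quantized sample values
    sample_targets: (T,) integer class ids of the true waveform samples
    frame_preds   : (F, D) predicted frame-level acoustic features
    frame_targets : (F, D) ground-truth frame-level acoustic features
    """
    # Primary task: categorical cross-entropy over quantized waveform samples,
    # computed via a numerically stable log-softmax.
    shifted = sample_logits - sample_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(sample_targets)), sample_targets].mean()

    # Secondary task: mean squared error on frame-level acoustic features
    # (e.g. spectral parameters) predicted from the shared hidden representation.
    mse = np.mean((frame_preds - frame_targets) ** 2)

    # Total loss: primary term plus a weighted secondary term.
    return ce + weight * mse
```

In a multi-task setup like the one the abstract outlines, the secondary loss acts as a regularizer during training and can be dropped at inference time, which is consistent with the paper's claim of a simpler inference procedure.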


DOI: 10.21437/Interspeech.2018-1506

Cite as: Gu, Y., Kang, Y. (2018) Multi-task WaveNet: A Multi-task Generative Model for Statistical Parametric Speech Synthesis without Fundamental Frequency Conditions. Proc. Interspeech 2018, 2007-2011, DOI: 10.21437/Interspeech.2018-1506.


@inproceedings{Gu2018,
  author={Yu Gu and Yongguo Kang},
  title={Multi-task WaveNet: A Multi-task Generative Model for Statistical Parametric Speech Synthesis without Fundamental Frequency Conditions},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2007--2011},
  doi={10.21437/Interspeech.2018-1506},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1506}
}