Spanish Statistical Parametric Speech Synthesis Using a Neural Vocoder

Antonio Bonafonte, Santiago Pascual, Georgina Dorca

During the 2000s, unit-selection text-to-speech (TTS) was the dominant commercial technology. Meanwhile, the TTS research community made a major effort to push statistical parametric speech synthesis toward similar quality with more flexibility in the generated voice. In recent years, deep learning advances applied to speech synthesis have closed the gap, especially when neural vocoders replace traditional signal-processing vocoders. In this paper we replace the waveform generation vocoder of MUSA, our Spanish TTS system, with SampleRNN, a neural vocoder recently proposed as a deep autoregressive raw waveform generation model. MUSA uses recurrent neural networks to predict vocoder parameters (MFCC and logF0) from linguistic features. The Ahocoder vocoder is then used to recover the speech waveform from the predicted parameters. We extend SampleRNN to generate speech conditioned on the Ahocoder parameters and consider two configurations for training the system. In the first, the parameters are derived from the speech signal using Ahocoder. In the second, the system is trained with the parameters predicted by MUSA, and SampleRNN and MUSA are jointly optimized. The subjective evaluation shows that the second system outperforms both the original Ahocoder and SampleRNN used as an independent neural vocoder.
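As an illustration of what conditioning a sample-level model on frame-level vocoder parameters involves, the sketch below aligns one feature vector per frame with the waveform sample rate by simple repetition. This is only a minimal sketch under stated assumptions: the feature dimension, the frame shift, and the repetition-based upsampling are hypothetical choices made for the example, and the abstract does not specify how the paper performs this step.

```python
import numpy as np


def upsample_conditioning(frames: np.ndarray, samples_per_frame: int) -> np.ndarray:
    """Repeat each frame-level feature vector once per waveform sample.

    frames: array of shape (num_frames, feat_dim), e.g. MFCC plus logF0.
    Returns an array of shape (num_frames * samples_per_frame, feat_dim).
    """
    return np.repeat(frames, samples_per_frame, axis=0)


# Hypothetical sizes, NOT taken from the paper.
num_frames = 4
feat_dim = 41            # assumed: 40 MFCC coefficients + 1 logF0 value
samples_per_frame = 80   # assumed: 5 ms frame shift at 16 kHz

rng = np.random.default_rng(0)
frames = rng.standard_normal((num_frames, feat_dim))

# One conditioning vector per waveform sample.
cond = upsample_conditioning(frames, samples_per_frame)
print(cond.shape)  # (320, 41)
```

In this sketch every sample within a frame sees the same conditioning vector; an autoregressive model would combine that vector with the previously generated samples to predict the next one.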

DOI: 10.21437/Interspeech.2018-2417

Cite as: Bonafonte, A., Pascual, S., Dorca, G. (2018) Spanish Statistical Parametric Speech Synthesis Using a Neural Vocoder. Proc. Interspeech 2018, 1998-2001, DOI: 10.21437/Interspeech.2018-2417.

  author={Antonio Bonafonte and Santiago Pascual and Georgina Dorca},
  title={Spanish Statistical Parametric Speech Synthesis Using a Neural Vocoder},
  booktitle={Proc. Interspeech 2018},