Multi-Speaker Neural Vocoder

Oriol Barbany, Antonio Bonafonte, Santiago Pascual


Statistical Parametric Speech Synthesis (SPSS) offers more flexibility than unit-selection based speech synthesis, which was the dominant commercial technology during the 2000s. However, classical SPSS systems generate speech with lower naturalness than unit-selection methods. Deep learning-based SPSS, thanks to recurrent architectures, surpasses the limits of classical SPSS. These architectures offer high-quality speech while preserving the desired flexibility in choosing parameters such as the speaker, the intonation, etc. This paper presents two proposals conceived to improve deep learning-based text-to-speech systems. First, a baseline model is obtained by adapting SampleRNN to act as a speaker-independent neural vocoder that generates the speech waveform from acoustic parameters. Then two approaches are proposed to improve the quality: speaker-dependent normalization of the acoustic features, and look ahead, which consists of feeding acoustic features of future frames to the network with the aim of better modeling the present waveform and avoiding possible discontinuities. Human listeners prefer the system that combines both techniques, which reaches a mean opinion score (MOS) of 4 with the balanced dataset and outperforms the other models.
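
The two conditioning techniques named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `speaker_normalize` and `add_look_ahead`, the use of z-score statistics (rather than, say, min-max scaling), the single-utterance input, and the padding of the final frames by repetition are all assumptions made for illustration.

```python
import numpy as np


def speaker_normalize(features, speaker_ids):
    """Normalize acoustic features with per-speaker statistics.

    Assumption: z-score normalization; the abstract does not state which
    statistics are used. `features` is (n_frames, n_dims) and
    `speaker_ids` is (n_frames,), one speaker label per frame.
    """
    features = np.asarray(features, dtype=np.float64)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(features)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mean = features[mask].mean(axis=0)
        std = features[mask].std(axis=0)
        std = np.where(std > 0, std, 1.0)  # avoid division by zero
        out[mask] = (features[mask] - mean) / std
    return out


def add_look_ahead(features, n_future=1):
    """Append the next `n_future` frames to each frame's feature vector.

    Assumption: `features` holds one utterance, and the last frame is
    repeated as padding so every frame has `n_future` successors.
    """
    features = np.asarray(features)
    n_frames = features.shape[0]
    padding = np.repeat(features[-1:], n_future, axis=0)
    padded = np.concatenate([features, padding], axis=0)
    shifted = [padded[k:k + n_frames] for k in range(n_future + 1)]
    return np.concatenate(shifted, axis=1)
```

With `n_future=1`, a matrix of shape `(n_frames, n_dims)` becomes `(n_frames, 2 * n_dims)`, so the conditioning input at each step also carries the acoustic features of the following frame.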


DOI: 10.21437/IberSPEECH.2018-7

Cite as: Barbany, O., Bonafonte, A., Pascual, S. (2018) Multi-Speaker Neural Vocoder. Proc. IberSPEECH 2018, 30-34, DOI: 10.21437/IberSPEECH.2018-7.


@inproceedings{Barbany2018,
  author={Oriol Barbany and Antonio Bonafonte and Santiago Pascual},
  title={{Multi-Speaker Neural Vocoder}},
  year=2018,
  booktitle={Proc. IberSPEECH 2018},
  pages={30--34},
  doi={10.21437/IberSPEECH.2018-7},
  url={http://dx.doi.org/10.21437/IberSPEECH.2018-7}
}