A New Glottal Neural Vocoder for Speech Synthesis

Yang Cui, Xi Wang, Lei He, Frank K. Soong

Direct modeling of waveform generation for speech synthesis, e.g. WaveNet, has made significant progress in improving the naturalness and clarity of TTS. Such deep neural network-based models can generate highly realistic speech, but at high computational and memory cost. We propose a novel neural glottal vocoder that aims to bridge the gap between the traditional parametric vocoder and end-to-end speech sample generation. In analysis, speech signals are decomposed into glottal source signals and the corresponding vocal tract filters by glottal inverse filtering. Glottal pulses are parameterized into energy, DCT coefficients (shape) and phase. The phase trajectory of successive glottal pulses is rendered with a trainable weighting matrix to keep it smooth and pitch-synchronous. We design a hybrid neural network, i.e., one with both feed-forward and recurrent layers, to reconstruct the glottal waveform, including the optimized weighting matrix. Speech is then synthesized by filtering the generated glottal waveform with the vocal tract filter. The new neural glottal vocoder can generate high-quality speech with efficient computation. Subjective tests show that it achieves a MOS of 4.12 and a 75% preference over the conventional glottal vocoder, with perceived quality comparable to WaveNet and to natural recordings in analysis-by-synthesis.
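The parameterization and the final filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy Hann-shaped pulse, the L2 definition of energy, the number of retained DCT coefficients and the one-pole filter are all assumptions made for illustration, and the phase parameter, the trainable weighting matrix and the neural network are left out entirely.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix; its transpose is its inverse."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * (t + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)
    return basis

def analyze_pulse(pulse, n_coeffs):
    """Split one pulse into an energy term and truncated DCT shape coefficients.

    Assumption: energy is the L2 norm of the pulse, which must be non-zero.
    """
    energy = float(np.sqrt(np.sum(pulse ** 2)))
    shape = pulse / energy
    coeffs = dct_matrix(len(pulse)) @ shape
    return energy, coeffs[:n_coeffs]

def synthesize_pulse(energy, coeffs, length):
    """Rebuild a pulse from its energy and (possibly truncated) DCT coefficients."""
    full = np.zeros(length)
    full[:len(coeffs)] = coeffs
    return energy * (dct_matrix(length).T @ full)

def all_pole_filter(excitation, a):
    """Apply y[n] = x[n] - sum_{k>=1} a[k] * y[n-k], with a[0] assumed to be 1."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

# Toy stand-in for one glottal pulse (hypothetical shape, not real speech data).
pulse = 0.3 * np.hanning(64)
energy, coeffs = analyze_pulse(pulse, n_coeffs=16)
rebuilt = synthesize_pulse(energy, coeffs, length=64)

# Toy stand-in for the vocal tract filter: a single stable pole.
speech_like = all_pole_filter(rebuilt, a=[1.0, -0.5])
print("max reconstruction error:", np.max(np.abs(rebuilt - pulse)))
```

Keeping only the leading DCT coefficients is what makes the shape representation compact; when all coefficients are kept, the reconstruction is exact up to floating-point error.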

DOI: 10.21437/Interspeech.2018-1757

Cite as: Cui, Y., Wang, X., He, L., Soong, F.K. (2018) A New Glottal Neural Vocoder for Speech Synthesis. Proc. Interspeech 2018, 2017-2021, DOI: 10.21437/Interspeech.2018-1757.

  author={Yang Cui and Xi Wang and Lei He and Frank K. Soong},
  title={A New Glottal Neural Vocoder for Speech Synthesis},
  booktitle={Proc. Interspeech 2018},